LLMs Generate, Don't Refactor
LLMs generate code. They don't refactor it. That's the problem—good software requires constant refactoring to limit the cost of change.
You ask an LLM to write ten tests. You get ten tests that work. Each one duplicates the same setup code. Six months later, changing the object structure means updating ten files instead of one helper method. This is how you get 28.7% test duplication—not from mistakes, but from working code that never got refactored.
This came from a Python refactoring session where LLM-generated code hit maintenance problems. The tests passed, the features shipped, but the codebase became expensive to maintain. The problem isn’t that LLMs write bad code—they write working code. The problem is that working code needs continuous refactoring to stay maintainable, and LLMs don’t do that.
LLMs Generate, They Don’t Refactor
LLMs operate as copy-paste platforms. You ask for a test, you get a test. You ask for ten tests, you get ten tests. Each one works independently. Each one is self-contained. Each one duplicates setup code, object creation, and teardown logic.
This is fine for test 1. It’s still fine for test 2. By test 10, you have 28.7% duplication, and changing the object structure means updating ten test files.
A human writes test 1, then test 2, notices the duplication, and extracts a helper method before writing test 3. This isn’t perfectionism—it’s limiting the cost of future changes. Change the object structure once in the helper, not ten times across tests.
LLMs don’t do this. They generate all ten tests in one pass without reflection, without noticing that tests 3-10 look suspiciously like tests 1-2, without extracting patterns—just generation.
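The extraction step is mechanical once you see it. A minimal sketch, with a hypothetical `Widget` class and `make_widget` factory (not from the original codebase):

```python
# Hypothetical sketch: shared setup extracted into one factory helper.
# If Widget's constructor changes, only make_widget needs updating,
# not every test that builds a Widget.

class Widget:
    def __init__(self, name, size):
        self.name = name
        self.size = size

def make_widget(name="default", size=10):
    # Single point of change for object construction.
    return Widget(name=name, size=size)

def test_widget_keeps_its_name():
    assert make_widget(name="gear").name == "gear"

def test_widget_keeps_its_size():
    assert make_widget(size=3).size == 3
```

When `Widget` grows a new required field, one default added to `make_widget` fixes every test at once.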
Good software requires constant refactoring to limit the cost of change. You write concrete code, notice patterns, extract abstractions, then use those abstractions. This cycle—concrete, notice, abstract, use—repeats continuously as the codebase evolves.
LLMs skip the middle steps. They generate concrete code and stop—there’s no step where they notice patterns or extract abstractions. The pattern never emerges because no one is looking for it.
Abstractions Emerge, They’re Not Predicted
You don’t know upfront which abstractions you’ll need. You discover them by writing concrete code and noticing repetition. The first time you write something, you make it work. The second time, you notice similarity. The third time, you extract the pattern.
This is YAGNI (You Aren’t Gonna Need It) applied correctly: don’t abstract until the pattern reveals itself through actual usage.
LLMs can’t do this because they generate code in one shot. They don’t see the progression from concrete to abstract. They don’t notice that three similar pieces of code should become one parameterized function.
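The progression is easiest to see in miniature (hypothetical formatting helpers, not code from the session):

```python
# First and second time: concrete functions, written one at a time.
def format_usd(amount):
    return f"$ {amount:.2f}"

def format_eur(amount):
    return f"€ {amount:.2f}"

# By the third repetition the pattern is undeniable:
# one parameterized function replaces the copies.
def format_money(amount, symbol):
    return f"{symbol} {amount:.2f}"

assert format_money(3.5, "$") == format_usd(3.5)
assert format_money(3.5, "€") == format_eur(3.5)
```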
Example from the refactoring: polymorphic behavior tests. Two test methods, identical structure, different inputs:
```python
def test_persistence_stores_type_a_correctly(self):
    item = create_type_a()
    repository.save(item)
    retrieved = repository.get(item.id)
    assert isinstance(retrieved, TypeA)
    assert retrieved.category == Category.TYPE_A

def test_persistence_stores_type_b_correctly(self):
    item = create_type_b()
    repository.save(item)
    retrieved = repository.get(item.id)
    assert isinstance(retrieved, TypeB)
    assert retrieved.category == Category.TYPE_B
```
These aren’t two behaviors. This is one behavior—polymorphic persistence—with multiple test cases.
The refactored version:
```python
@dataclass
class PersistenceTestCase:
    name: str
    item: DomainObject
    expected_type: type
    expected_category: Category

def test_persistence_stores_all_types_correctly(self):
    cases = [
        PersistenceTestCase("type_a", create_type_a(), TypeA, Category.TYPE_A),
        PersistenceTestCase("type_b", create_type_b(), TypeB, Category.TYPE_B),
    ]
    for case in cases:
        with self.subTest(case=case.name):
            repository.save(case.item)
            retrieved = repository.get(case.item.id)
            assert isinstance(retrieved, case.expected_type)
            assert retrieved.category == case.expected_category
```
One test method. Multiple test cases. The structure makes explicit what’s being tested: polymorphic behavior across types.
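In a pytest-style suite, the same table-driven consolidation is usually expressed with `pytest.mark.parametrize`. A sketch, assuming stand-in domain objects and an in-memory repository rather than the original code:

```python
import pytest
from dataclasses import dataclass
from enum import Enum

class Category(Enum):
    TYPE_A = "type_a"
    TYPE_B = "type_b"

@dataclass
class TypeA:
    id: int = 1
    category: Category = Category.TYPE_A

@dataclass
class TypeB:
    id: int = 2
    category: Category = Category.TYPE_B

class InMemoryRepository:
    """Stand-in repository keyed by item id."""
    def __init__(self):
        self._items = {}
    def save(self, item):
        self._items[item.id] = item
    def get(self, item_id):
        return self._items[item_id]

repository = InMemoryRepository()

# Each tuple is one test case; pytest reports them individually.
@pytest.mark.parametrize(
    "item, expected_type, expected_category",
    [
        (TypeA(), TypeA, Category.TYPE_A),
        (TypeB(), TypeB, Category.TYPE_B),
    ],
    ids=["type_a", "type_b"],
)
def test_persistence_stores_all_types(item, expected_type, expected_category):
    repository.save(item)
    retrieved = repository.get(item.id)
    assert isinstance(retrieved, expected_type)
    assert retrieved.category == expected_category
```

Either form works; the point is the shape, not the framework: one behavior, one test, a table of cases.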
The LLM didn’t recognize these as instances of the same behavior. It generated two complete test methods because that’s what you do when someone asks for “test type A persistence” and “test type B persistence”—two requests means two tests, with no pattern recognition in between.
A human writes the first test, starts writing the second, realizes they’re identical except for inputs, and refactors before finishing. The LLM finishes both, ships them, and moves to the next task.
The Optional Behavior Problem
The second testing issue is subtler but more damaging to maintainability: testing optional behavior without a baseline.
A cursor filter was added to a query executor. Queries could now optionally filter results based on a cursor. The LLM generated tests for the cursor filter:
```python
def test_executor_with_cursor_returns_filtered_results(self):
    items = create_items(5)
    cursor = encode_cursor(items[1])
    results = executor.query(cursor=cursor)
    assert len(results) == 3
```
This test passes. It verifies the cursor filter works. But it doesn’t show what the cursor filter does.
You get 3 results with a cursor. How many do you get without a cursor? What did the filter actually filter out? Without the baseline, the test verifies behavior but doesn’t explain it.
The fix requires two tests: baseline and feature.
```python
def test_executor_without_cursor_returns_all_items(self):
    """Baseline: no filtering."""
    create_items(5)
    results = executor.query(cursor=None)
    assert len(results) == 5

def test_executor_with_cursor_returns_items_after_cursor(self):
    """Filter excludes items at or before cursor."""
    items = create_items(5)
    cursor = encode_cursor(items[1])
    results = executor.query(cursor=cursor)
    assert len(results) == 3  # items[2], items[3], items[4]
```
Now it’s explicit:
- Without cursor: 5 results
- With cursor at position 1: 3 results
- Filter excluded: items 0 and 1
This isn’t about completeness for its own sake. It’s about documentation. Six months from now when you need to modify the cursor filter, these tests tell you what it currently does. The baseline test documents the default behavior. The feature test documents what changes. The contrast between them documents the delta.
LLMs generate tests that verify features work. They don’t generate tests that explain how features work. The difference is empathy—understanding what will confuse a future reader and addressing it proactively.
For optional behavior (filters, flags, configuration), you need both states tested explicitly: on and off, with and without, enabled and disabled. Testing only the “on” state verifies the feature but doesn’t demonstrate its purpose.
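In miniature, with a hypothetical `filter_items` helper standing in for the cursor filter:

```python
def filter_items(items, min_size=None):
    """Hypothetical optional filter: None means the filter is off."""
    if min_size is None:
        return list(items)  # off: baseline behavior, nothing removed
    return [i for i in items if i >= min_size]  # on: filter applies

def test_filter_off_returns_everything():
    # Baseline documents the default: all 3 items survive.
    assert filter_items([1, 2, 3]) == [1, 2, 3]

def test_filter_on_drops_small_items():
    # Feature test documents the delta: item 1 is excluded.
    assert filter_items([1, 2, 3], min_size=2) == [2, 3]
```

Reading the pair tells a future maintainer exactly what turning the filter on changes.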
Why This Breaks Down
The cost of not refactoring compounds. The first time you change that domain object structure and update ten test files, it’s annoying. The tenth time, it’s a multi-hour refactoring project that you keep deferring because it’s too expensive.
Tests that don’t explain behavior fail you when you need to modify that behavior. You read the test; it shows “with cursor, get 3 results,” and you have no idea whether that’s because the cursor filtered out 2 items or because the dataset only had 3 items after the cursor. You either guess (risky) or dig through the implementation to reverse-engineer what the test should have told you (expensive).
This is how LLM-generated code creates technical debt that looks like feature velocity. Features ship fast because the LLM generates working code quickly. Then you slow down because every change fights duplication, unclear tests, and missing documentation. The codebase works but resists change.
What LLMs Can’t Do
LLMs can’t notice patterns across code they’ve already generated. They generate test 1, then generate test 2, but they don’t look back at test 1 while generating test 2 to notice they’re identical.
LLMs can’t refactor toward abstractions. They generate concrete solutions. Abstractions require looking at multiple concrete solutions, identifying the commonality, and extracting it. That’s a multi-step process that doesn’t happen in a single generation pass.
LLMs don’t optimize for future readers. They optimize for “does this work right now?” Tests that verify behavior but don’t explain it work perfectly well at generation time. They fail months later when someone needs to understand what the code does.
This isn’t a capability gap that’s about to close. These are structural limitations of how LLMs operate: they generate text that matches patterns, they don’t reflect on what they generated, and they don’t iteratively improve it. The model isn’t running in a loop where it generates code, reviews it, notices duplication, and refactors. It generates code once and stops.
What Changes
The workflow that emerged from this refactoring:
- LLM generates working code (fast, consistent)
- Duplication metrics flag problems (automated tools)
- Human identifies structural issues (architectural review)
- Human provides refactoring direction (extract helpers, parameterize tests, add baseline)
- LLM applies the pattern across the codebase (mechanical work)
- Repeat
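The duplication-metrics step can be as simple as a similarity check. A minimal sketch using `difflib` from the standard library (a real project would reach for pylint’s `duplicate-code` checker or a dedicated clone detector):

```python
import difflib

def duplication_ratio(a: str, b: str) -> float:
    """Similarity of two code snippets, from 0.0 (disjoint) to 1.0 (identical)."""
    return difflib.SequenceMatcher(None, a, b).ratio()

# Bodies of the two persistence tests, differing only in the type built.
test_a = "item = create_type_a()\nrepository.save(item)\nretrieved = repository.get(item.id)\n"
test_b = "item = create_type_b()\nrepository.save(item)\nretrieved = repository.get(item.id)\n"

ratio = duplication_ratio(test_a, test_b)
assert ratio > 0.9  # near-duplicates: flag for the human to review
```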
This works because it uses LLMs for what they’re good at—generating consistent code from clear specifications—while keeping humans responsible for recognizing patterns and making architectural decisions.
The human role shifts from writing code to maintaining code quality through:
- Recognizing when concrete code should become abstract
- Identifying which tests are really test cases for the same behavior
- Ensuring tests document, not just verify
- Limiting the cost of future changes through continuous refactoring
LLMs amplify both good patterns and bad patterns. If your development process includes continuous refactoring, you can use LLMs for initial generation then refactor the output. If your process is “write code, ship it, move to the next feature,” LLMs will generate working code that becomes expensive to maintain.
The counterintuitive result: LLMs don’t reduce the need for software engineering discipline. They increase it. You need engineers who recognize duplication, extract abstractions, and design test strategies that serve future readers. LLMs make the initial generation faster, but they don’t make the ongoing maintenance easier.
Good software evolves through continuous small refactorings that limit the cost of change. LLMs don’t refactor, and that’s not a bug to be fixed in the next version—it’s a fundamental consequence of how they work as copy-paste platforms rather than iterative development tools.
The tests passed and the features shipped, but the codebase became more expensive to change over time. Working code that resists modification isn’t the goal. The goal is code that stays cheap to modify as requirements evolve.