Chapter 10: Safe Legacy Code Refactoring — Horror Stories and the Discipline That Prevents Them

This is Part 10 of a series walking through the book Master Claude Chat, Cowork and Code — From Prompting to Operational AI. In the previous chapter, we covered Claude Code fundamentals — the CLI architecture, multi-file refactoring, Git worktrees, and permission management. Now comes the chapter I think every developer using AI tools needs to read: what happens when AI refactoring goes wrong, and the disciplined methodology that prevents it.


The Horror Stories You Need to Hear

Chapter 10 opens with something unusual for a technical book: war stories. Not hypothetical risks, but the kinds of bugs that actually happen when AI refactors code without sufficient guardrails. These stories share a common pattern — the refactored code is more elegant, more efficient, or cleaner. And the bugs are subtle enough to pass every happy-path test.

The Silent Authorization Bypass: A developer asked an AI to refactor authentication middleware. The AI consolidated the validation logic, removing a "redundant" permission check. The check was indeed redundant under normal circumstances, but it served as a safety backstop. When a race condition in a different service broke the primary check, there was no fallback, and the result was a production security breach.

The Race Condition Amplifier: An AI refactoring changed how a transaction was managed in Node.js, widening a race condition window from microseconds to milliseconds. Every test passed because the test suite runs serially; the bug only manifested under load, with concurrent users.

The Implicit Contract Breaker: A module returned null for "not found" and undefined for "error retrieving", an undocumented distinction. The AI refactoring consolidated both cases to return null. The consuming code depended on the distinction, and the bug surfaced as occasional, unexplained error-handling failures in dependent modules.
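To make the contract breaker concrete, here is a minimal sketch of the pattern, not the book's actual code. The names (findUserLegacy, findUserRefactored, store) are illustrative:

```javascript
// Legacy behavior: null means "not found", undefined means "error retrieving".
// Neither meaning is documented anywhere.
function findUserLegacy(store, id) {
  try {
    const user = store.get(id);               // may throw if the store is broken
    return user === undefined ? null : user;  // not found -> null
  } catch (err) {
    return undefined;                         // retrieval error -> undefined
  }
}

// The "cleaner" refactor collapses both cases to null, silently breaking
// any caller that distinguishes `=== null` from `=== undefined`.
function findUserRefactored(store, id) {
  try {
    return store.get(id) ?? null;
  } catch (err) {
    return null;
  }
}
```

A caller that retries on undefined ("transient error") but gives up on null ("user really doesn't exist") behaves very differently against the two versions, which is exactly the kind of change no happy-path test will catch.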

The solution, the book argues, is not to avoid AI refactoring — it's to use disciplined practices that catch these bugs before they reach production.


Characterization Tests: The Safety Net Before You Refactor

This is the core methodology of Chapter 10, and it changed how I think about AI-assisted refactoring. Before touching any legacy code, you generate characterization tests — tests that don't verify the code is correct, but that capture its actual behavior, including edge cases, error conditions, and implementation quirks.

The workflow is elegant: use Claude Code to analyze existing code and generate comprehensive test coverage. Run those tests against the current code (they should all pass). Perform the refactoring. Run the tests again. If any test fails, you've caught a behavioral change — which is precisely what you're protecting against.

The book walks through a concrete example with a legacy payment processor. The code has several implicit behaviors: invalid amounts return null, failed card charges return undefined, successful charges return an object with a specific shape, exceptions return a different object shape. None of this is documented.

Key Idea from the Book: You invoke Claude Code to generate the characterization tests: claude-code "Generate comprehensive test coverage for src/payment/processor.js that characterizes all current behavior including edge cases, error conditions, and return value shapes". Claude generates tests for zero amounts, negative amounts, null amounts, successful payments, failed charges, and exceptions — each verifying the exact current return value.

I will not reproduce the full test suite from the book, but it covers every implicit behavior in the payment processor — the kind of behaviors that AI refactoring is most likely to break. With these tests in place, any behavioral change during refactoring is immediately caught.
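To show the shape of the idea without reproducing the book's suite, here is a condensed, hypothetical version of such a processor and the kind of assertions a characterization test pins down. Everything below (processPayment, the charge callbacks) is illustrative, not the book's src/payment/processor.js:

```javascript
// Condensed stand-in for a legacy payment processor with the implicit
// contract described above: invalid input -> null, failed charge -> undefined,
// success -> one object shape, exception -> a different object shape.
function processPayment(amount, charge) {
  if (typeof amount !== "number" || amount <= 0) return null; // invalid -> null
  try {
    const receipt = charge(amount);
    if (!receipt) return undefined;                // failed charge -> undefined
    return { ok: true, amount, receipt };          // success shape
  } catch (err) {
    return { ok: false, error: err.message };      // exception shape (different!)
  }
}

// Test doubles for the three charge outcomes.
const alwaysCharge = (amt) => `rcpt-${amt}`;
const neverCharge = () => null;
const explodingCharge = () => { throw new Error("gateway timeout"); };
```

The characterization tests then assert the exact current return values, including the quirky ones: they record what the code does today, not what it should do, so any refactoring that changes a return shape fails immediately.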


Incremental Refactoring: Small Steps, Continuous Verification

Rather than asking Claude Code to refactor an entire legacy module at once, Chapter 10 advocates an incremental approach. Break the refactoring into smaller, verifiable steps. After each step, run your characterization tests. This makes it trivial to identify which change caused an issue.

The book demonstrates this with the payment processor, breaking it into three focused steps:

Step 1: claude-code "Refactor input validation in src/payment/processor.js to a separate function, improving clarity without changing behavior"

Step 2: claude-code "Create a separate error handling function to consolidate error cases, but maintain exact behavior"

Step 3: claude-code "Move logging calls to a separate transaction logging module, maintaining exact behavior"

Each step is small enough to review carefully. If any step breaks a test, the commit history shows exactly which refactoring caused it. The accumulation of small verified steps eventually achieves the full refactoring without the risk of large-scale changes.
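As a sketch of what Step 1 might produce, here is a hypothetical before/after for the validation extraction. The function names are mine, not the book's, and the processor body is condensed to the part that matters:

```javascript
// Before (inline validation, condensed):
// function processPayment(amount) {
//   if (typeof amount !== "number" || amount <= 0) return null;
//   ...
// }

// After Step 1: the check lives in a named function; the legacy
// contract (invalid input -> null) is preserved exactly.
function isValidAmount(amount) {
  return typeof amount === "number" && amount > 0;
}

function processPayment(amount) {
  if (!isValidAmount(amount)) return null;  // same legacy return value
  return { ok: true, amount };              // rest of the processor, condensed
}
```

Because the change is purely structural, every characterization test written against the old inline version should still pass unmodified.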

Key Idea from the Book: Between each step, run your full test suite: npm test && npm run lint && npm run type-check. The discipline of verifying after every small change is what makes AI-assisted refactoring safe. Skip this step and you're gambling.

Reviewing AI-Generated Pull Requests

When Claude Code generates changes through CI/CD, those changes appear as pull requests. The book makes a crucial point: reviewing AI-generated code requires different skills than reviewing human-written code. You're not looking for style preferences. You're looking for subtle behavioral changes, security issues, and architectural violations.

The systematic review checklist covers seven areas: scope (did Claude only modify what was requested?), new dependencies, error handling preservation, test coverage, removed code, external service calls, and concurrency implications.

The book provides a particularly sharp example of a red flag in an AI-generated PR. The AI replaced a sequential for loop with Promise.all — looks like a performance optimization, but it changes error handling. The original continued fetching even if one failed. The refactored version stops on the first error. Subtle, dangerous, and exactly the kind of thing that passes tests but breaks in production.
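A minimal sketch of that red flag, with illustrative names rather than the book's code:

```javascript
// Original: sequential loop that tolerates individual failures.
// One bad fetch does not stop the rest.
async function fetchAllSequential(ids, fetchOne) {
  const results = [];
  for (const id of ids) {
    try {
      results.push(await fetchOne(id));
    } catch (err) {
      results.push(null);  // record the failure and keep going
    }
  }
  return results;
}

// "Optimized" refactor: Promise.all rejects on the FIRST failure,
// discarding every other result. Faster, but a behavioral change.
async function fetchAllParallel(ids, fetchOne) {
  return Promise.all(ids.map(fetchOne));
}
```

Against a fetcher that fails for one id, the sequential version returns partial results while the parallel version rejects outright, which is the difference a reviewer has to catch.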

Important from the Book: When reviewing AI-generated pull requests, ask yourself: "What assumptions did Claude make about the code's purpose and behavior?" If you find assumptions that aren't documented in the code, that's a sign of risk. Request clarification or additional tests.

Catching Hallucinations and Security Flaws

The final section of Chapter 10 covers two categories of AI-generated bugs that are especially dangerous: hallucinations and security flaws.

Hallucinations in refactored code look plausible but reference API methods or library functions that don't exist — like userCache.getWithExpiry(userId, "5m") when your caching library has no such method. The code compiles, looks correct, and only fails at runtime. The book provides a verification checklist: every external API call is real, every library method exists, every built-in function works as assumed.
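One cheap way to see why this bites only at runtime: a plain Map (standing in for many cache libraries) simply has no getWithExpiry method, and JavaScript won't complain until the line executes. The helper below is my own illustration of a quick existence check, not anything from the book:

```javascript
// Stand-in cache: a plain Map. It has get/set, but no getWithExpiry.
const userCache = new Map();
userCache.set("u1", { name: "Ada" });

// Review aid: verify a method actually exists before trusting refactored
// code that calls it. Catches hallucinated APIs before production does.
function methodExists(obj, name) {
  return typeof obj?.[name] === "function";
}
```

Calling `userCache.getWithExpiry("u1", "5m")` would throw a TypeError at runtime; the existence check surfaces the problem during review instead.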

Security flaws in AI-generated code fall into predictable categories: SQL injection (string concatenation instead of parameterized queries), authentication bypass (security checks accidentally removed during consolidation), secrets in code (hardcoded API keys), and input validation removal (skipped validation on certain code paths during refactoring).
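For the SQL injection category, the check/fail contrast looks roughly like this. The `db.query(sql, params)` signature below is a hypothetical client following the common placeholder convention, not a specific library:

```javascript
// FAIL: string concatenation. Attacker-controlled `email` is parsed as SQL.
function findUserUnsafe(db, email) {
  return db.query(`SELECT * FROM users WHERE email = '${email}'`);
}

// CHECK: placeholder plus bound parameter. The value travels as data
// and is never interpreted as SQL, whatever it contains.
function findUserSafe(db, email) {
  return db.query("SELECT * FROM users WHERE email = ?", [email]);
}
```

During review, any query built by template-string or string concatenation from a variable is a red flag, even if the AI's refactoring only "moved" existing code.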

Key Idea from the Book: The book provides a systematic security review pattern with check/fail examples for parameterized queries, authentication guards, secret management, and input validation. Apply these checks to every AI-generated change — the effort is worth it because security issues in refactored code are the most subtle and dangerous.

What Chapter 10 Sets Up

This chapter provides the safety methodology. You now have a disciplined workflow: characterization tests first, incremental refactoring with continuous verification, systematic PR review, and security/hallucination checks. This is what makes the power of Claude Code trustworthy in production environments.

Chapter 11: CI/CD Integration and Automation takes this methodology and scales it. GitHub Actions workflows that trigger Claude Code refactoring automatically, GitLab CI/CD equivalents, automated characterization test generation on every PR, and the pipeline architecture that lets your team benefit from AI-assisted code improvements continuously — with human review gates at every step.


Refactor with confidence. Chapter 10 includes the complete horror stories, the full characterization test methodology with payment processor example, the incremental refactoring workflow, the PR review checklist, and the security verification patterns. Get your copy of Master Claude Chat, Cowork and Code and learn to refactor legacy code without fear.
2026-03-11

Sho Shimoda

I share and organize what I’ve learned and experienced.