The Engineering of Intent, Chapter 6: Autonomous Orchestration Frameworks

This is Part 6 of a series walking through my book The Engineering of Intent. In the previous chapter, we looked at editors — one agent at a time, with you in the loop. This chapter is about the next tier up: orchestration frameworks that run many agents, each with a specialized role, and the tasks where that scale actually pays off.


When One Agent Isn’t Enough

Editors are excellent for human-in-the-loop work. But some work benefits from less human in the loop — long-running refactors, migrations across hundreds of files, exploratory experiments where the engineer would just be clicking accept over and over. For these, orchestration frameworks take over. Chapter 6 covers the category: Roo Code, Cline, Kilo Code, and the patterns they all share.


Task-Specific Personalities

Not every task deserves the same agent. A code reviewer and a code writer have different values. The reviewer should be skeptical, pedantic, slow to praise. The writer should be ambitious, optimistic, willing to break things. Running both with the same system prompt produces mediocre versions of each.

Modern orchestration frameworks let you define multiple personalities — modes or roles — each with its own prompt, its own tool allowlist, and its own rules of engagement. The reviewer has read-only access to the code and write access only to comments. The writer has broad write access but cannot merge. The architect can modify the context documents but not the code.
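The role decomposition above can be sketched as explicit configuration. This is a minimal illustration, not any particular framework's API — the `Role` class, tool names, and prompts are all hypothetical stand-ins for whatever your framework calls modes or roles:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Role:
    """One agent personality: its own prompt and its own tool allowlist."""
    name: str
    system_prompt: str
    allowed_tools: frozenset[str]

    def can_use(self, tool: str) -> bool:
        return tool in self.allowed_tools

# The three roles from the text, as explicit configs (tool names are illustrative).
WRITER = Role(
    name="writer",
    system_prompt="Be ambitious and optimistic. Propose bold changes.",
    allowed_tools=frozenset({"read_code", "write_code", "run_tests"}),  # no "merge"
)
REVIEWER = Role(
    name="reviewer",
    system_prompt="Be skeptical and pedantic. Slow to praise.",
    allowed_tools=frozenset({"read_code", "write_comments"}),  # read-only on code
)
ARCHITECT = Role(
    name="architect",
    system_prompt="Maintain the context documents. Do not edit code.",
    allowed_tools=frozenset({"read_code", "write_context_docs"}),
)
```

The point of making the allowlists data rather than prose is that the orchestrator can enforce them mechanically: a reviewer that tries to call `write_code` is rejected before the model's output ever touches the repository.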

💡 Key idea: The composition of personalities is itself an engineering decision. Teams that do it well get measurably better outcomes than teams that run one monolithic agent. And the design doesn’t stop at the personalities — the interaction protocol between them matters just as much. When does the writer hand off to the reviewer? What does the hand-off look like? Who wins when they disagree? A team of agents needs governance no less than a team of humans.

Memory Banks: Context That Survives the Window

An agent’s context window is finite. For a task that spans days or weeks — a migration, a large refactor, a gradual feature rollout — the window cannot hold the whole history. Memory banks solve this: a persistent, structured summary of the agent’s learnings, read at the start of each session and updated at the end.

The structure mirrors a good codebase’s documentation — architecture, decisions made, open questions, failed experiments, known quirks. Memory banks raise one subtle issue the chapter treats carefully: provenance. When the agent claims “we decided last Tuesday to use Postgres rather than MySQL,” is that grounded in an actual decision record, or did the agent hallucinate it from an adjacent discussion? Good memory banks include pointers to source documents. Great ones version those pointers so stale references can be detected.
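The provenance pattern — pointers to source documents, versioned so stale references can be detected — can be sketched with a content hash. Everything here is an assumption for illustration: the entry schema, the file path, and the date are hypothetical, not a real framework's memory-bank format:

```python
import hashlib
from datetime import date

def pointer(path: str, content: str) -> dict:
    """A provenance pointer: the source document plus a hash of its content,
    so a later session can detect that the reference has gone stale."""
    return {"path": path, "sha256": hashlib.sha256(content.encode()).hexdigest()}

# Hypothetical decision record the memory-bank entry is grounded in.
decision_doc = "ADR-014: Postgres over MySQL, for row-level security."

entry = {
    "kind": "decision",
    "summary": "Use Postgres rather than MySQL.",
    "recorded": str(date(2026, 3, 17)),  # illustrative date
    "source": pointer("docs/adr/014-postgres.md", decision_doc),
}

def is_stale(entry: dict, current_content: str) -> bool:
    """True when the source document has changed since the entry was written."""
    current = pointer(entry["source"]["path"], current_content)
    return entry["source"]["sha256"] != current["sha256"]
```

At session start, the agent re-hashes each pointed-to document; any mismatch flags a memory entry that must be re-verified rather than trusted — which is exactly the hallucinated-Tuesday-decision failure mode the chapter warns about.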


Orchestration Amplifies Clarity. Only Apply It Where Clarity Already Exists.

The highest-value use I’ve seen is mechanical, high-volume refactoring. A team I worked with migrated 1,200 React class components to function components with hooks over a weekend. The orchestrator chunked the work by file, executed per-file refactors in parallel, ran the test suite after each chunk, and reverted any chunk that broke tests. Human involvement was defining the transformation rules, spot-checking ten random chunks, and approving the final merge.
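That chunk-transform-test-revert loop can be sketched in a few lines. This is a minimal skeleton under stated assumptions — `transform`, `tests_pass`, and `revert` are placeholders for the real per-file refactor, test runner, and rollback; chunks run sequentially so each one is test-gated, with the per-file refactors inside a chunk parallelized:

```python
from concurrent.futures import ThreadPoolExecutor

def orchestrate(files, chunk_size, transform, tests_pass, revert, workers=8):
    """Chunk the work by file, refactor each chunk in parallel,
    run the suite after each chunk, and revert any chunk that breaks it."""
    results = []
    for i in range(0, len(files), chunk_size):
        chunk = files[i:i + chunk_size]
        # Per-file refactors within the chunk run in parallel.
        with ThreadPoolExecutor(max_workers=workers) as pool:
            list(pool.map(transform, chunk))
        if tests_pass():
            results.append(("kept", chunk))
        else:
            revert(chunk)  # roll back the whole chunk, not individual files
            results.append(("reverted", chunk))
    return results
```

The human contribution lives entirely outside this loop: writing the transformation rules that `transform` applies, spot-checking a sample of kept chunks, and approving the final merge.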

“Orchestration is seductive. It feels like scale. But unmoderated orchestration is the fastest way to produce the 40,000-line weekend — with the delusion of structure. Do not orchestrate exploratory work. Do not orchestrate where the specification is still forming. Do not orchestrate where you cannot articulate, in advance, how you will verify the result.”

My heuristic for when a task is orchestratable: if you can write the transformation as a set of pattern-match rules, it is mechanical; if you have to say “use your best judgment,” it is not. Orchestration is terrible for creative work. The kind of thinking that designs a new API, reconceives a UI flow, or invents a business mechanic cannot be decomposed into mechanical steps. An orchestrator will produce a mediocre solution by committee. Creative work benefits from the friction of a human pause. Do not remove friction where friction is the point.


The Test-Writing Case Study

One of the chapter’s signature stories: a mid-sized company had 20% coverage across 80,000 lines and wanted to get to 70%. They ran an orchestration experiment. The orchestrator walked the codebase, spawning sub-agents to generate tests per module; a second agent reviewed the tests for quality, kicking back any that merely echoed the implementation rather than probing it; humans sampled 10% each week and periodically tuned the prompts.

Eight weeks in: 14,000 tests generated, 9,400 accepted. Coverage moved from 20% to 63%. Cost: two engineers tuning prompts, plus roughly $15K of inference. The equivalent human effort would have been ten engineers for six months.

The critical detail is where they stopped. They halted at 63% because beyond that point the tests the orchestrator wrote became progressively less useful. The mechanical frontier had been reached. The remaining 7% required creative design of test scenarios the orchestrator could not supply. Knowing where the mechanical frontier lies is part of the skill.


The Economics You Can’t Ignore

Running a multi-agent pipeline is not free. Each additional agent invocation costs tokens. A reviewer agent that reads every diff doubles your inference spend. For small teams this is trivial; for larger teams it matters.

A practical optimization worth the price of the chapter: use smaller, cheaper models for routine tasks; reserve the flagship model for tasks that need it. Many orchestration frameworks support model routing per role. The reviewer that checks for obvious bugs can be a small fast model. The architect that makes high-stakes design decisions should be the largest model you can afford.
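A back-of-the-envelope sketch makes the routing payoff concrete. The model names and prices below are hypothetical placeholders, not real offerings — the arithmetic is the point: a reviewer that reads every diff sees roughly the writer's token volume, so its model choice sets the overhead:

```python
# Hypothetical prices in $ per million tokens; names are placeholders.
PRICE_PER_MTOK = {"flagship": 15.00, "small": 1.50}

def pipeline_cost(tokens: int, writer_model: str, reviewer_model: str) -> float:
    """Total spend when the reviewer re-reads everything the writer produces."""
    return tokens / 1e6 * (PRICE_PER_MTOK[writer_model] + PRICE_PER_MTOK[reviewer_model])

# Same flagship model in both roles: the reviewer doubles the spend.
naive = pipeline_cost(10_000_000, "flagship", "flagship")   # $300.00
# Route the reviewer to the small model: a 10% overhead instead of 100%.
routed = pipeline_cost(10_000_000, "flagship", "small")     # $165.00
```

The same logic extends to any number of roles: price each role's model by the stakes of its decisions, not by habit.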


Tips for Long Orchestrations

  • Check in early and often. Eight-hour unsupervised runs are unread novels. Structure checkpoints every 30–60 minutes.
  • Enforce a retry budget. Agents get stuck in loops. Cap retries per subtask; escalate to a human when the cap is hit.
  • Log everything at the orchestrator level. Forensic review is inevitable; plan for the capability from day one.
  • Budget the spend. Teams have lost tens of thousands of dollars to a single misconfigured loop. Set a hard cap; pause when you approach it.
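The retry-budget and spend-cap tips compose into one small guard object. A minimal sketch — the class name, thresholds, and escalation mechanism are all illustrative assumptions, not a real framework's API:

```python
class BudgetExceeded(Exception):
    """Raised when total spend reaches the hard cap: pause the whole run."""

class OrchestrationGuard:
    """Enforces a per-subtask retry cap and a hard spend cap (illustrative)."""

    def __init__(self, max_retries: int, spend_cap_usd: float):
        self.max_retries = max_retries
        self.spend_cap = spend_cap_usd
        self.retries: dict[str, int] = {}
        self.spent = 0.0
        self.escalated: list[str] = []  # subtasks surfaced to a human

    def record_attempt(self, subtask: str, cost_usd: float) -> bool:
        """Return False once the retry cap is hit (escalate to a human);
        raise once the spend cap is hit (pause everything)."""
        self.spent += cost_usd
        if self.spent >= self.spend_cap:
            raise BudgetExceeded(f"spend ${self.spent:.2f} hit cap ${self.spend_cap:.2f}")
        n = self.retries.get(subtask, 0) + 1
        self.retries[subtask] = n
        if n > self.max_retries:
            self.escalated.append(subtask)
            return False
        return True
```

Note the asymmetry: a stuck subtask is a local problem, so it escalates and the run continues; a blown budget is a global one, so the exception stops everything until a human looks.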
⚠ Warning: All orchestration pipelines eventually surface decisions to a human. The quality of those surfaces — the dashboards, the summary reports, the diff viewers — is as important as the orchestration logic itself. If your team cannot name the human who reviews orchestration outputs and cannot point to the surface they use, you do not have orchestration. You have a code-generation machine with no brake. Fix the surfaces first.

Next up — Chapter 7: The GenDD Execution Loop. With human, agent, and codebase infrastructure in place, Part III of the book introduces the methodology that ties them all together — Generative-Driven Development. Chapter 7 walks through the five-step loop that takes an intent from “I want to build X” to “it is in production and works.”


📖 Want the full picture?

The chapter covers the writer/reviewer/architect role decomposition, the memory-bank provenance pattern, the full test-writing orchestration (14K tests in eight weeks), the economics of multi-agent pipelines, model-routing strategies, the four tips for long-running orchestrations, and the dashboard/report designs that keep the human actually engaged.

Get The Engineering of Intent on Amazon →

2026-04-22

Sho Shimoda

I share and organize what I’ve learned and experienced.