Part six in a series on building claude-forge: a modular skills and orchestration system for Claude Code. This time, the system learns to supervise itself.

I went looking for token waste. I found something worse.

The Promise I Didn’t Keep

Post five ended with a number: 3,723 tokens, always loaded, zero evidence they’re the right ones. I said I’d report back when it changed.

It hasn’t changed yet. The harness-mechanic agent needs a few more weeks of trace data before it can propose optimizations with confidence. But while waiting for enough sessions to accumulate, I started reading the traces manually. Not for token efficiency. Just to understand what was happening.

What I found wasn’t a token problem. It was a behavior problem.

In one session, a software-engineer agent was asked to refactor a validation layer. The plan called for consolidating three scattered checks into a single module. The agent did consolidate them, but it also deleted 40 lines of edge-case handling that it apparently found too complex to integrate. The tests still passed because the agent had subtly weakened them: instead of asserting specific error messages, the new tests just checked that an error was thrown. The quality gates scored it 83. Technically committable.

In another session, a multi-subtask implementation drifted by the second subtask. The plan said “add rate limiting middleware, then add the rate limit headers to the response.” The agent implemented rate limiting (subtask 1), then in subtask 2 decided that rate limit headers were “redundant given the middleware already handles enforcement.” It skipped half the work and wrote a confident explanation for why it was unnecessary. Score: 87.

These aren’t hallucinations. These aren’t capability failures. These are agents making strategic decisions to avoid difficulty, and getting away with it because the harness wasn’t watching for this specific pattern.


Naming the Demons

A few days later, I came across an article by @systematicls titled “How To Solve Problems Of Long Running, Autonomous Agentic Engineering Workflows.” It’s a taxonomy of agent failure modes, and reading it felt like someone had watched my trace data and written field notes.

The article identifies eight ways agents can be, to use their word, stupid. Not all of them were new to me, but the framing was: these failures aren’t random. They’re predictable, recurring patterns that map to specific phases of an agentic workflow.

PhaseFailure ModeWhat It Looks Like
Pre-taskInsufficient contextActs on wrong assumptions before starting
PlanningIncomplete contextChooses wrong approach due to misalignment
PlanningShort-term thinkingQuick fix that creates tech debt
TaskContext anxietyRushes to finish as context grows
TaskPlanning deviationsDoes A’ instead of A
TaskComplexity fearWrites stubs, declares things out of scope
Post-taskVerification lazinessWrites weak tests, declares success
Post-taskEntropy maximizationChanges behavior, doesn’t update docs/tests

I mapped these against our existing harness. Five were already covered: requirements refinement handles pre-task context, the annotation cycle and /second-opinion cover planning quality, R1-R4 deviation rules catch plan deviations during implementation, TDD with verification tables and the test-design-reviewer address verification laziness.

Three were not covered at all.

Complexity fear was the validation refactor I’d seen in the traces: the agent deleted edge cases rather than integrate them. The article puts it well: “Agents have a deep fear of complexity. If you ask them to implement a 5-line function, no problem. If you make them believe they are going to have to write a 50,000-line class, they start to weasel their way out of it.”

Entropy maximization was the documentation drift I’d been noticing for weeks: agents change function signatures but don’t update the docstrings two files over. Change a CLI flag name but leave the README referencing the old one. Repeat this across dozens of sessions and the repository becomes a minefield of contradictions.

Mid-implementation drift was the rate-limiting session: subtask 1 was fine, subtask 2 built on a wrong assumption, and nobody checked between them. The article calls this “planning stickiness,” the risk where your agent deviates from the plan and does whatever it wants. I’d been catching this at review time (step 3 of the orchestrator), but by then the damage is done and the fix loop costs extra tokens.

The article also offered a meta-insight that stuck with me: “An interesting side effect of having studied productivity methods and being a practitioner of them is realizing how effective they are at managing agents as well.” Break complex work into small pieces. Verify early and often. Don’t let anyone (human or LLM) self-assess their own output. These aren’t AI-specific techniques. They’re management techniques, applied to silicon.


Engineering the Supervisor

Separately, I’d been reviewing an internal proposal for a “Judge Sub-Agent”: a dedicated, lightweight agent that would intercept the coding agent’s actions at defined checkpoints and verify them against a policy registry.

The architecture is elegant. The coding agent stays focused on the task. Policy knowledge (coding standards, security rules, architectural constraints) lives in the Judge’s context, not the coder’s. When the Judge catches a violation, it returns specific, actionable feedback: “This function uses bare except clauses. Per Python standards, catch specific exceptions.” The coding agent doesn’t need to know all the rules; it just receives targeted corrections.

The core insight, context offloading, is the right idea. But for claude-forge’s stack, the implementation doesn’t fit. Claude Code doesn’t have a middleware layer where you can intercept tool calls and route them to a separate agent before execution. You’d need a hook for every tool call, which is impractical. And running a synchronous blocking judge on every action would destroy latency.

But we already have the pattern, just at a coarser granularity. Our review agents (architecture, security, performance, test, dx, database, dependency) run with fresh context after implementation and report findings that feed into the fix loop. The coding agent doesn’t carry policy knowledge; the reviewers do. Context offloading, achieved through the orchestrator’s review step rather than through real-time interception.

What we were missing wasn’t the architecture. It was coverage at the right granularity: mid-implementation (between subtasks, not just post-implementation) and post-fix (checking blast radius, not just the changed files).

I ran a /second-opinion with Gemini on the proposed mechanisms before building them. Three corrections came back that changed the implementation in ways I wouldn’t have found on my own.

My original anti-stub rule was an absolute ban on TODO, pass, NotImplementedError, and placeholder comments. Gemini pointed out that this breaks interface-first development and TDD: sometimes you write a stub intentionally as part of the plan, then fill it in. The refined rule is “no unplanned stubs.” If the plan says “deferred” or “interface-only,” stubs are fine. If the plan says “implement X” and you write a stub, that’s a CRITICAL violation. Plan-aware, not absolute.

The blast radius check was originally unconditional, running after every fix loop. Gemini pushed back: “Running a fresh-context LLM check indiscriminately on every loop will be prohibitively expensive and slow.” Fair. The refinement: trigger only when changed files modify public APIs, when more than three files changed, or when schema changes occurred. Use a cheap grep pre-filter to find files that reference changed symbols, then send only the flagged files to a fresh-context agent. If grep finds zero related files outside the changed set, skip the agent entirely.

The third correction was the one that mattered most: “If the implementation agent checks its own drift, it will likely justify its deviations to itself. Context anxiety means the agent is already confused; asking it to self-correct in the same context window rarely works.” The drift check had to be an isolated agent with fresh context, receiving only the subtask description and the git diff. No history, no accumulated rationalization. An agent can’t grade its own homework.


The Four Mechanisms

Here’s what we shipped. All four are markdown additions to existing harness files: two new deviation rules in the software-engineer agent, two new orchestrator steps.

R5: No Unplanned Stubs

The rule is simple: if the plan says “implement X,” you implement X fully, production-ready. Introducing a stub without plan authorization is a CRITICAL violation (auto-fail, score = 0) because downstream code will be wired to a hollow implementation, and the stub will silently persist across sessions.

But there’s a subtler version of the same problem that the @systematicls article calls “entropy maximization” and that I think deserves its own name: conservation of complexity. Agents don’t just write stubs for new code. They delete existing code they find too complex to work with. The validation refactor I mentioned earlier is a perfect example: the agent didn’t write a stub, it removed 40 lines of edge-case handling and wrote weaker tests to cover its tracks.

The guard: if you delete more than 20% of a file’s lines or remove existing functions, you must document what was removed and why, prove no existing tests were deleted or weakened, and confirm no callers still reference the removed code (a grep check). This isn’t a ban on deletion. It’s a requirement for justification. The cost of writing three sentences is negligible; the cost of silently removing critical logic is not.

R6: Proportionality Guard

Before any destructive action (deleting files, overwriting modules, restructuring directories, dropping tables), the agent must verify four things: necessity (is this required by the plan?), scope (is the blast radius proportional to the task?), reversibility (is there a safer alternative, like rename instead of delete?), and justification (logged in the implementation report).

This sounds obvious, but the failure mode is real. An agent working on a “fix typo in README” task should never be restructuring the src/ directory. The problem isn’t malice; it’s that agents don’t naturally evaluate whether an action is proportional to the task that spawned it. They optimize locally (this restructure would make the code better) without checking globally (but the task was a typo fix).

Mid-Implementation Drift Check (Step 1b)

For multi-subtask implementations, after each subtask completes, the orchestrator spawns a lightweight judge. The judge receives only two things: the subtask description from the plan, and the git diff of changes made during the subtask. It answers one question: “Did we build exactly this, no more, no less?”

Three verdicts: aligned (proceed), minor drift (log a warning, proceed), significant drift (stop, correct before next subtask).

The important part: the judge is isolated. Fresh context, no history of the implementation session, no accumulated rationalization. Same principle as our existing review agents, extended to the gap between subtasks where cascading deviations used to slip through.

The cost is real: one additional agent invocation per subtask. For a three-subtask implementation, that’s three extra calls. But catching a drifted subtask 1 before subtask 2 builds on it saves the entire fix loop that would otherwise be needed at review time. It’s cheaper to check early than to fix late. (Anyone who’s worked in quality engineering will recognize this as the “cost of quality” curve, and it applies to LLM workflows exactly as it applies to manufacturing.)

Blast Radius Check (Step 5b)

After the fix loop and re-verification, before scoring, the orchestrator checks whether any changed files modified public APIs, whether more than three files were touched, or whether schema changes occurred. If none of these conditions are met, the step is skipped entirely.

When it triggers, it runs in two phases. First, a cheap grep for references to changed function names, class names, and endpoints across the repo. This produces a list of “related files”: importers, documentation that references changed APIs, tests that assert on changed behavior. Second, a fresh-context agent receives the changed files summary and the related file snippets (not full files) and checks for contradictions: documentation that describes old behavior, tests asserting on removed functionality, comments referencing deleted logic.

Findings are classified as MAJOR (functional contradiction, like a doc saying the function returns X when it now returns Y) or MINOR (stale comment), and they feed into the quality gate scoring.

The grep pre-filter is the pattern I keep coming back to: use deterministic, token-free tools to narrow the search space before spending tokens on LLM review. Linters before code review. grep before blast radius analysis. The cheapest check should always run first, and if it finds nothing, the expensive check doesn’t run at all.


The Cost of Supervision

None of this is free. I added four safety mechanisms to a harness that was already at 3,723 tokens always-on, and each one eats context that could otherwise go to the actual task.

The two new deviation rules (R5 and R6) add roughly 350 tokens to the software-engineer agent definition. On-demand file, loaded only when the orchestrator spawns an implementation agent, so the always-on baseline stays clean.

The drift check adds one lightweight agent call per sequential subtask. The agent’s context is minimal (subtask description + git diff), so each call is cheap: maybe 2,000-4,000 input tokens. For a typical three-subtask implementation, that’s 6,000-12,000 tokens of supervision. The question is whether this costs less than the alternative: a full fix loop when a drifted subtask is caught at review time, which can easily consume 20,000-40,000 tokens of re-implementation and re-review.

The blast radius check, when it triggers, costs a grep operation (free) plus one agent call with scoped context. In practice, the grep pre-filter will reduce or eliminate the agent cost for most sessions. Public API changes that affect other files are common enough to matter but rare enough that the check won’t run on every loop iteration.

The orchestrator protocol itself grew from 1,075 tokens to approximately 1,450 tokens, adding ~375 to the always-on budget. New total: roughly 4,100 tokens, up from 3,723. A 10% increase.

Is it worth it? I genuinely don’t know yet. If the drift check catches significant deviations in 1 out of 10 multi-subtask sessions, it pays for itself easily. If it fires once in a hundred sessions, it’s dead weight. The harness-mechanic, once it has enough trace data, will be the one to make that call. I’m deliberately not guessing.


What Remains Unsolved

The @systematicls article identifies context anxiety as the single biggest problem during implementation. “As a function of time, agents become more and more desperate to end the session.” I’ve seen this too: quality degrades as sessions get longer, shortcuts multiply, and the agent starts treating every remaining task as something to rush through.

Our mitigation is partial. The .continue-here.md file captures state before context gets large, and the anti-compression doctrine (regenerate from source files, never compress an existing summary) prevents information loss during handoffs. But there’s no proactive detection of “your context is bloating, time to hand off.” The agent doesn’t know it’s getting anxious. By the time the symptoms are visible in the output, the damage is already cascading.

The article suggests “smart session handoffs where you can relieve your agents of their context.” This is architecturally simple (structured handoff documents with repository-aware compaction) but operationally hard in Claude Code, which manages its own context window. You’d need the orchestrator to monitor token usage, detect degradation signals, and trigger a session break. We’re not there yet.

There’s also a broader question that I keep circling back to: how much supervision is too much? Every guard, every check, every isolated judge is a bet that the agent will fail in a specific way. If frontier models get better at long-context reasoning, some of these mechanisms become dead weight. If they don’t, we’ll need more. The harness-mechanic’s job, eventually, is to detect which guards are earning their token cost and which are insurance premiums against risks that never materialize.

For now, the traces will accumulate. The four mechanisms are live. The next post in this series will have actual data: how often the drift check fires, how many blast radius contradictions surface, whether the anti-stub rule ever triggers a CRITICAL. Numbers, not intuitions. That was the promise of post five, and I intend to keep it.


This post was written with Claude Code (claude-forge orchestrator) and reviewed by Gemini via /second-opinion. The implementation is at maroffo/claude-forge on the feat/meta-harness branch. Four files changed: agents/software-engineer/AGENT.md (+R5, +R6), rules/orchestrator-protocol.md (+step 1b drift check, +step 5b blast radius), rules/quality-gates.md (+6 new scoring criteria), LEARNING.md (retrospective).

Part of a series: post 1 (skills), post 2 (context window), post 3 (planning), post 4 (second brain), post 5 (measurement).