
LLM Workflow Autonomy — A Self-Improvement Loop

Analysis · budding · proposed

A recurring audit of how autonomous our PR workflow is. Round 1 covers today's workflow, /be — 88 runs reveal where it still needs a human, the levers to close the gap, and the repeatable method to re-run the check as the workflow evolves.

Our PR workflow should run autonomously: one decision from the human up front, then a shipped, reviewed PR with zero further turns. Today that workflow is /be — but the workflow will change, so this note is framed as a recurring LLM-autonomy self-improvement loop, not a one-off /be post-mortem. This round measures how autonomous /be actually is, mined from the real session logs, and proposes how to close the gap. The last section is the runbook so we can re-run the check against whatever drives our PRs next and watch the number move.

Method: crawled every Claude Code JSONL session in a kolu worktree from the last ~4 weeks, kept the 88 that genuinely invoked /be, extracted only the human-typed turns (the follow-up prompts — the manual interventions), and fanned out a Workflow that classified all 493 interventions, clustered them, and ranked the fixes. Date: 2026-06-19 · window 2026-05-30 → 2026-06-19.

The autonomy gap

Across 88 runs the mean autonomy score was 53/100 and the median 54 — a coin-flip. Only 16 runs (18%) ran fully clean (the initial /be plus nothing else); the largest band, 25 runs, sat at 0–19 (a wall of interruptions). The worst — reload-error (32 interventions), pty-daemon-phase-b (20), video (18), ccloading/cc-scrape (17) — were multi-hour slogs the human hand-drove stage by stage.

| Metric | Value |
| --- | --- |
| /be runs analyzed | 88 (2026-05-30 → 2026-06-19) |
| Total manual interventions | 493 |
| Mean interventions per run | 5.6 |
| Mean / median autonomy | 53 / 54 out of 100 |
| Fully autonomous runs (0 interventions) | 16 of 88 (18%) |
| Severity mix | 112 blockers · 211 corrections · 165 nudges · 5 preferences |

| Autonomy band | Runs |
| --- | --- |
| 100 (clean) | 16 (good) |
| 80–99 | 15 |
| 60–79 | 11 |
| 40–59 | 12 |
| 20–39 | 9 |
| 0–19 (hand-driven) | 25 (bad) |
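As a quick consistency check on the reported figures, the headline numbers can be re-derived from the band and severity counts. This is only arithmetic on values quoted in this note; the dictionary keys are labels chosen here for illustration.

```python
# Counts copied from the metric and band tables in this note.
bands = {"100": 16, "80-99": 15, "60-79": 11, "40-59": 12, "20-39": 9, "0-19": 25}
severity = {"blocker": 112, "correction": 211, "nudge": 165, "preference": 5}

total_runs = sum(bands.values())              # should equal the 88 analyzed runs
total_interventions = sum(severity.values())  # should equal the 493 interventions

mean_interventions = round(total_interventions / total_runs, 1)
clean_share = round(100 * bands["100"] / total_runs)        # percent of fully clean runs
hand_driven_share = round(100 * bands["0-19"] / total_runs)  # percent in the worst band

print(total_runs, total_interventions, mean_interventions, clean_share, hand_driven_share)
```

The bands and the severity mix both reconcile with the totals, and the hand-driven band alone accounts for more than a quarter of all runs.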

What the clean runs have in common is the lesson. The 16 fully-autonomous runs (click-file-ref, dot-in-link, cc-fork, sick-coat, old-defect, …) were overwhelmingly well-specified bug issues with a pre-existing @skip e2e scenario: /be had an unambiguous target and a ready-made RED test, so it ran end-to-end unattended. Autonomy is not a model-capability problem; it is an enforcement + specification problem. The fix is to make the contract deterministic (exit codes, not prose) and to give every run the ground truth the clean runs happened to have.

[Figure: flowchart of the enforcement backbone.
Main flow: User: /be <task> → §0 Interview (the ONE sanctioned question) → §1–§5 stages (setup · implement · PR · gauntlet · ci+evidence) → Done (only when every guard passes).
Run-state spine (.do-results.json): stage · pr · ci · evidence · verified. /be does NOT write this today; proposed: write at each boundary.
Guards (exit codes, not prose) read the spine to block & auto-resume: the Stop hook (extend) blocks turn-end while active=working and adds Done post-conditions (ci-green · evidence · gauntlet-comments · repro-green); PreToolUse hooks (new) block prod-kolu kill / prod-port bind and block AskUserQuestion after §0.
Human (today: the gate): 'continue' · 'run ci' · 'DONT KILL PROD' · 'post evidence'; today: manual re-drive.]
The proposed enforcement backbone. Today the green spine and the guard boxes do not exist for /be — its stages are connected by prose only, so the run yields the turn and the human (dashed) re-drives it. The fix: /be writes a run-state spine at every stage boundary, and the existing Stop hook (extended) plus new PreToolUse hooks read it to block turn-end and dangerous actions — converting 'continue', 'run ci', 'don't kill prod', 'post evidence' from human nudges into automatic resume/refusal.

The spine is one artifact with four jobs: (1) Stop-guard self-resume, (2) post-compact / post-crash re-entry, (3) the “interview done” marker a PreToolUse hook reads to block mid-run questions, and (4) the Done post-condition fields. Build it once; four enforcement mechanisms hang off it.
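A minimal sketch of the spine and a Stop-guard decision over it, to make the "one artifact, several readers" idea concrete. The file name .do-results.json and the field names stage, ci, evidence, and verified come from this note; everything else is an assumption made for illustration: the `active` values, the `gauntlet_comments` key, the mapping from fields to post-conditions, and the use of exit code 2 as the "block" signal. The real schema that do-stop-guard.sh understands is not shown in this note and may differ.

```python
import json
from pathlib import Path

# Hypothetical mapping from Done post-conditions to spine fields (assumed schema).
DONE_CHECKS = {
    "ci-green": lambda s: s.get("ci") == "green",
    "evidence": lambda s: bool(s.get("evidence")),
    "gauntlet-comments": lambda s: bool(s.get("gauntlet_comments")),
    "repro-green": lambda s: s.get("verified") is True,
}


def write_spine(path: Path, **fields) -> dict:
    """Merge fields into the spine at a stage boundary; write via a temp file."""
    state = json.loads(path.read_text()) if path.exists() else {}
    state.update(fields)
    tmp = path.with_name(path.name + ".tmp")
    tmp.write_text(json.dumps(state, indent=2))
    tmp.replace(path)  # rename, so a crash never leaves a half-written spine
    return state


def stop_guard(state: dict) -> tuple[int, str]:
    """Return (exit_code, message): 0 allows turn-end, 2 blocks it (assumed convention)."""
    if not state:
        return 0, "no run in progress"
    if state.get("active") == "working":
        return 2, f"run still working at stage {state.get('stage')!r}; resume it"
    unmet = [name for name, ok in DONE_CHECKS.items() if not ok(state)]
    if unmet:
        return 2, "Done blocked; unmet post-conditions: " + ", ".join(unmet)
    return 0, "all post-conditions met"
```

Because the guard only reads the file, the same state also serves post-compact or post-crash re-entry: a resumed session can load the spine and continue from the recorded stage.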

Where the interventions happen

Every human follow-up was tagged against a fixed taxonomy. Two failure modes dominate: the model builds the wrong thing (it never consults the project’s own sources of truth before coding) and the model stops too early (no deterministic “keep going”).

| Category | n | Severity | The recurring pattern |
| --- | --- | --- | --- |
| wrong-approach | 93 | critical | Commits to a default (fallbacks, hand-rolled config, wrong package boundary) without reading conventions / electricity.mdx; no design-seam self-check before the human sees it |
| incomplete-stopped-early | 85 | critical | Silently halts between stages; human types “continue” / “finish the /be workflow” |
| continue-or-resume-nudge | 59 | critical | Yields after interrupts, compaction, config commands; no auto-resume |
| evidence-missing-or-wrong | 46 | critical | Declares done with no / tests-only / unplayable / wrong-target artifact; human is the visual linter |
| requirements-clarification | 40 | high | Launches on terse/argless prompts and a linked spec it never fully read; scope surfaces mid-run |
| tooling-env-failure | 27 | critical | Session limits / crashes with no resumable checkpoint; codex-login & CI policy hit mid-gauntlet; orphaned pu/dev resources |
| regression-introduced | 24 | critical | Ships unexercised changes; kills production kolu; breaks interactive controls; human is the regression detector |
| manual-verification-needed | 22 | high | “tests pass” stands in for “I watched it work”; human re-runs the repro / clicks the control / deploys |
| review-gauntlet-gap | 22 | critical | Gauntlet doesn’t self-execute or self-verify; missing PR comments/commits; lenses miss recurring smells |
| repro-or-test-inadequate | 19 | critical | RED-repro never confirmed red for the right reason; asserts a proxy, not the user’s literal invariant |
| ci-failure | 14 | critical | Yields before CI is green; skips master-sync, downstream PRs, per-node triage |
| overengineering-or-scope | 10 | high | Gold-plated guards, single-use wrapper files; gauntlet doesn’t re-fire on its own later commits |
| pr-hygiene | 9 | high | Commits sit unpushed; title/body drifts; base goes stale (“always push wtf”) |
| subjective-preference | 9 | low | Mostly irreducible taste that should route to /talk; a few undocumented-convention cases |
| other | 10 | low | Branch rot (“merge latest master”), illegible review output, unverified cited issues |
| plan-feedback | 4 | low | First plan draft misses the bar (no prototype, wrong altitude) during the sanctioned §1 pause |

The follow-up prompts themselves are the evidence — the same handful of frustrations, run after run:

Stopped early / nudge — “You must finish the /be workflow.” · “idiot, fucking build it in the PR” · “Continue from where you left off.” · “status?” (after an hour of silence)

Wrong approach — “Doesn’t parcel already support gitignore” · “Being able to ‘override’ is never a feature” · “solid-BROWSER’s only concern is BROWSING. per electricity.mdx”

Regression / destructive — “DO NOT FUCKING KILL PRODUCTION KOLU” · “you killed production kkolu … You are suppose to run the dev server with random ports.” · “wtf, I can no longer click on ‘Update’ button?”

Evidence / verification — “Your evidence is shit.” · “your mp4 doesn’t play either” · “did you test your changes? … both back and fwd buttons remain disabled” · “once you finish the PR, you must re-run your repro”

Gauntlet / CI / hygiene — “where are the codex and lowy/hicky commits?” · “github says red” · “Ignore CI, I merged it.” · “always push wtf”

The common thread: a rule the codebase already documents arrives as an angry mid-run interrupt, because nothing surfaced it at the right moment or stopped the model from shipping past it.

The roadmap

Fourteen deduplicated improvements, ranked by impact ÷ effort. The program is mechanical, not prose: the top items convert the most-violated rules into exit-code gates. Targets are the .apm/ sources (they regenerate into .claude/, .codex/, .agents/ via just ai::apm — never edit the generated copies), .agency/do.md, and settings.json hooks.

| # | Improvement | Lever | Effort · Impact |
| --- | --- | --- | --- |
| 1 | Wire /be into the existing Stop guard: write the generic .do-results.json the guard already understands | .apm/skills/be/SKILL.md §1/§Done; reuse do/scripts/do-results. No hook change needed | S · high |
| 2 | First PreToolUse Bash hook: hard-block prod-kolu kills & prod-port binds | new .apm hook → settings.json PreToolUse; cite from dev-server skill | M · high |
| 3 | Full /ci on every touched PR + master-sync at the head of §5 | .apm/skills/be/SKILL.md §5; .agency/do.md ## CI | S · high |
| 4 | (shipped #1418) Codify design philosophy as an always-loaded rule (fail-fast/no-fallbacks · electricity boundaries · reuse-existing-source) and force §2 to read it | .apm/instructions/conventions.instructions.md, conventions.md; be §2 | S · high |
| 5 | Hard-gate Done on real artifacts: extend the Stop guard with ci-green + evidence-present + gauntlet-comments-present | do-stop-guard.sh (apm source); /be writes the fields | M · high |
| 6 | ## CI failure triage policy: named flaky lanes + per-node enumerate/fix-or-waive | .agency/do.md | S · high |
| 7 | Make §4 a self-driving in-process Skill chain; forbid handing reviewers back to the user | .apm/skills/be/SKILL.md §4; be-review | M · high |
| 8 | §3 deliverable-coverage gate + grep-before-assert: never report a PR whose diff doesn’t match the task; never claim “no fallbacks” unchecked | be §3; code-police / fact-check | M · med |
| 9 | Machine-checked RED→GREEN repro & observed-green before Done (tests passing ≠ verified) | be §2; Stop guard verified field | M · high |
| 10 | §0 echoes a concrete task contract; PreToolUse blocks AskUserQuestion after §0 | be §0; new PreToolUse hook | M · med |
| 11 | §2 design-seam self-check: run lowy + hickey on the seam before building, not at §4 | be §2 | M · med |
| 12 | Heartbeat + wall-clock budget + auto-retry on long review/ship sub-skills | be-review, codex-debate, lens-debate, evidence | M · med |
| 13 | Auto-classify evidence necessity from the diff; self-emit “no visual impact” so backend PRs don’t false-block | be §5; evidence §0 | M · med |
| 14 | Lens/police hard probes for recurring smells + re-fire gauntlet on post-gauntlet commits | lens-debate, code-police, be-review | L · med |
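Items 2 and 10 both reduce to a PreToolUse decision function, which might look roughly like the sketch below. It assumes the hook receives the pending tool call as a JSON object with `tool_name` and `tool_input` fields and that exit code 2 refuses the call; check both against the hook documentation before relying on them. The kill pattern, the placeholder port 7681, and the `interview_done` spine key are hypothetical, since this note does not give the real prod process name, port, or marker.

```python
import re

PROD_PORT = 7681  # placeholder only; the real prod port is not stated in the note

# Deliberately coarse, hypothetical patterns: a kill-family command that names
# kolu, or a command that binds the placeholder prod port.
PROD_KILL = re.compile(r"\b(?:pkill|killall|kill)\b[^\n;|&]*\bkolu\b")
PROD_BIND = re.compile(rf"(?:--port[= ]|PORT=){PROD_PORT}\b")


def pre_tool_use(event: dict, spine: dict) -> tuple[int, str]:
    """Return (exit_code, reason): 0 lets the tool call through, 2 refuses it."""
    tool = event.get("tool_name")
    if tool == "Bash":
        command = event.get("tool_input", {}).get("command", "")
        if PROD_KILL.search(command):
            return 2, "refusing to kill a kolu process; stop only your own dev server"
        if PROD_BIND.search(command):
            return 2, "refusing to bind the prod port; run the dev server on a random port"
    if tool == "AskUserQuestion" and spine.get("interview_done"):
        return 2, "the §0 interview is over; decide and continue without asking"
    return 0, ""
```

A real hook would wrap this in a small script that parses the event from stdin, loads the spine, prints the reason to stderr, and exits with the returned code. The coarse kill pattern would also catch dev-only processes, so a production version would need a more precise way to identify the prod instance.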

Re-running this check

This is meant to be a recurring audit: run it after a batch of the fixes lands and watch mean autonomy climb. The whole thing is one inline Workflow fanning out over compact per-run extracts; reproduce it with the four steps below.

1 — Find the /be sessions (kolu worktrees, last ~4 weeks). Logs live at ~/.claude/projects/-home-srid-code-kolu--worktrees-*/*.jsonl. A real /be invocation shows up as a user turn containing <command-name>/be</command-name> (with the task in <command-args>), or a Skill tool-use with "skill":"be". Exclude nested …/subagents/… logs. Catch continuation/resume sessions too — group candidate worktree dirs, then take every session in them within the window (a /be run often spills across resumes, and the resume itself is an intervention signal).
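The session search in step 1 could be sketched as follows. The glob and the two marker strings are taken from the step itself; the function names are illustrative, the raw substring match assumes the logs are written as compact JSON, and the date-window filter is left out for brevity.

```python
import glob
import os

LOG_GLOB = os.path.expanduser(
    "~/.claude/projects/-home-srid-code-kolu--worktrees-*/*.jsonl"
)
# Either marker on any line counts as a real /be invocation.
MARKERS = ("<command-name>/be</command-name>", '"skill":"be"')


def invokes_be(lines) -> bool:
    """True if any line of a session log contains a /be marker."""
    return any(marker in line for line in lines for marker in MARKERS)


def find_be_sessions(pattern: str = LOG_GLOB) -> list[str]:
    """Session logs that invoke /be, skipping nested subagent logs."""
    hits = []
    for path in sorted(glob.glob(pattern)):
        if "/subagents/" in path:
            continue
        with open(path, encoding="utf-8", errors="replace") as fh:
            if invokes_be(fh):
                hits.append(path)
    return hits


def sessions_in_candidate_dirs(hits: list[str]) -> dict[str, list[str]]:
    """Group by worktree dir and take every session in it, so resumes are kept."""
    dirs = sorted({os.path.dirname(path) for path in hits})
    return {d: sorted(glob.glob(os.path.join(d, "*.jsonl"))) for d in dirs}
```

Grouping by directory is what catches continuation sessions: a resumed log may never repeat the /be command, but it lives next to the session that did.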

2 — Extract only the human turns. This is the crux: a "type":"user" entry is not a human prompt — most are tool results. With jq, keep entries where isSidechain==false and toolUseResult==null and the content is text (or an array with text/image blocks; imgs>0 = a pasted screenshot, a strong correction signal). Drop machinery: <local-command-…> stdout/caveats, <task-notification> blocks, and [Image: source:…] placeholders. Tag each turn CMD:/name (slash command — its <command-args>) vs FOLLOWUP (freeform). The first turn is the /be task; every later FOLLOWUP is a candidate intervention. Rank by genuine follow-up count, not raw user-count, and drop trivial sessions (< ~4 human turns). This run kept 88 units from ~96 candidates.
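The note does this filtering with jq; the same logic is sketched here in Python so each rule is explicit. The field names `type`, `isSidechain`, and `toolUseResult` come from step 2, while reading the prompt from `message.content` is an assumption about the log layout, and the helper names are illustrative.

```python
import json
import re

MACHINERY_PREFIXES = ("<local-command-", "<task-notification>")
IMAGE_PLACEHOLDER = re.compile(r"\[Image: source:[^\]]*\]")
COMMAND_NAME = re.compile(r"<command-name>(/[^<]+)</command-name>")


def human_turn(entry: dict):
    """Return {"tag", "text", "imgs"} for a human-typed turn, else None."""
    if entry.get("type") != "user":
        return None
    if entry.get("isSidechain") is not False or entry.get("toolUseResult") is not None:
        return None
    content = (entry.get("message") or {}).get("content")  # assumed location
    if isinstance(content, str):
        text, imgs = content, 0
    elif isinstance(content, list):
        blocks = [b for b in content if isinstance(b, dict)]
        kinds = [b.get("type") for b in blocks]
        if not kinds or any(k not in ("text", "image") for k in kinds):
            return None  # e.g. tool_result blocks are not human prompts
        text = "\n".join(b.get("text", "") for b in blocks if b.get("type") == "text")
        imgs = kinds.count("image")  # a pasted screenshot is a strong correction signal
    else:
        return None
    text = IMAGE_PLACEHOLDER.sub("", text).strip()
    if text.startswith(MACHINERY_PREFIXES) or (not text and imgs == 0):
        return None
    match = COMMAND_NAME.search(text)
    tag = f"CMD:{match.group(1)}" if match else "FOLLOWUP"
    return {"tag": tag, "text": text, "imgs": imgs}


def human_turns(jsonl_lines) -> list[dict]:
    """Parse a session log and keep only the human-typed turns, in order."""
    turns = []
    for line in jsonl_lines:
        if line.strip():
            turn = human_turn(json.loads(line))
            if turn:
                turns.append(turn)
    return turns


def follow_up_count(turns: list[dict]) -> int:
    """Candidate interventions: every FOLLOWUP after the first (task) turn."""
    return sum(t["tag"] == "FOLLOWUP" for t in turns[1:])
```

Ranking sessions by `follow_up_count` rather than by raw user-entry count is what keeps tool results from inflating the intervention numbers.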

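3 — Fan out an analysis Workflow (ultracode makes it adversarial and deep) over the per-run extracts: classify every intervention against the fixed taxonomy, cluster the results, and rank the fixes.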
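The Workflow definition itself is not reproduced in this note, so the following is only a hypothetical shape for the per-intervention record the agents return and for the tally that produces the category table. The category names and the four severity tags are the ones reported above; the `Intervention` class, its `run` field, and the snake-case spelling of `howToPrevent` are assumptions.

```python
from collections import Counter
from dataclasses import dataclass

TAXONOMY = frozenset({
    "wrong-approach", "incomplete-stopped-early", "continue-or-resume-nudge",
    "evidence-missing-or-wrong", "requirements-clarification", "tooling-env-failure",
    "regression-introduced", "manual-verification-needed", "review-gauntlet-gap",
    "repro-or-test-inadequate", "ci-failure", "overengineering-or-scope",
    "pr-hygiene", "subjective-preference", "other", "plan-feedback",
})
SEVERITIES = frozenset({"blocker", "correction", "nudge", "preference"})


@dataclass(frozen=True)
class Intervention:
    run: str             # which /be run the follow-up belongs to
    category: str        # one label from the fixed taxonomy
    severity: str        # blocker / correction / nudge / preference
    how_to_prevent: str  # the howToPrevent field: a concrete lever, not advice

    def __post_init__(self):
        # Reject labels outside the fixed taxonomy so agents cannot invent categories.
        if self.category not in TAXONOMY:
            raise ValueError(f"unknown category: {self.category}")
        if self.severity not in SEVERITIES:
            raise ValueError(f"unknown severity: {self.severity}")


def rank_categories(items: list[Intervention]) -> list[tuple[str, int]]:
    """Categories ordered by intervention count, largest first."""
    return Counter(item.category for item in items).most_common()
```

Validating at construction time keeps the taxonomy fixed across rounds, which is what makes the counts comparable from one audit to the next.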

4 — Feed the agents the levers and verify the claims. Tell each agent where change is possible (the .apm/ skill sources, settings.json hooks, .agency/do.md, the rule files) so howToPrevent is actionable, not “be more careful.” Then, practicing the rigor the note preaches, confirm every load-bearing factual claim before publishing. This round verified that do-stop-guard.sh is /do-only (.claude/hooks/agency/scripts/do-stop-guard.sh:1-15), that settings.json has only a Stop hook, and that conventions.md has no design-philosophy section.