odu-runner — a runner you can reach with nothing running
Phase 0 of odu-web, owned by the runner: split serve from run so the socket outlives any single pipeline, give runs an identity a ledger can hold (repo × sha × seq × node), and decide who keeps the long-lived coordinator alive — priced against the warm pu-box pool. Everything odu-web consumes; nothing odu-web should have to build.
odu-web opens with “Phase 0 · prerequisites in the runner” and points back at odu’s backlog. This note is that phase given its own work order: what the runner must become before a service can sit on top of it, and why each piece is the runner’s job rather than the service’s.
Status: accepted · maturity seedling · R2 (run identity) shipped in juspay/odu#28, consumed here via the npins update odu bump in this PR; R1 (serve/run split) rides the surface-daemon spine (sequenced behind kaval B1/B2) and R3 (lifecycle) follows it · consumer: odu-web Phase 1 (the ledger R2 just built) and Phase 2 (the live observer, still waiting on R1’s idle attach)
Serve, don’t spawn — the socket outlives the run
Today the answer to “is CI okay?” depends on whether a process happens to be alive. odu status, logs, attach, and the MCP face all dial .ci/odu.sock — in-band, typed, the whole point — but the coordinator serving that socket exists only between run’s first node and its verdict. The odu note flagged this honestly at Phase 1 (“idle attach — a runner you can reach with no run live — moved to Phase 2 with the long-lived-runner question”) and the MCP face inherited the shape: its run tool spawns the coordinator, which is why run is an MCP tool at all rather than just another surface call.
The split: odu serve owns the socket; run becomes a procedure on the surface. The CLI UX does not change: odu run ensures a server (starting one if absent) and invokes the procedure; attach connects whether anything is running or not. What changes is what attach shows when nothing is live: a (possibly empty) runs list instead of a failed connection.
Two design facts make this cheap rather than risky. First, the serve-only failure mode is already banked knowledge: justci built an MCP server and reverted it (#22 · MCP server for ci, #18) because launching its server auto-ran every recipe. A runner that owns the DAG as idle state has start/serve separation by construction, and Phase 0 merely promotes that property from “by construction” to “by contract.” Second, the execution half doesn’t move: the DAG machinery, platform lanes over HostSession, status posting, and per-SHA logs are the proven substrate; they get spawned into the server instead of being the process.
The MCP face simplifies as a side effect: run stops being the odd tool that forks a process and becomes the same thin projection as the other four — and wait_for_settle can finally outlive the thing it waits on without holding the coordinator’s stdio hostage.
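The ensure-then-invoke behavior described in this section can be sketched roughly as follows. This is a synchronous, in-memory model rather than odu's implementation: Surface, Host, ensureServer, and FakeHost are hypothetical names introduced for illustration, and real socket I/O would be asynchronous.

```typescript
// Hypothetical shapes; none of these names come from odu's real surface.
interface Surface {
  call(procedure: string, args: Record<string, unknown>): unknown;
}

interface Host {
  connect(): Surface | undefined; // undefined: nothing is serving the socket
  startServer(): void;            // stands in for launching `odu serve`
}

// Reuse a live server if one answers; otherwise start one and connect to it.
function ensureServer(host: Host): Surface {
  const existing = host.connect();
  if (existing !== undefined) return existing;
  host.startServer();
  const started = host.connect();
  if (started === undefined) throw new Error("server failed to start");
  return started;
}

// `run` is just another procedure call once a server is guaranteed.
function run(host: Host, ref: string): unknown {
  return ensureServer(host).call("run", { ref });
}

// In-memory stand-in for the socket and the server behind it.
class FakeHost implements Host {
  starts = 0;
  readonly runs: string[] = [];
  private server: Surface | undefined = undefined;

  connect(): Surface | undefined {
    return this.server;
  }

  startServer(): void {
    this.starts += 1;
    this.server = {
      call: (procedure, args) => {
        if (procedure === "run") {
          this.runs.push(String(args["ref"]));
          return this.runs.length; // number of runs the server has seen
        }
        if (procedure === "runs.get") return [...this.runs];
        throw new Error(`unknown procedure: ${procedure}`);
      },
    };
  }
}
```

In this model, two consecutive run calls against a fresh FakeHost start the server once; the second call finds it already serving, which is the property that lets the socket outlive any single run.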
Run identity — the tuple a ledger can hold
odu’s surface today is implicitly singular: the nodes cell, the nodeLog stream — correct while the coordinator and the run are the same object, meaningless the moment one server hosts run 46 and run 47. odu-web’s ledger and run pages need to reference runs from outside any process lifetime, which sets the requirement: identity must be mintable by the runner, durable in the trail, and meaningful in a URL.
The tuple is (repo, sha, seq) — seq because the same SHA runs more than once (a rerun after an infra flake is a new run, not a mutation of the old one’s history) — and node addresses within a run. On the surface this lands as one new collection and a scope parameter on the existing primitives:
| Surface call | Today | After Phase 0 |
|---|---|---|
| runs.get({}) | (none) | New cell: every run the server knows, with identity, ref, verdict, timestamps. The idle-attach screen and odu-web’s ledger ingest are both projections of it. |
| nodes.get({}) | The run’s nodes | nodes.get({ run? }), defaulting to the latest run, so every shipped face keeps working unmodified. |
| nodeLog.get({ id }) | The run’s node log | nodeLog.get({ run?, id }), same default. |
| node.rerun({ id }) | Reset + reschedule | Unchanged semantics, latest-run scope; rerunning a finished run’s node means starting a new run (a seq bump), keeping history append-only. |
The defaulting rule is the compatibility story: latest-run scope is exactly today’s behavior, so the TUI, the /ci skill, and the MCP tools are correct on day one, and grow run parameters at leisure. The append-only rule is the ledger story: odu-web never has to model “history changed,” only “a run was added” — which is also what keeps the per-SHA on-disk trail (already durable past runner death, by Phase 1 design) a faithful serialization of the same identity rather than a second bookkeeping scheme.
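A minimal sketch of the two rules, under stated assumptions: seq is counted per (repo, sha) and starts at 1, and "latest" means the most recently appended run. Neither detail is specified above, and the type and function names (RunId, Run, nextSeq, startRun, resolveRun) are hypothetical.

```typescript
interface RunId {
  readonly repo: string;
  readonly sha: string;
  readonly seq: number;
}

interface Run {
  readonly id: RunId;
  readonly nodes: readonly string[];
}

// Append-only rule: a rerun of the same commit gets the next seq.
// Assumption: seq is per (repo, sha) and starts at 1.
function nextSeq(runs: readonly Run[], repo: string, sha: string): number {
  const seqs = runs
    .filter((r) => r.id.repo === repo && r.id.sha === sha)
    .map((r) => r.id.seq);
  return seqs.length === 0 ? 1 : Math.max(...seqs) + 1;
}

// Starting a run never mutates an existing entry; it returns a longer list.
function startRun(
  runs: readonly Run[],
  repo: string,
  sha: string,
  nodes: readonly string[],
): Run[] {
  const id: RunId = { repo, sha, seq: nextSeq(runs, repo, sha) };
  return [...runs, { id, nodes }];
}

// Defaulting rule: an omitted run scope resolves to the latest run.
// Assumption: "latest" is the most recently appended entry.
function resolveRun(runs: readonly Run[], run?: RunId): Run | undefined {
  if (run === undefined) return runs[runs.length - 1];
  return runs.find(
    (r) => r.id.repo === run.repo && r.id.sha === run.sha && r.id.seq === run.seq,
  );
}
```

A caller that never passes a run keeps seeing one current run, as it does today, while a ledger can pin any earlier run by its full tuple.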
Lifecycle and pricing — who keeps it alive
A long-lived process is a cost, and the odu note priced the warning into its own roadmap: “price the idle-runner lifecycle against the warm pu-box pool.” The pool (kolu-ci-1..8, leased via ci/pu/lease.sh) already solved the adjacent problem — keep the expensive thing warm (a Nix store on a Linux box) so the cheap thing (a run) starts fast. Phase 0 should not blur that: the coordinator is not the expensive thing. It is a small node process serving a socket, and its lifecycle has three honest options, not one:
- Foreground, operator-owned: odu serve in a terminal (or under the kolu app), dying with the session. Zero new infrastructure; idle attach works while you work. The right default for the single-operator loop, and Phase 0’s exit bar.
- Supervised, machine-owned: a home-manager / systemd unit, already named on odu’s graduation roadmap. The right shape under odu-web, where the server and the service share a host and the unit is the service’s substrate, not the operator’s.
- On-demand with durable trail: no resident process at all. run auto-starts a server that lingers (idle timeout) and exits, because the ledger, not the live server, is what answers history questions. This is the honest fallback that keeps “no daemon to register” true for casual consumers of nix run github:juspay/odu.
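The three modes differ mainly in who decides when the server exits. A minimal sketch of that decision, assuming a hypothetical Lifecycle type and an idleTimeoutMs parameter (neither name is from odu):

```typescript
// Hypothetical model of the three lifecycle options.
type Lifecycle =
  | { mode: "foreground" }                        // operator's session owns exit
  | { mode: "supervised" }                        // systemd / home-manager owns exit
  | { mode: "on-demand"; idleTimeoutMs: number }; // the server lingers, then exits

interface ServerState {
  activeRuns: number;     // runs currently in flight
  lastActivityMs: number; // timestamp of the last run or client activity
}

// Only the on-demand mode exits on its own, and only once it has been idle
// for the full timeout with nothing in flight.
function shouldExit(lifecycle: Lifecycle, state: ServerState, nowMs: number): boolean {
  if (lifecycle.mode !== "on-demand") return false;
  if (state.activeRuns > 0) return false;
  return nowMs - state.lastActivityMs >= lifecycle.idleTimeoutMs;
}
```

Keeping the timeout as data rather than behavior matches the later framing that lifetime is a parameter while the choice of mode is policy.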
The kaval overlap — shared spine, different soul
kaval, the PTY daemon kolu is splitting out in four PRs (#1291 · the plan), needs the identical machinery: a pid-gated entry, a unix socket that outlives its clients, a contract handshake on every connect, an endpoint state machine, spawn/respawn drivers, a composed restart. That convergence now has its own plan of record, surface-daemon, which names the shared spine, sequences its extraction into @kolu/surface(-daemon) after kaval’s B2 has soaked in production, and lists odu serve as the second consumer that clears the electricity bar. The practical consequence for this note: R1 and R3 consume that spine rather than hand-rolling a copy (they sequence behind kaval B1/B2), while R2 (run identity) has no kaval dependency at all and is the piece odu-web Phase 1 actually blocks on, so it can land first.
What does not transfer is the soul. kaval holds irreplaceable kernel state — live PTY fds — so its survival phase (B3: adoption, reconciliation) is its whole point. odu serve holds replaceable orchestration state: the trail is durable, runs are append-only, a lost run is a seq-bump rerun. So none of B3 crosses over, and this note’s “live-state resurrection stays out of scope” survives the deduplication. Same mechanism, opposite policies — which is exactly the evidence the mechanism belongs in a library.
Crash semantics follow the same line the gate half already drew: a dying lane posts error/Errored rather than wedging (shipped in Phase 1 of the odu plan), and a dying server must degrade the same way — in-flight runs marked errored in the trail, finished runs untouched, restart resuming an empty-but-serving state. Runner-restart survival of live state stays out of scope, exactly as the odu ledger scoped it: the durable artifact is the trail, and resurrection of a half-run DAG buys complexity that a seq-bump rerun buys back for free.
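The degradation rule can be sketched as a pure function over the trail. The status names and the TrailEntry and ServingState shapes are assumptions for illustration; the shipped record format may differ.

```typescript
// Hypothetical trail model; status names are illustrative only.
type RunStatus = "running" | "passed" | "failed" | "errored";

interface TrailEntry {
  readonly id: string;
  readonly status: RunStatus;
}

interface ServingState {
  readonly trail: readonly TrailEntry[]; // durable history, served after restart
  readonly liveRuns: readonly string[];  // always empty after a restart
}

// On restart: in-flight runs become errored, finished runs are untouched,
// and the server resumes with no live runs (no resurrection of a half-run DAG).
function recoverAfterCrash(trail: readonly TrailEntry[]): ServingState {
  const recovered = trail.map((entry): TrailEntry =>
    entry.status === "running" ? { ...entry, status: "errored" } : entry,
  );
  return { trail: recovered, liveRuns: [] };
}
```

The function is idempotent: recovering an already recovered trail changes nothing, so repeated restarts cannot rewrite history.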
Phases
Each lands alone; together they retire odu-web’s Phase 0 row.
- R2 · run identity (shipped). Shipped in juspay/odu#28 (through the full lens + codex + simplify + police gauntlet, odu-on-odu CI green) and consumed here via npins update odu. Every terminal run writes a durable RunRecord: (repo, sha, seq) identity + tri-state outcome (passed / failed / incomplete) + timing + lane→host map + a per-node snapshot. It is written to .ci/<sha7>/runs/<seq>.json on natural completion, each linger drain, and the shared shutdown (so a cancelled/interrupted run records too, marked incomplete). odu runs [-o json] lists the ledger straight off disk, making it the first command that works against an idle checkout, and the agent face gained a read-only runs MCP tool over the same trail. Scope refined in the build: the original sketch’s live runs cell + run-scope params on nodes / nodeLog / rerun were deliberately not built, because they’re meaningless until R1’s long-lived multi-run server exists. R2 delivered the durable on-disk ledger (the rows a service face reads) instead, exactly the “applies to today’s run-scoped coordinator unchanged” path. Exit met: a CLI odu run and an MCP-spawned run land in one odu runs list with stable sha#seq ids, the ids odu-web Phase 1 puts in ledger rows and target_urls. The seq component makes a rerun of one commit a new record, not an overwrite.
- R1 · the serve/run split (rides the kaval spine). odu serve owns .ci/odu.sock; run becomes a surface procedure; the CLI keeps its exact UX by ensuring a server before invoking it. The MCP face’s run tool becomes a thin projection like its siblings. The entry (pid-gate with a per-repo scope key, serve loop, handshake-on-connect) comes from the spine (surface-daemon; both halves are packages from birth, kaval B1/B2), so this sequences behind kaval B2 and S1’s handshake move. Exit: odu attach connects with nothing running and shows a (possibly empty) runs list, without odu defining a pid-gate, entry sequence, or handshake of its own.
- R3 · lifecycle, crash semantics, pricing. The three lifecycle modes made real: foreground default, the home-manager unit (graduation roadmap item), idle-timeout auto-serve as the no-daemon fallback. Lifetime is a spine parameter; the mode choice is this program’s policy. The supervisor half (endpoint states, waitForPidGone, composed restart) likewise comes from surface-daemon. Server death marks in-flight runs errored in the trail and restarts clean. The pricing note is written against the warm pu-box pool: what stays warm, what spawns, and why the coordinator is never the expensive half. Exit: kill the server mid-run; the trail shows errored, restart serves history, nothing wedges.
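To make the R2 record concrete, here is a hedged sketch of its shape and addressing. The tri-state outcome, the (repo, sha, seq) identity, the .ci/<sha7>/runs/<seq>.json path pattern, and the sha#seq id shape come from the text above. The field names, the reading of sha7 as the first seven characters of the SHA, and the use of the full SHA in the id are assumptions.

```typescript
type Outcome = "passed" | "failed" | "incomplete";

// Field names are assumed; only the listed ingredients come from the note.
interface RunRecord {
  repo: string;
  sha: string;
  seq: number;
  outcome: Outcome;
  startedAt: string;             // timing (assumed ISO-8601 strings)
  finishedAt: string;
  lanes: Record<string, string>; // lane -> host map
  nodes: Record<string, string>; // per-node snapshot (assumed: node -> state)
}

// Assumption: <sha7> is the first seven characters of the commit SHA.
function recordPath(record: Pick<RunRecord, "sha" | "seq">): string {
  return `.ci/${record.sha.slice(0, 7)}/runs/${record.seq}.json`;
}

// Assumption: the id uses the full SHA; the note only gives the sha#seq shape.
function runId(record: Pick<RunRecord, "sha" | "seq">): string {
  return `${record.sha}#${record.seq}`;
}

// Inverse of runId; returns undefined for anything that is not sha#<positive int>.
function parseRunId(id: string): { sha: string; seq: number } | undefined {
  const hash = id.lastIndexOf("#");
  if (hash <= 0) return undefined;
  const seq = Number(id.slice(hash + 1));
  if (!Number.isInteger(seq) || seq < 1) return undefined;
  return { sha: id.slice(0, hash), seq };
}
```

Because the path and the id are both derived from the same two fields, a rerun of one commit lands in a new file next to the old one rather than overwriting it.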