
odu-runner — a runner you can reach with nothing running

feature · seedling · accepted

Phase 0 of odu-web, owned by the runner: split serve from run so the socket outlives any single pipeline, give runs an identity a ledger can hold (repo × sha × seq × node), and decide who keeps the long-lived coordinator alive — priced against the warm pu-box pool. Everything odu-web consumes; nothing odu-web should have to build.

odu-web opens with “Phase 0 · prerequisites in the runner” and points back at odu’s backlog. This note is that phase given its own work order: what the runner must become before a service can sit on top of it, and why each piece is the runner’s job rather than the service’s.

Status: accepted · maturity seedling · R2 (run identity) shipped in juspay/odu#28, consumed here via the npins update odu bump in this PR; R1 (serve/run split) rides the surface-daemon spine (sequenced behind kaval B1/B2) and R3 (lifecycle) follows it · consumer: odu-web Phase 1 (the ledger R2 just built) and Phase 2 (the live observer, still waiting on R1’s idle attach)

[Diagram: faces (TUI · MCP · odu-web, its Phase 2) attach at any time, idle included, to odu serve, the long-lived coordinator. odu serve holds .ci/odu.sock (outlives any run) and the runs cell (identity: repo × sha × seq), exposes run as a procedure on the surface, and spawns runs into itself. Per-run execution is unchanged: DAG · nodes · nodeLog · rerun, with platform lanes over HostSession. On settle, per-SHA logs + verdicts form the trail odu-web ingests, which survives the server.]
Phase 0 as one inversion. Faces (top) attach to a long-lived server that owns the socket and a runs collection; per-run execution (the proven DAG machinery) is unchanged underneath, spawned into the server instead of being the process. Verdicts flow out to the per-SHA trail odu-web's ledger ingests.

Serve, don’t spawn — the socket outlives the run

Today the answer to “is CI okay?” depends on whether a process happens to be alive. odu status, logs, attach, and the MCP face all dial .ci/odu.sock — in-band, typed, the whole point — but the coordinator serving that socket exists only between run’s first node and its verdict. The odu note flagged this honestly at Phase 1 (“idle attach — a runner you can reach with no run live — moved to Phase 2 with the long-lived-runner question”) and the MCP face inherited the shape: its run tool spawns the coordinator, which is why run is an MCP tool at all rather than just another surface call.

The split: odu serve owns the socket; run becomes a procedure on the surface. The CLI UX does not change — odu run ensures a server (starting one if absent) and invokes the procedure; attach connects whether anything is running or not. What changes is what attach shows when nothing is live:

odu attach — idle, nothing running
$ odu attach
» dialing .ci/odu.sock … connected (idle — 0 running)
odu · juspay/kolu ● serving
recent runs
  seq  sha      ref       verdict  when
  47   26d2c2d  master    ✓ green  2h ago
  46   53c0889  PR #1291  ✗ e2e    5h ago
  45   b01c635  master    ✓ green  1d ago
r run HEAD    [enter] open a run    q quit

Two design facts make this cheap rather than risky. First, the serve-only failure mode is already banked knowledge: justci built an MCP server and reverted it (#22 · MCP server for ci, #18) because launching its server auto-ran every recipe. A runner that owns the DAG as idle state has start/serve separation by construction, and Phase 0 merely promotes that property from “by construction” to “by contract.” Second, the execution half doesn’t move: the DAG machinery, platform lanes over HostSession, status posting, and per-SHA logs are the proven substrate; they get spawned into the server instead of being the process.

The MCP face simplifies as a side effect: run stops being the odd tool that forks a process and becomes the same thin projection as the other four — and wait_for_settle can finally outlive the thing it waits on without holding the coordinator’s stdio hostage.
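The inversion can be sketched as a small in-memory model. This is an illustration under stated assumptions, not odu's implementation: the Server class, ensureServer, and the registry keyed by socket path are hypothetical stand-ins, and no real socket is opened.

```typescript
// Hypothetical in-memory model of the serve/run split. No real sockets are used.
type Verdict = "running" | "green" | "red";

interface Run {
  seq: number;
  sha: string;
  verdict: Verdict;
}

// The server owns the runs collection; a run is a procedure invoked on it.
class Server {
  private runs: Run[] = [];

  // The `run` procedure: spawns a run into the server and returns its record.
  run(sha: string): Run {
    const started: Run = { seq: this.runs.length + 1, sha: sha, verdict: "running" };
    this.runs.push(started);
    return started;
  }

  // Record a verdict for a run that is known to the server.
  settle(seq: number, verdict: "green" | "red"): void {
    for (const r of this.runs) {
      if (r.seq === seq) {
        r.verdict = verdict;
        return;
      }
    }
    throw new Error(`unknown run ${seq}`);
  }

  // What an attaching face sees: valid whether or not anything is running.
  attach(): { running: number; recent: Run[] } {
    const running = this.runs.filter((r) => r.verdict === "running").length;
    return { running: running, recent: this.runs.slice().reverse() };
  }
}

// Stand-in for "is something already listening on this socket path?".
const servers: { [socketPath: string]: Server } = {};

// What `odu run` would do first: reuse the server if present, start one if absent.
function ensureServer(socketPath: string): Server {
  let server = servers[socketPath];
  if (server === undefined) {
    server = new Server();
    servers[socketPath] = server;
  }
  return server;
}
```

In this model, attaching before any run has started returns running: 0 and an empty recent list rather than failing, which is the idle-attach property the split is meant to provide.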

Run identity — the tuple a ledger can hold

odu’s surface today is implicitly singular: the nodes cell, the nodeLog stream — correct while the coordinator and the run are the same object, meaningless the moment one server hosts run 46 and run 47. odu-web’s ledger and run pages need to reference runs from outside any process lifetime, which sets the requirement: identity must be mintable by the runner, durable in the trail, and meaningful in a URL.

The tuple is (repo, sha, seq), with seq included because the same SHA runs more than once (a rerun after an infra flake is a new run, not a mutation of the old one’s history), and node addresses within a run. On the surface this lands as one new collection and a scope parameter on the existing primitives:

| Surface call | Today | After Phase 0 |
| --- | --- | --- |
| runs.get({}) | (none) | New cell: every run the server knows, with identity, ref, verdict, and timestamps. The idle-attach screen and odu-web’s ledger ingest are both projections of it. |
| nodes.get({}) | The run’s nodes | nodes.get({ run? }): defaults to the latest run, so every shipped face keeps working unmodified. |
| nodeLog.get({ id }) | The run’s node log | nodeLog.get({ run?, id }): same default. |
| node.rerun({ id }) | Reset + reschedule | Unchanged semantics, latest-run scope; rerunning a finished run’s node means starting a new run (a seq bump), keeping history append-only. |
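A minimal sketch of the optional run scope, assuming runs are addressed by their seq number within one server. The Surface class, the method names nodesGet and nodeLogGet, and the record shapes are hypothetical; only the call names and the optional run parameter come from the table above.

```typescript
// Hypothetical record shapes; the note does not specify them.
interface NodeState {
  id: string;
  state: "pending" | "running" | "ok" | "failed";
}

interface RunRecord {
  seq: number;
  sha: string;
  nodes: NodeState[];
  logs: { [nodeId: string]: string[] };
}

class Surface {
  private records: RunRecord[];

  // Records are assumed to be ordered oldest to newest.
  constructor(records: RunRecord[]) {
    this.records = records;
  }

  // Resolve an optional run scope. Omitting it selects the latest run,
  // which matches the single-run behavior faces rely on today.
  private resolve(run?: number): RunRecord {
    if (run === undefined) {
      const latest = this.records[this.records.length - 1];
      if (latest === undefined) throw new Error("no runs yet");
      return latest;
    }
    for (const r of this.records) {
      if (r.seq === run) return r;
    }
    throw new Error(`unknown run ${run}`);
  }

  // Models nodes.get({ run? }).
  nodesGet(args: { run?: number } = {}): NodeState[] {
    return this.resolve(args.run).nodes;
  }

  // Models nodeLog.get({ run?, id }); an unknown node id yields an empty log.
  nodeLogGet(args: { run?: number; id: string }): string[] {
    return this.resolve(args.run).logs[args.id] || [];
  }
}
```

A caller that never passes run sees only the newest run, so existing faces need no change; a caller that passes run: 46 reads the older run without affecting anyone else.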

The defaulting rule is the compatibility story: latest-run scope is exactly today’s behavior, so the TUI, the /ci skill, and the MCP tools are correct on day one, and grow run parameters at leisure. The append-only rule is the ledger story: odu-web never has to model “history changed,” only “a run was added” — which is also what keeps the per-SHA on-disk trail (already durable past runner death, by Phase 1 design) a faithful serialization of the same identity rather than a second bookkeeping scheme.
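The append-only rule can be illustrated with a small ledger model. This is a sketch under assumptions: seq is treated as a per-repo counter (consistent with the idle-attach listing, where consecutive seq values span different SHAs), and the Ledger class and the runPath URL layout are hypothetical.

```typescript
// The identity tuple from the note: repo × sha × seq.
interface RunId {
  repo: string;
  sha: string;
  seq: number;
}

class Ledger {
  private entries: RunId[] = [];

  // Mint the next identity. Assumption: seq counts up per repo, not per SHA.
  mint(repo: string, sha: string): RunId {
    let max = 0;
    for (const e of this.entries) {
      if (e.repo === repo && e.seq > max) max = e.seq;
    }
    const id: RunId = { repo: repo, sha: sha, seq: max + 1 };
    this.entries.push(id);
    return id;
  }

  // Rerunning a finished run never edits it: it mints a new run for the same SHA.
  rerun(prev: RunId): RunId {
    return this.mint(prev.repo, prev.sha);
  }

  // A copy of the history, so callers cannot mutate the ledger's own array.
  all(): RunId[] {
    return this.entries.slice();
  }
}

// Hypothetical URL layout; the note only requires identity to be meaningful in a URL.
function runPath(id: RunId): string {
  return `/${id.repo}/runs/${id.seq}/${id.sha}`;
}
```

With this shape, a consumer only ever observes the list growing: a rerun of seq 1 appears as seq 2 with the same sha, and the entry for seq 1 is never rewritten.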

Lifecycle and pricing — who keeps it alive

A long-lived process is a cost, and the odu note priced the warning into its own roadmap: “price the idle-runner lifecycle against the warm pu-box pool.” The pool (kolu-ci-1..8, leased via ci/pu/lease.sh) already solved the adjacent problem — keep the expensive thing warm (a Nix store on a Linux box) so the cheap thing (a run) starts fast. Phase 0 should not blur that: the coordinator is not the expensive thing. It is a small node process serving a socket, and its lifecycle has three honest options, not one:

The kaval overlap — shared spine, different soul

kaval, the PTY daemon kolu is splitting out in four PRs (#1291 · the plan), needs the identical machinery: a pid-gated entry, a unix socket that outlives its clients, a contract handshake on every connect, an endpoint state machine, spawn/respawn drivers, a composed restart. That convergence now has its own plan of record, surface-daemon, which names the shared spine, sequences its extraction into @kolu/surface(-daemon) after kaval’s B2 has soaked in production, and lists odu serve as the second consumer that clears the electricity bar. The practical consequence for this note: R1 and R3 consume that spine rather than hand-rolling a copy, so they sequence behind kaval B1/B2, while R2 (run identity) has no kaval dependency at all and is the piece odu-web Phase 1 actually blocks on, so it can land first.

What does not transfer is the soul. kaval holds irreplaceable kernel state — live PTY fds — so its survival phase (B3: adoption, reconciliation) is its whole point. odu serve holds replaceable orchestration state: the trail is durable, runs are append-only, a lost run is a seq-bump rerun. So none of B3 crosses over, and this note’s “live-state resurrection stays out of scope” survives the deduplication. Same mechanism, opposite policies — which is exactly the evidence the mechanism belongs in a library.

Crash semantics follow the same line the gate half already drew: a dying lane posts error/Errored rather than wedging (shipped in Phase 1 of the odu plan), and a dying server must degrade the same way — in-flight runs marked errored in the trail, finished runs untouched, restart resuming an empty-but-serving state. Runner-restart survival of live state stays out of scope, exactly as the odu ledger scoped it: the durable artifact is the trail, and resurrection of a half-run DAG buys complexity that a seq-bump rerun buys back for free.
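The degrade-on-death rule can be sketched as a pure function over the trail. The entry shape, the "errored" verdict label, and the names recoverTrail and restart are assumptions made for illustration; the note specifies only the behavior (in-flight runs become errored, finished runs stay untouched, restart yields an empty-but-serving state).

```typescript
type TrailVerdict = "running" | "green" | "red" | "errored";

interface TrailEntry {
  seq: number;
  sha: string;
  verdict: TrailVerdict;
}

// Applied after a server death: any run still marked running cannot have
// settled, so it is recorded as errored. Settled entries pass through as-is.
// Returns a new array and does not mutate the input.
function recoverTrail(trail: TrailEntry[]): TrailEntry[] {
  return trail.map((e): TrailEntry =>
    e.verdict === "running" ? { seq: e.seq, sha: e.sha, verdict: "errored" } : e
  );
}

// Restart resumes with nothing live: no half-run DAG is resurrected.
function restart(trail: TrailEntry[]): { running: number; trail: TrailEntry[] } {
  return { running: 0, trail: recoverTrail(trail) };
}
```

Recovering the interrupted work is then a matter of starting a new run for the same SHA, which keeps the trail append-only.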

Phases

Each lands alone; together they retire odu-web’s Phase 0 row.