@kolu/surface-daemon — one spine for kaval and odu serve
kaval (the PTY daemon) and odu serve (the long-lived CI coordinator) need the identical lifecycle machinery — pid-gated entry, a unix socket that outlives clients, a contract handshake on every connect, an endpoint state machine, spawn/respawn drivers. This note names that shared spine, says what is mechanism (extract) versus policy (keep per-program), and sequences it: the daemon half is born as the package in kaval B1; the supervisor half is born as its own `@kolu/surface-daemon-supervisor` package in kaval B2 (package boundary = staleKey boundary = spine/soul line); S1 moves the handshake fragment into `@kolu/surface`; odu serve consumes the whole as the second tenant (S2).
Two plans of record arrived at the same machinery from opposite ends in the same week: kaval B1/B2 designs a daemon entry, single-instance gate, handshake, and client-side supervisor for the PTY daemon; odu-runner R1/R3 sketches the same four things for the CI coordinator. This note is the deduplication, named before either hand-rolls a second copy.
Status: accepted. The daemon half shipped as @kolu/surface-daemon in kaval B1 (#1301); the supervisor half is born as @kolu/surface-daemon-supervisor in kaval B2 (revised 2026-06-12; the pty-daemon brief carries both); the durable stdio front frontDaemonOverStdio lands in kaval-sessions P2.5 (#1374) · maturity seedling · first consumer: kaval (B1/B2) · second consumer: odu serve (R1/R3), the second tenant that proves the electricity bar by construction (S2)
Two daemons, one spine
The correspondence is one-to-one, and kaval’s column is the more battle-hardened — every row carries a #1034/#1275 scar and a designed answer:
| Lifecycle concern | kaval (B1/B2 — designed, hazard-annotated) | odu serve (R1/R3 — sketched) |
|---|---|---|
| Process entry | daemonMain: pid-gate → serve loop → SIGTERM teardown | “odu serve owns the socket”; same sequence, unnamed |
| Single instance | Atomic pid-gate — write-temp + link(2), liveness-probe on EEXIST, stale-unlink-retry | Implied, undesigned |
| Socket | $XDG_RUNTIME_DIR/kaval/kaval.sock; app-name parameterized in B0 | .ci/odu.sock, per-repo |
| Skew safety | Contract handshake on every connect; never an import-time throw | Needed identically — a pinned nix run client against a newer resident server |
| Supervision | endpoint.ts: connecting → connected → degraded → dead, status emitted on every transition, keyed by hostId | odu’s connection cell (copying → connecting → connected) — same idea, fewer states |
| Spawn/respawn | systemd-run --user + per-spawn unique units / macOS detached+unref; waitForPidGone (ESRCH poll, load-aware ceiling) | R3’s home-manager mode needs exactly this |
| Restart | One composed sequence, steps non-optional in the type, serialized | R3’s “kill the server mid-run — nothing wedges” |
| Honest death | Degraded state visibly distinct from “you have no terminals” | “errored in the trail, restart serves history” |
| Staleness | staleKey = nix hash of the daemon closure; “what would a restart gain?” | Unaddressed — but a resident odu serve under odu-web is a daemon that can fall a build behind, same question |
The asymmetry in maturity is the sequencing argument: kaval’s spine is designed against eight production failures already paid for (the impossible-by-construction table in pty-daemon), and B2’s staged-prod gate will soak the exact restart race (#1034) that odu serve would otherwise rediscover. The spine should be born there, not invented twice.
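The Supervision row of the table can be made concrete with a small sketch. The four state names and the "status emitted on every transition, keyed by hostId" rule come from the table; the transition table, the `EndpointStatus` shape, and `createEndpoint` are assumptions made for illustration and are not the contents of endpoint.ts.

```typescript
// Hypothetical sketch: state names are from the table above; the allowed
// transitions and every identifier here are illustrative assumptions.
type EndpointState = "connecting" | "connected" | "degraded" | "dead";

type EndpointStatus = { hostId: string; state: EndpointState; reason?: string };

// Assumed transition table: which states may follow which.
const allowed: Record<EndpointState, readonly EndpointState[]> = {
  connecting: ["connected", "degraded", "dead"],
  connected: ["degraded", "dead"],
  degraded: ["connecting", "dead"],
  dead: ["connecting"],
};

function createEndpoint(hostId: string, emit: (s: EndpointStatus) => void) {
  let state: EndpointState = "connecting";
  emit({ hostId, state }); // the initial state is reported as a status too
  return {
    get state(): EndpointState {
      return state;
    },
    /** Move to `next`; emits a status on every accepted transition. */
    transition(next: EndpointState, reason?: string): boolean {
      if (!allowed[state].includes(next)) return false; // rejected: nothing emitted
      state = next;
      emit({ hostId, state, reason });
      return true;
    },
  };
}
```

A consumer with fewer states, such as odu's connection cell, would use a smaller transition table with the same emit-on-every-transition rule.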
Mechanism vs policy — what stays per-program
The extraction is safe only because the line between spine and soul is sharp. Three asymmetries between the two consumers are load-bearing, and the spine must parameterize around them rather than absorb them:
- Survivor semantics. kaval holds irreplaceable kernel state: live PTY fds that cannot be reconstructed; B3’s survival, adoption, and reconciliation are its whole point. odu serve holds replaceable orchestration state: the per-SHA trail is durable on disk, runs are append-only, and a lost in-flight run is a `seq`-bump rerun. B3 is not spine. Adoption, `reconcile.ts`, the schema round-trip: none of it transfers, and odu-runner’s “live-state resurrection stays out of scope” must survive the deduplication.
- Scope. kaval is a per-user machine singleton; odu serve is per-repo (many sockets, one per checkout). The pid-gate takes a scope key; neither program’s choice leaks into the mechanism.
- Lifecycle policy. odu-runner’s idle-timeout auto-serve is a legitimate mode for a CI coordinator and a catastrophic one for a PTY daemon (an idle-timeout kaval kills your terminals). The `daemonMain` skeleton exposes lifetime as a parameter (`forever` | `idleTimeout(ms)`); each program picks, neither inherits.
Same mechanism, opposite policies — which is precisely the evidence the mechanism is real. A spine that worked for only one lifetime policy or one scope would just be kaval’s internals wearing a package name.
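As a rough illustration of lifetime as a parameter, the sketch below evaluates the two policies with one piece of mechanism. `DaemonLifetime` copies the shape given in the API section of this note; `idleTracker`, its polling model, and the injectable clock are assumptions, not the skeleton's actual internals.

```typescript
// Hypothetical sketch of how a skeleton might evaluate the lifetime parameter.
type DaemonLifetime =
  | { kind: "forever" }
  | { kind: "idleTimeout"; ms: number; isIdle: () => boolean };

function idleTracker(lifetime: DaemonLifetime, now: () => number = Date.now) {
  let idleSince: number | undefined; // undefined while the daemon is busy
  return {
    /** Call periodically; returns true once the daemon may exit as idle. */
    poll(): boolean {
      if (lifetime.kind === "forever") return false; // never exits on idleness
      if (!lifetime.isIdle()) {
        idleSince = undefined; // any activity resets the idle clock
        return false;
      }
      if (idleSince === undefined) idleSince = now(); // idle period starts now
      return now() - idleSince >= lifetime.ms;
    },
  };
}
```

The policy lives entirely in the argument: a `forever` daemon and an `idleTimeout` daemon run the same `poll`, and neither program's choice appears in the mechanism.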
Where each piece lands
Two destinations, split by what the code touches:
| Piece | Destination | Why there |
|---|---|---|
| system.version / build-id handshake fragment | @kolu/surface (at S1) | Wire-adjacent: it joins isContractVersionCompatible and the unix-socket pair that #1084 already moved; every surface client/server pair wants it, daemon or not. |
| Atomic pid-gate (acquire + read sides) | @kolu/surface-daemon (born there, B1) | Runs inside the daemon process (acquire) and the supervisor (read); pure lifecycle, no wire. Both sides of one file format, one home from day one. |
| daemonMain skeleton: gate → serve → teardown, lifetime parameter | @kolu/surface-daemon (born there, B1) | The entry every surface daemon repeats; each program supplies its surface and its policy knobs. |
| Endpoint supervisor: state machine, driver ops (spawn/waitForPidGone), composed restart | a separate @kolu/surface-daemon-supervisor package (born there, B2; revised 2026-06-12) | Runs in the client process (kolu-server; the odu CLI or odu-web), and is on a different volatility axis from the daemon half (see the decision below). Shared types (DaemonExit, the gate’s file format) cross a one-directional workspace:* edge: the supervisor imports them from @kolu/surface-daemon. |
| Survivable-spawn mechanism: the INVOCATION_ID gate (under a systemd service → systemd-run --user; otherwise, macOS included → detached+unref), per-spawn unique unit names, absolute-path discipline | @kolu/surface-daemon-supervisor (the package’s default DriverOps implementation, born B2) | Host-platform volatility, not program volatility: the program supplies only values, {binPath, args, env, unitPrefix}. kolu’s localDriver.ts shrinks to that parameter bundle. |
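The `waitForPidGone` driver op named in the table is described elsewhere in this note as an ESRCH poll with a load-aware ceiling. A minimal sketch of the polling part follows; the signature, the fixed interval, and the plain timeout are assumptions, and the load-aware ceiling is not modeled.

```typescript
// Hypothetical sketch: the ESRCH poll is from the note; the option names
// and the fixed-interval timeout policy are illustrative assumptions.
async function waitForPidGone(
  pid: number,
  opts: { timeoutMs: number; intervalMs?: number },
): Promise<boolean> {
  const deadline = Date.now() + opts.timeoutMs;
  for (;;) {
    try {
      process.kill(pid, 0); // signal 0: existence check only, nothing is delivered
    } catch (err) {
      const code = (err as NodeJS.ErrnoException).code;
      if (code === "ESRCH") return true; // no such process: it is gone
      // EPERM means the pid exists but belongs to another user: still alive
    }
    if (Date.now() >= deadline) return false; // still alive at the ceiling
    await new Promise((resolve) => setTimeout(resolve, opts.intervalMs ?? 50));
  }
}
```

Returning `false` at the ceiling rather than throwing leaves the decision about what to do with a process that will not die to the composed restart sequence.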
One honest trap, inherited from kaval’s one rule: the supervisor’s package boundary IS the staleKey boundary. From B1, @kolu/surface-daemon (the daemon half) is hashed whole into kaval’s key, as a third root beside terminal-protocol (the closure test’s existing multi-root pattern: every root’s non-test files hashed, and the import walk from the daemon entry must reach exactly that set). Whole-package hashing is correct because everything in the package is part of the one daemon binary a restart loads — the serve half in the daemon process, and the durable stdio front (frontDaemonOverStdio, P2.5) in the per-link proxy reached from that binary’s --stdio dispatch — so the package’s standing invariant: only daemon-binary code (serve + front) lives here, never the supervisor. A supervisor file inside this package would flip kaval’s key on every supervisor-only edit — the over-prompting failure A2 killed, reborn.
The decision (settled 2026-06-12; recorded in the next section): the supervisor is a separate package @kolu/surface-daemon-supervisor, not a /supervisor subpath of this one, and it is born that way in B2 (revised the same day; the deferral to an S1 extraction failed its own audit, recorded below). The package boundary is the hash boundary, with nothing to configure: kaval hashes the daemon package and not the supervisor one, and the closure test’s reachable-from-daemon-entry set stays correct by construction (supervisor code is reached only from server, never from bin.ts/index.ts). The rejected alternative, a /supervisor subpath, would force default.nix’s fileFilter to carve src/daemon/** out of src/**, a subdir glob that silently mis-scopes the key when a file lands in the wrong subdir: #1034’s first row, simulated by hand where a package boundary gives it for free.
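The "package boundary = hash boundary" claim can be illustrated with a toy key. The real staleKey is a nix hash of the daemon closure; the sketch below substitutes a SHA-256 over an in-memory file map, and the directory names, the `.test.ts` convention, and the `staleKey` function itself are assumptions made only to show the scoping behaviour.

```typescript
import { createHash } from "node:crypto";

// Hypothetical stand-in for the nix hash: every root's non-test files are
// hashed, and nothing outside the roots can influence the key.
type Files = Record<string, string>; // path -> contents

function staleKey(files: Files, roots: readonly string[]): string {
  const h = createHash("sha256");
  const sorted = Object.entries(files).sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0));
  for (const [path, contents] of sorted) {
    const inRoot = roots.some((root) => path.startsWith(root + "/"));
    const isTest = path.endsWith(".test.ts");
    if (!inRoot || isTest) continue; // only each root's non-test files count
    h.update(path).update("\0").update(contents).update("\0");
  }
  return h.digest("hex");
}

// Assumed layout: three roots, with the supervisor package outside them.
const roots = ["packages/kaval", "packages/terminal-protocol", "packages/surface-daemon"];
const tree: Files = {
  "packages/kaval/src/bin.ts": "entry",
  "packages/surface-daemon/src/pidGate.ts": "gate v1",
  "packages/surface-daemon-supervisor/src/endpoint.ts": "endpoint v1",
};

const before = staleKey(tree, roots);
const afterSupervisorEdit = staleKey(
  { ...tree, "packages/surface-daemon-supervisor/src/endpoint.ts": "endpoint v2" },
  roots,
);
const afterDaemonEdit = staleKey(
  { ...tree, "packages/surface-daemon/src/pidGate.ts": "gate v2" },
  roots,
);
console.log(before === afterSupervisorEdit, before === afterDaemonEdit);
```

No filter rule mentions the supervisor: it is invisible to the key because it sits outside every root, which is the property a subdir glob would have to simulate.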
A separate supervisor package, not a /supervisor subpath
Recorded as a decision so it isn’t relitigated. The two halves — daemon (this package) and supervisor (endpoint/waitForPidGone/restart, its own package from B2) — could ship as one package with two entries, or as two packages. Two packages, because:
- They are on different volatility axes by construction. The entire staleKey design exists to guarantee a supervisor change does not flip the daemon’s key. Two things that must change independently are two modules (Parnas/Lowy); a package is the strongest module boundary there is. The subpath is a weaker encoding of that same boundary: a `fileFilter` glob simulating what the package gives for free, and the glob is the fragile part.
- The hash boundary falls out for free (the trap paragraph above): package boundary = hash boundary, no subdir glob to drift.
- Shared types ride a normal one-directional edge: the supervisor imports `DaemonExit` and the gate’s file-format primitives (`gatePid`/`isHolderLive`) from `@kolu/surface-daemon`. No circular dependency, no third “common” package.
Reasons that were considered and rejected: “it matches how @kolu/surface ships server+client halves” — circular; surface splits for the same shared-types reason, so this isn’t an independent argument. “surface-daemon graduates as a unit” — false; surface-daemon is spine/electricity that stays in the monorepo (only kaval graduates, the drishti/odu path). “the subpath skips the new-package checklist” — the weakest reason, and it’s paid for with the fragile glob. The one thing that could flip the decision: if B2 reveals the daemon and supervisor halves can’t be cleanly separated by imports (a circular type dependency) — but B1’s gate-format split (gatePid/isHolderLive as daemon primitives the supervisor composes, not a supervisor reader living daemon-side) already keeps that seam clean, and a circular dep would mean the spine/soul line itself is wrong, a louder alarm than packaging.
Timing, revised 2026-06-12 (B2 planning): the original sequencing had the supervisor gestate in server/src/ptyHost/ through B2/B3 and extract here at S1. That deferral failed a three-front audit. “Wait for the API to settle” is a backwards-compat argument, and a private workspace package with one downstream — edited in the same PR — carries no compat obligation; B3 reshaping the package costs exactly what B3 reshaping a server directory costs. “Wait for the second consumer” lost to the precedent sitting one section up: the daemon half was born as a package in B1 with one consumer, and the supervisor’s parameter surface is already designed against both consumers in this note. And the risk, quantified: ~five scaffolding files plus one default.nix fileset line, with zero hash surface — the supervisor package is deliberately not a staleKey root, so there is no closure-test or build-id wiring to get wrong (that asymmetry is this decision’s whole point, and it cuts in favor of early birth). What early birth buys: the spine/soul line enforced by the package boundary from the first commit — a zero-kolu-*-deps allowlist, with localDriver.ts physically outside — instead of by reviewer vigilance; the #1275 lesson (its package was extracted mid-PR under review pressure) applied in advance. So B2 births @kolu/surface-daemon-supervisor directly, and S1 shrinks to one move: the handshake fragment into @kolu/surface.
The daemon half’s public API — by example
What B1 actually exports: two modules, small enough to read whole. The shape is normative — it encodes the mechanism/policy line, with every program-specific choice arriving as an argument — while the names may shift in B1’s review.
```ts
// @kolu/surface-daemon: the daemon half (all of it, B1)

/** Structural, so the package carries zero kolu-* deps (the kaval pattern). */
export type LogFn = (msg: string, fields?: object) => void;
export type Logger = { debug: LogFn; info: LogFn; warn: LogFn; error: LogFn };

// ── pidGate.ts: the daemon side + the file format both sides share ─────
export type GateResult =
  | { kind: "acquired"; release: () => void } // we hold it; release at teardown
  | { kind: "held"; pid: number } // a LIVE process holds it
  | { kind: "dir-not-private"; dir: string }; // refuse: gate dir isn't owner-only

/** Atomic: validate the gate dir is owner-only, write pid to a temp file,
 * link(2) into place; on EEXIST read the gate, liveness-probe, steal if stale. */
export function acquirePidGate(gatePath: string): GateResult;

/** The gate's file format, single-sourced as two daemon-running primitives:
 * the pid parse and the liveness probe. B2's supervisor COMPOSES these where it
 * lives (`isHolderLive(gatePid(path))`), so no supervisor reader sits in this
 * daemon-hashed package. */
export function gatePid(gatePath: string): number | undefined;
export function isHolderLive(pid: number): boolean;

// ── daemonMain.ts: the skeleton, gate → serve → teardown ───────────────
export type DaemonLifetime =
  | { kind: "forever" } // kaval: an idle PTY daemon still holds your terminals
  | { kind: "idleTimeout"; ms: number; isIdle: () => boolean }; // odu serve: a quiet coordinator may exit

export type DaemonExit =
  | { kind: "already-running"; pid: number } // single-instance: this is SUCCESS, the caller exits 0
  | { kind: "shutdown"; reason: "sigterm" | "abort" | "idle" };

export function daemonMain(spec: {
  gatePath: string; // the scope key: per-user for kaval, per-repo for odu serve
  socketPath: string;
  router: SurfaceRouter; // any @kolu/surface router, served over the unix-socket listener PR #1084 moved there
  lifetime: DaemonLifetime;
  log: Logger; // one structured boot line; every transition logged
  signal?: AbortSignal; // tests drive teardown without real signals
}): Promise<DaemonExit>; // never calls process.exit; the bin maps DaemonExit to a code
```
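The gate signatures leave the bodies open. One way they might be implemented is sketched below, following the doc comment (owner-only directory check, temp file, link(2), liveness probe on EEXIST, steal if stale). The temp-file naming, the single retry, and the treatment of EPERM as "alive" are assumptions, and the sketch does not address two contenders racing to steal the same stale gate.

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

type GateResult =
  | { kind: "acquired"; release: () => void }
  | { kind: "held"; pid: number }
  | { kind: "dir-not-private"; dir: string };

/** Parse the pid recorded in the gate file, if any. */
function gatePid(gatePath: string): number | undefined {
  try {
    const n = Number.parseInt(fs.readFileSync(gatePath, "utf8").trim(), 10);
    return Number.isInteger(n) && n > 0 ? n : undefined;
  } catch {
    return undefined; // missing or unreadable gate
  }
}

/** Liveness probe: signal 0 checks existence without delivering anything. */
function isHolderLive(pid: number): boolean {
  try {
    process.kill(pid, 0);
    return true;
  } catch (err) {
    return (err as NodeJS.ErrnoException).code === "EPERM"; // exists, not ours
  }
}

function acquirePidGate(gatePath: string): GateResult {
  const dir = path.dirname(gatePath);
  if ((fs.statSync(dir).mode & 0o077) !== 0) return { kind: "dir-not-private", dir };
  const tmp = `${gatePath}.${process.pid}.tmp`; // assumed naming scheme
  fs.writeFileSync(tmp, `${process.pid}\n`);
  try {
    for (let attempt = 0; attempt < 2; attempt++) {
      try {
        fs.linkSync(tmp, gatePath); // atomic: fails with EEXIST if a gate exists
        return { kind: "acquired", release: () => fs.rmSync(gatePath, { force: true }) };
      } catch (err) {
        if ((err as NodeJS.ErrnoException).code !== "EEXIST") throw err;
        const pid = gatePid(gatePath);
        if (pid !== undefined && isHolderLive(pid)) return { kind: "held", pid };
        fs.rmSync(gatePath, { force: true }); // stale: unlink and retry once
      }
    }
    throw new Error(`could not acquire ${gatePath}`);
  } finally {
    fs.rmSync(tmp, { force: true }); // the hard link, if made, survives this
  }
}
```

Using `link` rather than `rename` is what makes the acquire step refuse to overwrite: `rename` would silently replace a live holder's gate, while `link` fails with EEXIST and forces the liveness check.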
The two consumers, side by side — same mechanism, opposite policies, all arriving as arguments:
```ts
// packages/kaval/src/bin.ts: kaval's entire entry (B1)
const exit = await daemonMain({
  gatePath: join(kavalRuntimeDir(), "kaval.pid"),
  socketPath: cli.socket ?? getPtyHostSocketPath(undefined, "kaval"),
  router: servePtyHost({ log, rcDir: kavalRcDir() }).router, // B0's policy-free serving
  lifetime: { kind: "forever" },
  log,
}); // "held by a live daemon" and "shutdown" are both clean exits
```

```ts
// odu serve (S2, projected): the second tenant, by substitution only
await daemonMain({
  gatePath: join(repoRoot, ".ci/odu.pid"), // per-repo scope
  socketPath: join(repoRoot, ".ci/odu.sock"),
  router: oduRunnerRouter,
  lifetime: { kind: "idleTimeout", ms: 30 * 60_000, isIdle: () => runsInFlight() === 0 },
  log,
});
```
As load-bearing as what’s exported is what is deliberately absent: no env application (B0 removed the daemon’s env role), no spawn/respawn or waitForPidGone (supervisor half — @kolu/surface-daemon-supervisor, its own package from B2), no survival, adoption, or reconciliation (kaval B3’s soul, never spine), and no process.exit inside the mechanism (the gate-race tests run it in-process).
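Because the mechanism never calls `process.exit`, each bin has to translate a `DaemonExit` into a code itself. The sketch below shows one plausible shape for that translation. `DaemonExit` copies the type above; `exitCodeFor`, `runBin`, the log wording, and the choice of exit code 1 for a thrown error are assumptions.

```typescript
type DaemonExit =
  | { kind: "already-running"; pid: number }
  | { kind: "shutdown"; reason: "sigterm" | "abort" | "idle" };

// Hypothetical mapping: the note says both outcomes are clean exits.
function exitCodeFor(exit: DaemonExit, log: (msg: string) => void): number {
  switch (exit.kind) {
    case "already-running":
      log(`daemon already running (pid ${exit.pid})`);
      return 0; // single-instance: finding a live holder is success
    case "shutdown":
      log(`daemon shut down (${exit.reason})`);
      return 0;
  }
}

// Assumed wrapper: only an unexpected throw produces a non-zero code.
async function runBin(
  main: () => Promise<DaemonExit>,
  log: (msg: string) => void,
): Promise<number> {
  try {
    return exitCodeFor(await main(), log);
  } catch (err) {
    log(`daemon failed: ${String(err)}`);
    return 1;
  }
}
```

A bin would then end with something like `process.exitCode = await runBin(() => daemonMain(spec), console.error)`, which keeps the mechanism runnable in-process by the gate-race tests.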
Phases
The trigger discipline matters more than the speed: each half is a package from birth (daemon: B1; supervisor: B2 — revised 2026-06-12 from a post-soak S1 extraction, the timing paragraph above). What soaks in production before the second tenant arrives is the mechanism itself (B2’s recycle, B3’s restart), not its directory location; odu serve (S2) then consumes a boundary that already exists.
- S0 · the daemon half is born as the package (kaval B1). Shipped in #1301 (2026-06-12). kaval B1 created `@kolu/surface-daemon` itself, holding the daemon half: `acquirePidGate` plus the gate’s file-format primitives (`gatePid`/`isHolderLive`) the supervisor composes, and the `daemonMain` skeleton (gate → serve → SIGTERM teardown; lifetime `forever | idleTimeout(ms)`), with kaval’s entry a ~20-line composition over it. Rationale: review isolation (the mechanism is reviewed once, as a package) and one home for the gate’s file format. The package is hashed whole into kaval’s staleKey (third root), so its standing invariant is that only daemon-running code lives here: no supervisor reader; the supervisor composes the daemon primitives where it lives. kaval’s B1 e2e (the contract corpus over a real daemon’s socket + the gate-race choreography with real processes) is the spine’s first soak harness.
- S0.5 · the supervisor half is born as its own package (kaval B2). Shipped in #1310 (2026-06-12). Revised 2026-06-12 (was: gestate in `server/src/ptyHost/`, extract at S1; the timing paragraph above records why the deferral fell). B2 births `@kolu/surface-daemon-supervisor` holding endpoint states · driver ops · `waitForPidGone` · the composed restart · the survivable-spawn default driver (the `INVOCATION_ID` gate, unique unit names, detached+unref off-systemd), parameterized over driver ops and the surface client; kaval’s values stay in `packages/server` (`localDriver.ts`, soul, now a parameter bundle: binary · dev-flag filter · `--setenv` values · paths · unit prefix). Not a staleKey root: zero hash surface by construction. Like its daemon sibling, the package carries a README with the mechanism/soul line, the API table, and a usage example (kolu-server’s composition, `odu serve`’s projected beside it). B2’s recycle-on-every-deploy then soaks the restart race in production with zero sessions at stake. B3.2 (#1337) then extended the package after birth: `serializeRestart` (coalesce concurrent restart triggers onto one in-flight recycle) and the transient `restarting` state held across the recycle (`holdRestarting`) landed in `@kolu/surface-daemon-supervisor`, the first supervisor mechanism added post-birth and a live proof the boundary holds: kaval’s staleKey stayed bit-identical across it (a supervisor-only change, correctly invisible to the daemon hash). Exit: a daemon-package edit flips kaval’s staleKey while a supervisor-package edit does not, by package boundary and not by a glob.
- S0.7 · the durable stdio front lands (kaval-sessions P2.5). Shipped in #1374 (2026-06-15). kaval-sessions P2.5 upstreamed kaval’s `kaval --stdio` durable-fronting bridge into the package as `frontDaemonOverStdio`, the durable counterpart to `@kolu/surface`’s `serveOverStdio`: adopt-or-spawn the gate-held daemon and raw-byte-relay an ssh-stdio link onto its socket, so a remote session survives the link (`dtach`/`abduco` for any surface daemon). Plus `reExecAsDetachedDaemon`, the same-binary spawn strategy that carries the single-process `node --import` re-exec invariant (so `SIGTERM` reaches the daemon, not a swallowing `tsx` fork). kaval’s `--stdio` shrank to a thin composition (resolve the socket path + supply the daemon-spawn); the front is reached from `bin.ts`’s `--stdio` dispatch, so it joins the package’s hashed daemon-binary closure with no staleKey mis-scope, and the standing invariant broadens from daemon-process code to daemon-binary code (serve + front). Exit: the durable remote transport is a named library primitive, not kaval-private, before P3’s remote pair (kaval + kolu-watcher) builds on it.
- S1 · the post-B handshake move. One move left (the supervisor extraction moved into B2, where the package is born): the `system.version`/build-id handshake fragment into `@kolu/surface`, joining the unix-socket pair and `isContractVersionCompatible` that #1084 already moved. Exit: kaval and kolu-server serve and check the same handshake fragment, kaval’s e2e green with zero behavior change.
- S2 · odu serve, the second tenant. odu-runner R1 builds `odu serve` on the spine (per-repo scope key, idle-timeout lifetime as one supported mode); R3 reuses the supervisor half for the home-manager unit and crash semantics. The electricity bar (domain-agnostic, hides hard volatility, second consumer) passes by construction rather than by argument. Exit: odu-runner R1’s “attach with nothing running” ships without odu defining a pid-gate, an entry sequence, or a handshake of its own.
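The coalescing that S0.5 attributes to `serializeRestart` (concurrent restart triggers sharing one in-flight recycle) can be sketched in a few lines. This shows the idea only: the signature below is an assumption and not the package's actual API, and the `restarting` state and `holdRestarting` are not modeled.

```typescript
// Hypothetical sketch: concurrent callers share one in-flight restart;
// a trigger that arrives after it settles starts a fresh one.
function serializeRestart<T>(restart: () => Promise<T>): () => Promise<T> {
  let inFlight: Promise<T> | undefined;
  return () => {
    if (inFlight) return inFlight; // coalesce onto the running recycle
    inFlight = restart().finally(() => {
      inFlight = undefined; // cleared on success and on failure
    });
    return inFlight;
  };
}
```

Clearing the slot in `finally` matters: a failed recycle must not leave later triggers permanently attached to a rejected promise.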