
@kolu/surface-daemon — one spine for kaval and odu serve

feature · seedling · accepted

kaval (the PTY daemon) and odu serve (the long-lived CI coordinator) need the identical lifecycle machinery — pid-gated entry, a unix socket that outlives clients, a contract handshake on every connect, an endpoint state machine, spawn/respawn drivers. This note names that shared spine, says what is mechanism (extract) versus policy (keep per-program), and sequences it: the daemon half is born as the package in kaval B1; the supervisor half is born as its own `@kolu/surface-daemon-supervisor` package in kaval B2 (package boundary = staleKey boundary = spine/soul line); S1 moves the handshake fragment into `@kolu/surface`; odu serve consumes the whole as the second tenant (S2).

Two plans of record arrived at the same machinery from opposite ends in the same week: kaval B1/B2 designs a daemon entry, single-instance gate, handshake, and client-side supervisor for the PTY daemon; odu-runner R1/R3 sketches the same four things for the CI coordinator. This note is the deduplication, named before either hand-rolls a second copy.

Status: accepted. The daemon half shipped as @kolu/surface-daemon in kaval B1 (#1301); the supervisor half is born as @kolu/surface-daemon-supervisor in kaval B2 (revised 2026-06-12; the pty-daemon brief carries both); the durable stdio front frontDaemonOverStdio lands in kaval-sessions P2.5 (#1374) · maturity: seedling · first consumer: kaval (B1/B2) · second consumer: odu serve (R1/R3), the second tenant that proves the electricity bar by construction (S2)

[Diagram: three layers, top to bottom]
The souls (policy, never shared):
  kaval: holds live PTY fds; survival is the point (B3)
  odu serve: holds replaceable runs; trail + seq-bump rerun
The spine (two packages):
  daemon half, @kolu/surface-daemon (born B1): pid-gate · daemonMain skeleton
  supervisor half, @kolu/surface-daemon-supervisor (born B2): endpoint states · drivers · waitForPidGone · restart compose
@kolu/surface, the wire (shipped):
  unix-socket pair · getRuntimeSocketPath · isContractVersionCompatible (PR #1084)
  handshake fragment: system.version · build-id
Edge labels: "consume mechanism, keep policy" (souls to spine); "serves and dials the same wire" (spine to @kolu/surface)
Mechanism below the line, policy above it. Both daemons ride the same spine — the daemon half runs inside each daemon process, the supervisor half runs inside whatever spawns and watches it (kolu-server for kaval; the odu CLI / odu-web for odu serve). The wire-adjacent handshake fragment lands in @kolu/surface itself, joining what PR #1084 already moved there.

Two daemons, one spine

The correspondence is one-to-one, and kaval’s column is the more battle-hardened — every row carries a #1034/#1275 scar and a designed answer:

| Lifecycle concern | kaval (B1/B2: designed, hazard-annotated) | odu serve (R1/R3: sketched) |
| --- | --- | --- |
| Process entry | daemonMain: pid-gate → serve loop → SIGTERM teardown | “odu serve owns the socket”: same sequence, unnamed |
| Single instance | Atomic pid-gate: write-temp + link(2), liveness-probe on EEXIST, stale-unlink-retry | Implied, undesigned |
| Socket | $XDG_RUNTIME_DIR/kaval/kaval.sock; app-name parameterized in B0 | .ci/odu.sock, per-repo |
| Skew safety | Contract handshake on every connect; never an import-time throw | Needed identically: a pinned nix run client against a newer resident server |
| Supervision | endpoint.ts: connecting → connected → degraded → dead, status emitted on every transition, keyed by hostId | odu’s connection cell (copying → connecting → connected): same idea, fewer states |
| Spawn/respawn | systemd-run --user + per-spawn unique units / macOS detached+unref; waitForPidGone (ESRCH poll, load-aware ceiling) | R3’s home-manager mode needs exactly this |
| Restart | One composed sequence, steps non-optional in the type, serialized | R3’s “kill the server mid-run — nothing wedges” |
| Honest death | Degraded state visibly distinct from “you have no terminals” | “errored in the trail, restart serves history” |
| Staleness | staleKey = nix hash of the daemon closure; “what would a restart gain?” | Unaddressed, but a resident odu serve under odu-web is a daemon that can fall a build behind: same question |

The asymmetry in maturity is the sequencing argument: kaval’s spine is designed against eight production failures already paid for (the impossible-by-construction table in pty-daemon), and B2’s staged-prod gate will soak the exact restart race (#1034) that odu serve would otherwise rediscover. The spine should be born there, not invented twice.
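The Supervision row of the table can be made concrete. What follows is a minimal sketch, not the B2 API: the four state names, the hostId key, and the rule that a status is emitted on every transition come from the table, while the `Endpoint` class, the `EndpointStatus` shape, and the allowed-transition map are hypothetical choices made for illustration.

```typescript
// Hypothetical sketch: the transition table below is an assumption, not B2's design.
type EndpointState = "connecting" | "connected" | "degraded" | "dead";

type EndpointStatus = { hostId: string; from: EndpointState; to: EndpointState };

// Assumed legal transitions; "dead" is terminal, a restart builds a new Endpoint.
const allowed: Record<EndpointState, readonly EndpointState[]> = {
  connecting: ["connected", "degraded", "dead"],
  connected: ["degraded", "dead"],
  degraded: ["connecting", "dead"],
  dead: [],
};

class Endpoint {
  private state: EndpointState = "connecting";
  private readonly hostId: string;
  private readonly emit: (status: EndpointStatus) => void;

  constructor(hostId: string, emit: (status: EndpointStatus) => void) {
    this.hostId = hostId;
    this.emit = emit;
  }

  get current(): EndpointState {
    return this.state;
  }

  /** Applies the transition if it is legal and emits a status; returns false otherwise. */
  transition(to: EndpointState): boolean {
    if (!allowed[this.state].includes(to)) return false;
    const from = this.state;
    this.state = to;
    this.emit({ hostId: this.hostId, from, to });
    return true;
  }
}
```

Keeping the emit call inside `transition` is what makes "status emitted on every transition" a property of the mechanism rather than a convention each caller has to remember.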

Mechanism vs policy — what stays per-program

The extraction is safe only because the line between spine and soul is sharp. Three asymmetries between the two consumers are load-bearing, and the spine must parameterize around them rather than absorb them:

Same mechanism, opposite policies — which is precisely the evidence the mechanism is real. A spine that worked for only one lifetime policy or one scope would just be kaval’s internals wearing a package name.

Where each piece lands

Two destinations, split by what the code touches:

| Piece | Destination | Why there |
| --- | --- | --- |
| system.version / build-id handshake fragment | @kolu/surface (at S1) | Wire-adjacent: it joins isContractVersionCompatible and the unix-socket pair that #1084 already moved; every surface client/server pair wants it, daemon or not. |
| Atomic pid-gate (acquire + read sides) | @kolu/surface-daemon (born there, B1) | Runs inside the daemon process (acquire) and the supervisor (read); pure lifecycle, no wire. Both sides of one file format, one home from day one. |
| daemonMain skeleton: gate → serve → teardown, lifetime parameter | @kolu/surface-daemon (born there, B1) | The entry every surface daemon repeats; each program supplies its surface and its policy knobs. |
| Endpoint supervisor: state machine, driver ops (spawn/waitForPidGone), composed restart | a separate @kolu/surface-daemon-supervisor package (born there, B2; revised 2026-06-12) | Runs in the client process (kolu-server; the odu CLI or odu-web), and is on a different volatility axis from the daemon half; see the decision below. Shared types (DaemonExit, the gate’s file format) cross a one-directional workspace:* edge: the supervisor imports them from @kolu/surface-daemon. |
| Survivable-spawn mechanism: the INVOCATION_ID gate (under a systemd service → systemd-run --user; otherwise, macOS included → detached+unref), per-spawn unique unit names, absolute-path discipline | @kolu/surface-daemon-supervisor (the package’s default DriverOps implementation, born B2) | Host-platform volatility, not program volatility: the program supplies only values, {binPath, args, env, unitPrefix}. kolu’s localDriver.ts shrinks to that parameter bundle. |
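The survivable-spawn row describes a decision that fits in one pure function. The sketch below is an illustration under stated assumptions: the four parameter names ({binPath, args, env, unitPrefix}) and the INVOCATION_ID gate come from the table, while `planSpawn`, `SpawnPlan`, and the exact systemd-run argument layout are hypothetical and not the package's DriverOps implementation. Resolving the systemd-run binary to an absolute path is left out.

```typescript
// Only the four field names come from the note; everything else is illustrative.
type SpawnParams = {
  binPath: string;
  args: string[];
  env: Record<string, string>;
  unitPrefix: string;
};

type SpawnPlan =
  | { kind: "systemd-run"; command: string; argv: string[] }
  | { kind: "detached"; command: string; argv: string[] };

function planSpawn(
  params: SpawnParams,
  hostEnv: Record<string, string | undefined>,
  uniqueSuffix: string,
): SpawnPlan {
  // Absolute-path discipline: a relative path would resolve against whatever
  // working directory the spawner happens to have.
  if (!params.binPath.startsWith("/")) {
    throw new Error(`binPath must be absolute: ${params.binPath}`);
  }
  // systemd sets INVOCATION_ID for processes it starts as part of a unit, so its
  // presence means a plain child would be torn down together with our own unit.
  if (hostEnv["INVOCATION_ID"]) {
    const unit = `${params.unitPrefix}-${uniqueSuffix}`; // unique per spawn
    const setenv = Object.entries(params.env).map(([k, v]) => `--setenv=${k}=${v}`);
    return {
      kind: "systemd-run",
      command: "systemd-run",
      argv: ["--user", `--unit=${unit}`, ...setenv, "--", params.binPath, ...params.args],
    };
  }
  // Otherwise (macOS included): the caller spawns detached and unrefs the child.
  return { kind: "detached", command: params.binPath, argv: params.args };
}
```

For the detached plan, the caller would pass `command` and `argv` to `child_process.spawn` with `{ detached: true, stdio: "ignore", env }` and then call `unref()` on the child. Returning a plan rather than spawning keeps the platform decision testable without starting a process.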

One honest trap, inherited from kaval’s one rule: the supervisor’s package boundary IS the staleKey boundary. From B1, @kolu/surface-daemon (the daemon half) is hashed whole into kaval’s key, as a third root beside terminal-protocol (the closure test’s existing multi-root pattern: every root’s non-test files hashed, and the import walk from the daemon entry must reach exactly that set). Whole-package hashing is correct because everything in the package is part of the one daemon binary a restart loads — the serve half in the daemon process, and the durable stdio front (frontDaemonOverStdio, P2.5) in the per-link proxy reached from that binary’s --stdio dispatch — so the package’s standing invariant: only daemon-binary code (serve + front) lives here, never the supervisor. A supervisor file inside this package would flip kaval’s key on every supervisor-only edit — the over-prompting failure A2 killed, reborn.

The decision (settled 2026-06-12; see the decision): the supervisor is a separate package @kolu/surface-daemon-supervisor, not a /supervisor subpath of this one — and it is born that way in B2 (revised the same day; the deferral to an S1 extraction failed its own audit, recorded below). The package boundary is the hash boundary, with nothing to configure — kaval hashes the daemon package and not the supervisor one, and the closure test’s reachable-from-daemon-entry set stays correct by construction (supervisor code is reached only from server, never from bin.ts/index.ts). The rejected alternative — a /supervisor subpath — would force default.nix’s fileFilter to carve src/daemon/** out of src/**, a subdir glob that silently mis-scopes the key when a file lands in the wrong subdir: #1034’s first row, simulated by hand where a package boundary gives it for free.

A separate supervisor package, not a /supervisor subpath

Recorded as a decision so it isn’t relitigated. The two halves — daemon (this package) and supervisor (endpoint/waitForPidGone/restart, its own package from B2) — could ship as one package with two entries, or as two packages. Two packages, because:

Reasons that were considered and rejected: “it matches how @kolu/surface ships server+client halves” — circular; surface splits for the same shared-types reason, so this isn’t an independent argument. “surface-daemon graduates as a unit” — false; surface-daemon is spine/electricity that stays in the monorepo (only kaval graduates, the drishti/odu path). “the subpath skips the new-package checklist” — the weakest reason, and it’s paid for with the fragile glob. The one thing that could flip the decision: if B2 reveals the daemon and supervisor halves can’t be cleanly separated by imports (a circular type dependency) — but B1’s gate-format split (gatePid/isHolderLive as daemon primitives the supervisor composes, not a supervisor reader living daemon-side) already keeps that seam clean, and a circular dep would mean the spine/soul line itself is wrong, a louder alarm than packaging.
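The seam mentioned above, where the supervisor composes the two daemon-side primitives instead of owning a gate reader, can be sketched as follows. This is a hedged illustration: `gatePid` and `isHolderLive` are the names from B1's gate-format split, injected here as parameters so the block stands alone, while `liveHolder`, the `GatePrimitives` bundle, and the option names are hypothetical. The ceiling is a plain number in this sketch; the load-aware part is not modeled.

```typescript
// The two primitives would be imported from @kolu/surface-daemon; they are
// injected here so the sketch runs on its own.
type GatePrimitives = {
  gatePid: (gatePath: string) => number | undefined;
  isHolderLive: (pid: number) => boolean;
};

/** Returns the pid of a live gate holder, or undefined if the gate is absent or stale. */
function liveHolder(gatePath: string, gate: GatePrimitives): number | undefined {
  const pid = gate.gatePid(gatePath);
  return pid !== undefined && gate.isHolderLive(pid) ? pid : undefined;
}

/** Polls until the process is gone; resolves false if the ceiling elapses first. */
async function waitForPidGone(
  pid: number,
  isHolderLive: (pid: number) => boolean,
  opts: { ceilingMs: number; pollMs?: number },
): Promise<boolean> {
  const pollMs = opts.pollMs ?? 50;
  const deadline = Date.now() + opts.ceilingMs;
  while (isHolderLive(pid)) {
    if (Date.now() >= deadline) return false;
    await new Promise<void>((resolve) => setTimeout(resolve, pollMs));
  }
  return true;
}
```

Because the supervisor only calls the primitives, the gate's file format stays single-sourced in the daemon package and no supervisor file needs to live inside the hashed set.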

Timing, revised 2026-06-12 (B2 planning): the original sequencing had the supervisor gestate in server/src/ptyHost/ through B2/B3 and extract here at S1. That deferral failed a three-front audit. “Wait for the API to settle” is a backwards-compat argument, and a private workspace package with one downstream — edited in the same PR — carries no compat obligation; B3 reshaping the package costs exactly what B3 reshaping a server directory costs. “Wait for the second consumer” lost to the precedent sitting one section up: the daemon half was born as a package in B1 with one consumer, and the supervisor’s parameter surface is already designed against both consumers in this note. And the risk, quantified: ~five scaffolding files plus one default.nix fileset line, with zero hash surface — the supervisor package is deliberately not a staleKey root, so there is no closure-test or build-id wiring to get wrong (that asymmetry is this decision’s whole point, and it cuts in favor of early birth). What early birth buys: the spine/soul line enforced by the package boundary from the first commit — a zero-kolu-*-deps allowlist, with localDriver.ts physically outside — instead of by reviewer vigilance; the #1275 lesson (its package was extracted mid-PR under review pressure) applied in advance. So B2 births @kolu/surface-daemon-supervisor directly, and S1 shrinks to one move: the handshake fragment into @kolu/surface.

The daemon half’s public API — by example

What B1 actually exports: two modules, small enough to read whole. The shape is normative — it encodes the mechanism/policy line, with every program-specific choice arriving as an argument — while the names may shift in B1’s review.

// @kolu/surface-daemon — the daemon half (all of it, B1)

/** Structural, so the package carries zero kolu-* deps (the kaval pattern). */
export type LogFn = (msg: string, fields?: object) => void;
export type Logger = { debug: LogFn; info: LogFn; warn: LogFn; error: LogFn };

// ── pidGate.ts — the daemon side + the file format both sides share ─────
export type GateResult =
  | { kind: "acquired"; release: () => void }   // we hold it; release at teardown
  | { kind: "held"; pid: number }               // a LIVE process holds it
  | { kind: "dir-not-private"; dir: string };   // refuse: gate dir isn't owner-only
/** Atomic: validate the gate dir is owner-only, write pid to a temp file,
 *  link(2) into place; on EEXIST read the gate, liveness-probe, steal if stale. */
export function acquirePidGate(gatePath: string): GateResult;
/** The gate's file format, single-sourced as two daemon-running primitives —
 *  the pid parse and the liveness probe. B2's supervisor COMPOSES these where it
 *  lives (`isHolderLive(gatePid(path))`), so no supervisor reader sits in this
 *  daemon-hashed package. */
export function gatePid(gatePath: string): number | undefined;
export function isHolderLive(pid: number): boolean;

// ── daemonMain.ts — the skeleton: gate → serve → teardown ───────────────
export type DaemonLifetime =
  | { kind: "forever" }                       // kaval: an idle PTY daemon still holds your terminals
  | { kind: "idleTimeout"; ms: number; isIdle: () => boolean }; // odu serve: a quiet coordinator may exit
export type DaemonExit =
  | { kind: "already-running"; pid: number }  // single-instance: this is SUCCESS, the caller exits 0
  | { kind: "shutdown"; reason: "sigterm" | "abort" | "idle" };
export function daemonMain(spec: {
  gatePath: string;       // the scope key: per-user for kaval, per-repo for odu serve
  socketPath: string;
  router: SurfaceRouter;  // any @kolu/surface router — served over the unix-socket listener PR #1084 moved there
  lifetime: DaemonLifetime;
  log: Logger;            // one structured boot line; every transition logged
  signal?: AbortSignal;   // tests drive teardown without real signals
}): Promise<DaemonExit>;  // never calls process.exit — the bin maps DaemonExit to a code
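To show that the gate contract above is implementable with nothing beyond Node's standard library, here is one possible body for the three gate functions. It is a sketch under assumptions, not B1's code: the temp-file naming, the three-attempt retry bound, and the decision to treat an unparsable gate as stale are illustrative choices. It also does not close the window where two racers both judge the same gate stale and one unlinks the other's fresh link, which a production version would have to address.

```typescript
import { linkSync, mkdtempSync, readFileSync, rmSync, statSync, unlinkSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { dirname, join } from "node:path";

type GateResult =
  | { kind: "acquired"; release: () => void }
  | { kind: "held"; pid: number }
  | { kind: "dir-not-private"; dir: string };

const errCode = (err: unknown): string | undefined => (err as { code?: string }).code;

function isHolderLive(pid: number): boolean {
  try {
    process.kill(pid, 0); // signal 0 probes existence without delivering a signal
    return true;
  } catch (err) {
    return errCode(err) === "EPERM"; // exists but owned by someone else; ESRCH means gone
  }
}

function gatePid(gatePath: string): number | undefined {
  try {
    const pid = Number.parseInt(readFileSync(gatePath, "utf8").trim(), 10);
    return Number.isInteger(pid) && pid > 0 ? pid : undefined;
  } catch {
    return undefined; // no gate file, or unreadable
  }
}

function acquirePidGate(gatePath: string): GateResult {
  const dir = dirname(gatePath);
  if ((statSync(dir).mode & 0o077) !== 0) return { kind: "dir-not-private", dir };
  const tmp = `${gatePath}.${process.pid}.tmp`; // assumed naming
  writeFileSync(tmp, `${process.pid}\n`);
  try {
    for (let attempt = 0; attempt < 3; attempt++) {
      try {
        linkSync(tmp, gatePath); // link(2) is atomic and fails with EEXIST if the gate exists
        return { kind: "acquired", release: () => unlinkSync(gatePath) };
      } catch (err) {
        if (errCode(err) !== "EEXIST") throw err;
        const pid = gatePid(gatePath);
        if (pid !== undefined && isHolderLive(pid)) return { kind: "held", pid };
        try { unlinkSync(gatePath); } catch { /* another racer unlinked first */ }
      }
    }
    throw new Error(`could not acquire ${gatePath} after retries`);
  } finally {
    unlinkSync(tmp); // the gate, if linked, survives as the second hard link
  }
}

// Usage: mkdtemp creates its directory owner-only, so the privacy check passes.
const demoDir = mkdtempSync(join(tmpdir(), "gate-demo-"));
const demoGate = join(demoDir, "demo.pid");
const first = acquirePidGate(demoGate);  // acquired
const second = acquirePidGate(demoGate); // held: this process is the live holder
```

Writing the pid to a temp file first means the gate never exists in a half-written state: a reader either sees no gate or a complete pid.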

The two consumers, side by side — same mechanism, opposite policies, all arriving as arguments:

// packages/kaval/src/bin.ts — kaval's entire entry (B1)
const exit = await daemonMain({
  gatePath:   join(kavalRuntimeDir(), "kaval.pid"),
  socketPath: cli.socket ?? getPtyHostSocketPath(undefined, "kaval"),
  router:     servePtyHost({ log, rcDir: kavalRcDir() }).router, // B0's policy-free serving
  lifetime:   { kind: "forever" },
  log,
}); // "held by a live daemon" and "shutdown" are both clean exits

// odu serve (S2, projected) — the second tenant, by substitution only
await daemonMain({
  gatePath:   join(repoRoot, ".ci/odu.pid"),   // per-repo scope
  socketPath: join(repoRoot, ".ci/odu.sock"),
  router:     oduRunnerRouter,
  lifetime:   { kind: "idleTimeout", ms: 30 * 60_000, isIdle: () => runsInFlight() === 0 },
  log,
});
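The two lifetime values above differ only in when the serve loop is allowed to end. A minimal sketch of how that argument could drive teardown, assuming that "idle" means isIdle() has returned true continuously for at least `ms`: the semantics of the timeout, the polling interval, and the `waitForShutdown` helper are all hypothetical, and only the `DaemonLifetime` shape and the three reason strings come from the API above.

```typescript
type DaemonLifetime =
  | { kind: "forever" }
  | { kind: "idleTimeout"; ms: number; isIdle: () => boolean };
type ShutdownReason = "sigterm" | "abort" | "idle";

/** Resolves with the first reason to stop serving; never calls process.exit. */
function waitForShutdown(
  lifetime: DaemonLifetime,
  signal?: AbortSignal,
  pollMs = 1_000,
): Promise<ShutdownReason> {
  return new Promise((resolve) => {
    let timer: ReturnType<typeof setInterval> | undefined;
    const onSigterm = () => finish("sigterm");
    const onAbort = () => finish("abort");
    const finish = (reason: ShutdownReason) => {
      // Detach everything so a finished wait leaves no listeners or timers behind.
      if (timer !== undefined) clearInterval(timer);
      process.off("SIGTERM", onSigterm);
      signal?.removeEventListener("abort", onAbort);
      resolve(reason);
    };
    process.once("SIGTERM", onSigterm);
    if (signal?.aborted) return finish("abort");
    signal?.addEventListener("abort", onAbort, { once: true });
    if (lifetime.kind === "idleTimeout") {
      const { ms, isIdle } = lifetime;
      let idleSince: number | undefined;
      timer = setInterval(() => {
        if (!isIdle()) {
          idleSince = undefined; // any activity restarts the idle clock
          return;
        }
        idleSince ??= Date.now();
        if (Date.now() - idleSince >= ms) finish("idle");
      }, pollMs);
    }
  });
}
```

With `{ kind: "forever" }` no timer is ever created, so the only exits are SIGTERM and the test-driven abort, which matches kaval's policy that an idle daemon still holds terminals.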

As load-bearing as what’s exported is what is deliberately absent: no env application (B0 removed the daemon’s env role), no spawn/respawn or waitForPidGone (supervisor half — @kolu/surface-daemon-supervisor, its own package from B2), no survival, adoption, or reconciliation (kaval B3’s soul, never spine), and no process.exit inside the mechanism (the gate-race tests run it in-process).

Phases

The trigger discipline matters more than the speed: each half is a package from birth (daemon: B1; supervisor: B2 — revised 2026-06-12 from a post-soak S1 extraction, the timing paragraph above). What soaks in production before the second tenant arrives is the mechanism itself (B2’s recycle, B3’s restart), not its directory location; odu serve (S2) then consumes a boundary that already exists.