
R-4 — kaval, the standalone PTY daemon

feature · budding · accepted

The redo plan for R-4 — local PTY daemonization, reframed: the daemon is kaval, a standalone program in the drishti/odu tradition, and kolu is its first client. One rule (package boundary = process boundary = staleKey hash), a dumb fully-specified wire, and four PRs — the spawn-policy inversion, the kaval binary + client, the door with survival off, then survival — each complete w.r.t. the hazards it opens. Carries the #1034 postmortem and its hard constraints.

The R-4 plan of record — local PTY daemonization + @kolu/pty-host. Split out of the parent remote-terminals plan on 2026-05-30; the parent keeps R-1/R-1.5/R-1.6 (shipped) and R-2/R-3 (post-R-4), and now also owns the multi-host direction (1 local + N ssh-remote pty-hosts, host switching in the ChromeBar) that this plan must not foreclose. Current state: A1 (#1055), A2 (#1063), B0 (#1292), B1 (#1301), B2 (#1310), B3.1 (#1330), B3.2 (#1337), B3.3 (#1344), and B3.4 (#1353) are on master. The first Phase B build (#1275) shipped the full feature as one 40-commit PR, was verified live in production, and was then deliberately discarded: functional, but the architecture was discovered in review rather than designed. This note is the redo plan — same user-facing functionality, a designed architecture, and a PR split sized so each PR can be executed end-to-end by a single agent session. Reframed 2026-06-12: the daemon is not a kolu module that happens to run in its own process — it is kaval (Tamil kāval — watch, guard), a standalone program in the tradition of drishti and odu, with kolu as its first and biggest client. The split is now four PRs: B0 (the spawn-policy inversion) → B1 (the kaval binary + client) → B2 (the door) → B3 (survival).
B0 shipped in #1292 (2026-06-12) — the wire is fully specified and the package carries zero kolu-* deps; B1 shipped in #1301 (2026-06-12) (the rename to kaval + kaval-tui, the daemon binary, @kolu/surface-daemon, full e2e); B2 — the door — shipped in #1310 (2026-06-12) (the topology flip, @kolu/surface-daemon-supervisor, the honest degraded state + the live KAVAL rail column/dialog); B3 — survival — shipped as a four-PR chain (B3.1, the seam-carving refactor, shipped in #1330; B3.2, the supervised session-preserving restart, shipped in #1337; B3.3, adoption — terminals survive a deploy that didn’t change kaval’s source — shipped in #1344; B3.4, the currency nudge, shipped in #1353; the chain is complete) after a one-big-PR attempt (#1326) was closed (too big; two blocking data-loss bugs survived its own review). The same week, surface-daemon (#1294) named the daemon-lifecycle machinery that B1/B2 build as a shared spine with odu serve as its second tenant — and B1 now ships the spine’s daemon half as @kolu/surface-daemon from the get-go (the review-isolation decision below), plus full e2e coverage for the kaval + kaval-tui pair; the supervisor half is born as its own package, @kolu/surface-daemon-supervisor, in B2 itself (revised 2026-06-12 on B2 planning; S1 shrinks to the handshake move).

Companions: the srv · pty ChromeBar rail (A2’s deliverable, shipped) and kolu-tui (Phases 0–2 shipped — the CLI that dials the pty-host unix socket, today served by the in-process server, after B2 by the daemon; renamed kaval-tui in B1, when it gains a bin of its own).

The architecture — one rule, then a module map

[Figure: module map. Nodes: kolu web UI (browser); kolu-server, the session brain (restarts every deploy), holding the supervisor (spine: @kolu/surface-daemon-supervisor, born B2: endpoint · waitForPidGone · restart · survivable spawn; soul: src/ptyHost/ localDriver params · src/terminalBackend reconcile + adopt), spawn policy (kolu-pty: cleanEnv · identity env · shell-init), and the provider DAG (FRESH each deploy); kaval, the dumb-but-durable PTY daemon (hashed whole: the staleKey), holding daemonMain (pid-gate · own rcDir · serve loop) and createPtyHost · taps (surface contract · unix socket); kaval-tui (reference client). Edge labels: websocket surface · spawns + supervises · spawn {argv, env, initFiles} · raw taps · dials the socket.]
The module map, post-reframing. kaval (today @kolu/pty-host) is exactly the code that runs in the daemon process — package boundary = process boundary = staleKey hash, zero file-level exceptions. All spawn policy (cleanEnv · identity env · shell-init rcfiles, i.e. kolu-pty) lives client-side in kolu-server and reaches kaval as data — spawn {argv, env, initFiles} — never as code. The supervisor's spine (endpoint · waitForPidGone · restart · the survivable-spawn mechanism) is its own un-hashed package — @kolu/surface-daemon-supervisor, born in B2 — keyed by hostId from day one; what stays in kolu-server is the soul (localDriver's kaval parameter bundle, B3's reconcile), so R-2's ssh driver is an additive sibling that provisions the same kaval closure via surface-nix-host. kaval-tui dials the same socket kolu does.

The one rule everything follows from: the package boundary IS the process boundary IS the staleKey hash set. kaval (today @kolu/pty-host; renamed in B1) contains exactly the code that executes inside the daemon — the PTY primitive, the wire contract, the taps, the socket serving, and the process entry (pid-gate acquisition, the daemon’s own root/rcDir, exit handling). The staleKey is “the nix hash of the package dirs that run in the daemon” — kaval whole, plus its daemon-side workspace roots (terminal-protocol today; surface-daemon’s daemon half after B1), each hashed whole with zero file-level exceptions — and buildId.closure.test.ts walks the import graph from kaval’s two entries (index.ts, the embedded surface, + the daemon bin.ts) — the closure test now answers the exact question the staleKey asks: what would a restart gain? Everything that supervises from outside lives outside the hash — correctly, because changing the supervisor never changes what a restart would gain: the spine (endpoint states, waitForPidGone, the composed restart, the survivable-spawn mechanism) in @kolu/surface-daemon-supervisor (born B2, deliberately not a staleKey root), the soul (localDriver’s kaval parameter bundle, B3’s reconcile) in packages/server. And after B0, so does all spawn policy — it crosses the wire as data, so changing kolu’s shell arcana never forces a daemon restart either.
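
The closure question ("what would a restart gain?") can be illustrated with a small import-graph walk. This is a minimal sketch, not the real buildId.closure.test.ts: the graph literal, the file names under each package (daemonMain.ts aside), and both helper functions are hypothetical stand-ins for whatever the actual test derives from source.

```typescript
// Hypothetical, hand-written import graph: file path -> files it imports.
// The real test would derive this from source; here it is a fixture.
type ImportGraph = Record<string, string[]>;

// Walk the graph from the entry points and collect every reachable file.
function reachableClosure(graph: ImportGraph, entries: string[]): Set<string> {
  const seen = new Set<string>();
  const stack = entries.slice();
  while (stack.length > 0) {
    const file = stack.pop() as string;
    if (seen.has(file)) continue;
    seen.add(file);
    for (const dep of graph[file] ?? []) stack.push(dep);
  }
  return seen;
}

// Files in the closure that sit outside every hashed root directory.
// A non-empty result means the staleKey would miss code the daemon runs.
function escapesHashedRoots(closure: Set<string>, roots: string[]): string[] {
  return Array.from(closure)
    .filter((file) => !roots.some((root) => file.startsWith(root + "/")))
    .sort();
}

const graph: ImportGraph = {
  "packages/kaval/src/index.ts": ["packages/terminal-protocol/src/wire.ts"],
  "packages/kaval/src/bin.ts": ["packages/kaval/src/daemonMain.ts"],
  "packages/kaval/src/daemonMain.ts": ["packages/surface-daemon/src/pidGate.ts"],
  "packages/terminal-protocol/src/wire.ts": [],
  "packages/surface-daemon/src/pidGate.ts": [],
};
const roots = ["packages/kaval", "packages/terminal-protocol", "packages/surface-daemon"];
const entries = ["packages/kaval/src/index.ts", "packages/kaval/src/bin.ts"];

const leaks = escapesHashedRoots(reachableClosure(graph, entries), roots);
```

With this fixture `leaks` is empty. An import from the daemon entry into packages/server would show up in the list, which is the failure the rule is meant to make loud.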

kaval — a program of its own; kolu, its first client

Three insights (2026-06-12) reframe what the daemon is, without changing what Phase B builds:

  1. The graduation pattern already exists. drishti grew out of the remote-process-monitor example; odu grew out of mini-ci; both are surface agents that graduated from kolu’s monorepo once their domain stabilized. kaval — Tamil kāval, watch/guard: the thing that stands watch over your terminals — is the third graduate, and the most natural, since ptyHostSurface already is a defineSurface() declaration. It has been a surface app all along, just trapped inside kolu-server’s process. B1 makes it graduation-ready (own bin, own socket, zero kolu deps); actual extraction to its own repo waits until the domain stops churning, exactly like the other two.

  2. The layering is: dumb-but-durable kaval ← kolu the session brain ← kolu the web UI. kaval holds fds, mirrors screens, serves taps — and nothing else. kolu-server remains a substantive middle tier, deliberately: the #1031 postmortem is binding (daemonizing the provider DAG served stale detection on every deploy), so session persistence, reconciliation, the provider DAG, and all spawn policy stay kolu’s domain, re-run fresh against surviving PTYs. kolu is not a thin client — but it is a client, one of several (kaval-tui today, an MCP face later).

  3. Spawn policy crosses the wire as data, never as code. Today cleanEnv/koluIdentityEnv/prepareShellInit are baked into the daemon’s spawn handler — kolu’s sensor system (the OSC 7/2/633 hooks powering cwd tracking, foreground detection, and agent awareness) implanted host-side. B0 inverts this: kaval exposes host facts (system.info → {shell, home, platform, rcDir}), and spawn takes the full specification {argv, env, initFiles} — the daemon writes the rcfiles it is handed and asks no questions. TERM_PROGRAM=kolu is asserted by the tier that renders pixels, not the tier that holds fds — tmux doesn’t claim to be your terminal emulator either. A bare kaval client spawns a plain shell; the tap channels just stay quiet. The inversion is also what makes a remote kaval possible: shell-init content for a host that isn’t kolu’s own machine is computable from system.info, and the rcfiles land daemon-side where kolu’s hands can’t reach.
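
To make the third insight concrete, here is a rough sketch of what "policy as data" could look like from the client side. Only the field names {shell, home, platform, rcDir} and {argv, env, initFiles} come from this note; the field types, the shape of an initFiles entry, the `buildSpawnSpec` and `bareSpawnSpec` helpers, the rcfile name, and the use of bash's `--rcfile` flag are all assumptions made for illustration.

```typescript
// Host facts as returned by system.info. Field names are from the note;
// the string types are an assumption.
interface SystemInfo {
  shell: string;
  home: string;
  platform: string;
  rcDir: string;
}

// The fully specified spawn request. The entry shape of initFiles
// ({ path, content }) is an assumption.
interface SpawnSpec {
  argv: string[];
  env: Record<string, string>;
  initFiles: { path: string; content: string }[];
}

// Hypothetical client-side policy: everything kolu-specific is computed here
// and handed to the daemon as plain data. Assumes a bash-like shell.
function buildSpawnSpec(info: SystemInfo, baseEnv: Record<string, string>): SpawnSpec {
  const rcPath = `${info.rcDir}/kolu-init.sh`;
  return {
    argv: [info.shell, "--rcfile", rcPath],
    env: { ...baseEnv, HOME: info.home, TERM_PROGRAM: "kolu" },
    initFiles: [{ path: rcPath, content: "# OSC hook definitions would go here\n" }],
  };
}

// A bare client sends no policy at all: a plain shell, no rcfiles.
function bareSpawnSpec(info: SystemInfo): SpawnSpec {
  return { argv: [info.shell], env: { HOME: info.home }, initFiles: [] };
}

const info: SystemInfo = { shell: "/bin/bash", home: "/home/u", platform: "linux", rcDir: "/run/kaval/rc" };
const koluSpec = buildSpawnSpec(info, { LANG: "C.UTF-8" });
const bareSpec = bareSpawnSpec(info);
```

The point of the shape is that the daemon never needs to know why a file or variable is there; it writes `initFiles` and executes `argv` with `env`.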

The remaining axes, most stable → most volatile (the boundaries that own them):

Axis | Rate | Boundary that owns it
Everything in the daemon process | rarely | kaval (today @kolu/pty-host) — hashed whole, alongside its daemon-side roots (terminal-protocol; surface-daemon’s daemon half post-B1); the entry travels with the package, so R-2’s remote daemon gets single-instance arbitration for free.
Wire contract + compat + identity | per wire change | ptyHostSurface + PTY_HOST_CONTRACT_VERSION + PtyHostIdentity, in-package (unchanged from A1/A2; bumped to 3.0 in B0 ✅ for the fully-specified spawn + system.info).
Spawn policy — env basis, identity vars, shell-init rcfiles | per shell/OS quirk | Client-side: kolu-pty consumed by packages/server; crosses the wire as data (spawn {argv, env, initFiles}), never as code. kaval stays host-agnostic.
Identity values: staleKey vs navigableCommit | key: per wire change · commit: per deploy | buildId.ts (unchanged). Currency is derived at the read site (staleKey !== currentBuildId()), never stored, never frozen onto a connection.
Endpoint status (the one health owner) | per endpoint-model change | @kolu/surface-daemon-supervisor’s endpoint.ts (born B2) — a single-host endpoint per spec (keyed implicitly by its spec.hostId), emitting {state, identity, startedAt} on every transition via onStatus(hostId, …). The per-hostId map lives server-side in daemonStatus.ts’s Map<hostId, DaemonStatus> (a map of one, today: local); everything else — buildInfo, rail, degraded canvas — derives by subscription.
Survivable-spawn mechanics | per platform quirk | @kolu/surface-daemon-supervisor’s default DaemonDriver (survivableSpawnDriver(DaemonSpawnConfig)) — the fromSource gate (!fromSource && INVOCATION_ID set → systemd-run --user with per-spawn unique unit names; otherwise, macOS included → detached+unref); waitForPidGone (ESRCH poll, load-aware ceiling); composes the gate’s gatePid/isHolderLive.
kaval’s reach values | per kolu packaging change | localDriver.ts in packages/server — the parameter bundle: the kaval binary from the kolu closure, dev-flag exec-arg filter, unit prefix, socket/gate paths, the --setenv set (shrunk to daemon-operational vars (XDG_RUNTIME_DIR) — PTY env arrives per-spawn on the wire after B0). R-2 adds sshDriver as a sibling (HostSession reach + provisionAgent, shipping the same kaval closure).
Recovery sequence | per recovery policy | restart.ts — one composed capture → drain → recycle → reattach, where recycle is endpoint.ensure() (kill → waitForPidGone → spawn → connect); the three caller steps (RestartSteps) are non-optional in the type, with save folded into capture. The forced and user paths are the same function. (B3.2 ✅ #1337 added serializeRestart — coalesces concurrent callers onto one in-flight recycle — and the restarting state, held across the whole sequence by endpoint.holdRestarting; ENDPOINT_STATES is now connecting/connected/restarting/degraded/dead.)
Reconciliation + adoption | per persisted field | reconcile.ts (a pure function: daemon.list() × savedSession → {adopt, adoptOrphans}) + adoptTerminal as a sibling of the existing spawnAndWire in local.ts (adopt and spawn converge on one post-wire path). Adoption consumes the whole SavedTerminal record as a unit — never field-by-field reconstruction — and a schema-level round-trip test iterates the schema’s keys.
Health → UI projection | per UX tweak | A per-host status collection on the surface; rail + DegradedCanvas + restart dialog subscribe. User-facing name: “kaval” — it’s a thing of its own, and the name says so.
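
The "one health owner" row can be sketched as a tiny emitter feeding a per-host map. The state names and the {state, identity, startedAt} payload are taken from this note; the `createEndpoint` factory, the nullable fields, and the identity shape (only `staleKey`) are simplifying assumptions, not the package's actual API.

```typescript
// State names as listed in the note (ENDPOINT_STATES after B3.2).
const ENDPOINT_STATES = ["connecting", "connected", "restarting", "degraded", "dead"] as const;
type EndpointState = (typeof ENDPOINT_STATES)[number];

// Payload emitted on every transition. Nullability is an assumption.
interface DaemonStatus {
  state: EndpointState;
  identity: { staleKey: string } | null;
  startedAt: number | null;
}

type OnStatus = (hostId: string, status: DaemonStatus) => void;

// Hypothetical single-host endpoint: it owns the status and reports every
// transition through the callback, so consumers never poll or republish.
function createEndpoint(hostId: string, onStatus: OnStatus) {
  let current: DaemonStatus = { state: "connecting", identity: null, startedAt: null };
  return {
    transition(next: DaemonStatus): void {
      current = next;
      onStatus(hostId, current);
    },
    status(): DaemonStatus {
      return current;
    },
  };
}

// Server side: a map keyed by hostId (a map of one today).
const daemonStatus = new Map<string, DaemonStatus>();
const localEndpoint = createEndpoint("local", (hostId, status) => {
  daemonStatus.set(hostId, status);
});

localEndpoint.transition({ state: "connected", identity: { staleKey: "abc123" }, startedAt: 1 });
localEndpoint.transition({ state: "dead", identity: null, startedAt: null });
```

Because the map is written only from the callback, a second host is another `createEndpoint` call with a different `hostId`, and no consumer changes shape.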

The daemon-package question is settled in two halves (revised 2026-06-12, after #1294 ). surface-daemon names the shared spine — atomic pid-gate, the daemonMain skeleton (gate → serve → SIGTERM teardown), the system.version/build-id handshake fragment, and the endpoint supervisor (state machine · spawn/waitForPidGone drivers · composed restart) — and identifies the second tenant: not R-2’s ssh driver (still a sibling behind the endpoint concept, not a consumer of the local driver’s code) but odu serve, odu-runner’s long-lived CI coordinator, which needs the identical four pieces. The split: the daemon half is born as @kolu/surface-daemon in B1 itself; the supervisor half is born as its own separate @kolu/surface-daemon-supervisor package in B2 itself (the decision on shape — a package, never a /supervisor subpath — stands; what fell on B2 planning, 2026-06-12, was the deferral that had it gestate in server/src/ptyHost/ until an S1 extraction). Creating the daemon package from the get-go buys two things — review isolation (the mechanism is reviewed once, as a self-contained package; the rest of B1 stays renames plus a ~20-line composition) and one home for the gate’s file format (acquirePidGate for the daemon, plus the gatePid/isHolderLive primitives B2’s supervisor composes where it lives — so no supervisor reader sits in the daemon-hashed package). The usual one-consumer objection is weak here: the parameter surface (scope key, socket path, the router, lifetime forever | idleTimeout) is already designed against both consumers in the surface-daemon note, and a private workspace package is cheap to re-cut when odu serve arrives. 
The staleKey constraint shapes where, never when: the daemon package is hashed whole into kaval’s key, as a third root beside terminal-protocol (the closure test’s existing multi-root pattern) — correct because everything in it is part of the one daemon binary a restart loads (the serve half in the daemon process, and since P2.5 the durable stdio front frontDaemonOverStdio in the per-link proxy reached from bin.ts’s --stdio dispatch). A supervisor file in the daemon package would flip kaval’s key on every supervisor-only edit (A2’s over-prompting failure, reborn), so the daemon package carries a standing invariant — only daemon-binary code (serve + front) may live here — and the supervisor gets its own package, where the package boundary is the hash boundary with no subdir glob to mis-scope. The original sequencing deferred that package to S1; on B2 planning it failed its own audit. “Wait for the API to settle” is a backwards-compat argument, and a private workspace package with one downstream — edited in the same PR — has no compat obligation. “Wait for the second consumer” lost to the very precedent it sat beside: the daemon package was born in B1 with one consumer. And the quantified cost is ~five scaffolding files plus one default.nix fileset line, with zero hash surface — the supervisor package is deliberately not a staleKey root, so there is no closure-test or build-id wiring to get wrong. What early birth buys is the same review isolation that won in B1, plus the spine/soul trap enforced by the package boundary from the first commit instead of by reviewer vigilance — the #1275 mid-PR-extraction lesson applied in advance. S1 shrinks to one move: the handshake fragment into @kolu/surface.

Multi-host readiness (the shapes, not the feature)

The parent note owns the direction: 1 local + N ssh-remote pty-hosts, host switching in the ChromeBar, reattach to all of them. Phase B ships exactly one endpoint but must be host-count-agnostic in its shapes: the endpoint map is keyed by hostId; daemon status is a per-host collection (not a singleton cell); adoption and the saved session join on id today, shaped to become (host, id) in R-2; the TerminalLocation discriminator from R-1 already sits at the single getTerminalBackendFor dispatch seam, ready for the persisted location field R-2 threads onto records; and the daemon-side code is one hashed package that runs identically on a remote host. None of this costs more than the singleton version; retrofitting any of it later costs another #1275.

The kaval reframing collapses R-2’s provisioning story into existing machinery: a remote pty-host is the same kaval closure shipped by @kolu/surface-nix-host: nix copy, realise, run — exactly how odu provisions odu-runner onto lane hosts and drishti its agent. B0’s system.info is what makes spawn policy computable for a host that isn’t kolu’s own machine; B0’s initFiles is what lets the rcfiles land on a disk kolu’s hands can’t reach.

What this makes impossible by construction

The production failures — four from #1034, four more paid for during #1275 — each become a failing test at the phase that owns the concept:

Production failure | Killed at | How
Mis-scoped staleness key (wire change didn’t flip it) | A2 ✅ | Closure-scoped key + import-walk guard (now re-rooted at the daemon entry in B1).
Over-prompting (key nudged on every deploy) | A2 ✅ | Server-only change leaves the key bit-identical — a falsifiable test.
Data-loss restart (kill-then-pray, #1034) | B3.2 ✅ | One composed restart, snapshot-before-kill is the capture step (setSavedSessionFromSnapshot); the drain fires no terminals:dirty, so no autosave can clobber the capture.
Empty-canvas lie (dead daemon → “no terminals”) | B2 | Honest dead/degraded state ships with the door, before any survival promise exists.
Lossy adoption (#1275: splits un-nested, then agent-resume lost and autosaved, poisoning cold restores) | B3.3 | Adoption consumes the whole persisted record; a schema-key-iterating round-trip test closes the class; a non-survivor is an exited shell, dropped as handleExit already does — never a hidden, autosave-clobbered restore card.
Lazy adopt-on-spawn (#1275: orphan shells respawned into the daemon — duplicated terminals) | B3.3 | Eager boot-time reconciliation is the only adopt path; a live PTY with no saved record (a create that raced the debounced autosave) is adopted from the live snapshot, never re-spawned (so no duplicate), and never reaped (so it survives the redeploy); exited shells are dropped (not respawned).
Identity gone stale after restart (#1275: one-shot read, then manual republish at enumerated call sites) | B2 | One status owner emitting on every transition; everything derives by subscription.
Contract-skew crash-loop masked by launchd as an “App updated” loop (the zest field report on #1275) | B2 | Version checked over the socket on every connect, never an import-time throw; skew at boot → controlled recycle (B2) / composed restart (B3.2).
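
The last row's "checked on every connect, never an import-time throw" can be sketched as a function that returns a typed result instead of throwing. The constant name and its 3.0 value are from this note; the compatibility rule (same major version) and the `checkContract` and `bootAction` helpers are assumptions chosen only to illustrate the shape.

```typescript
// Contract version named in the note (bumped to 3.0 in B0).
const PTY_HOST_CONTRACT_VERSION = "3.0";

// A handshake result is data, so the caller decides what to do with skew.
type Handshake =
  | { kind: "compatible" }
  | { kind: "skew"; local: string; remote: string };

// Assumed rule: versions are compatible when their major component matches.
function checkContract(remote: string, local: string = PTY_HOST_CONTRACT_VERSION): Handshake {
  const major = (version: string) => version.split(".")[0];
  return major(remote) === major(local)
    ? { kind: "compatible" }
    : { kind: "skew", local, remote };
}

// Hypothetical boot-time decision: skew leads to a controlled recycle,
// never to a crash at module load.
function bootAction(handshake: Handshake): "use-connection" | "recycle" {
  return handshake.kind === "skew" ? "recycle" : "use-connection";
}

const sameMajor = checkContract("3.1");
const oldDaemon = checkContract("2.0");
```

Keeping the check inside the connect path means a supervisor such as launchd never sees a crash-loop caused by skew; it only ever sees a daemon being replaced on purpose.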

The PRs — each an executable brief

No feature flags — each PR ships complete to master and leaves kolu working. The hazard rule, refined from the first attempt: each PR must be complete w.r.t. the hazards it opens — B0 opens none (an in-process refactor with byte-identical behavior), and the door can open with the survival promise off, which empties its hazard set by policy. B0–B2 shipped as single PRs; B3 is itself a four-PR chain (below), each self-sufficient, after the one-big-PR attempt ( #1326 ) was closed.

B0 — the inversion: dumb wire, client policy shipped

Goal: flip who owns spawn policy. The pty-host’s wire becomes fully specified — the daemon spawns exactly what it is told and asks no questions — and every kolu-ism moves to (or stays in) kolu’s tier. In-process only; zero user-visible change; zero new processes. This is the contract change, made at the one moment it is free: no daemon exists yet, no survivors, both consumers in-repo. Shipped in #1292 (merged 2026-06-12) — contract bumped to 3.0; every Verify item below held (CI green on both platforms; the closure test’s tightened allowlist now enforces the dependency diet). The review gauntlet’s one real find, folded in: init files are cleaned up on partial-write and on spawn failure, not just on PTY exit.

Deliverables:

Traps: behavior parity IS the deliverable — golden-test that the assembled rcfile content reaching the PTY is byte-identical before/after the inversion; no policy may remain daemon-side (a test walks the package’s import graph and fails on any kolu-pty/kolu-common/kolu-shared hit); the staleKey flips on this deploy (a wire change — correct, not a regression).

Verify: e2e green — titles, cwd tracking, agent detection all ride the relocated OSC hooks, so the existing features are the test; packages/pty-host/package.json shows zero kolu-* workspace deps; a server-only change after this PR still leaves the staleKey bit-identical.
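
The "zero kolu-* workspace deps" check is simple enough to sketch as a pure function over a parsed package.json. This is an illustration only: the `forbiddenDeps` helper, the assumption that the forbidden packages are exactly those whose name starts with `kolu-`, and the example dependency names are hypothetical.

```typescript
// Minimal slice of a package.json; only the dependency maps matter here.
interface PackageJson {
  dependencies?: Record<string, string>;
  devDependencies?: Record<string, string>;
}

// Assumption: a forbidden dependency is any package named kolu-*.
function forbiddenDeps(pkg: PackageJson): string[] {
  const names = Object.keys(pkg.dependencies ?? {}).concat(
    Object.keys(pkg.devDependencies ?? {}),
  );
  return names.filter((name) => name.startsWith("kolu-")).sort();
}

// Hypothetical fixtures: one clean manifest, one that still leaks policy.
const cleanManifest: PackageJson = { dependencies: { "@kolu/surface": "workspace:*" } };
const leakyManifest: PackageJson = {
  dependencies: { "kolu-pty": "workspace:*" },
  devDependencies: { "kolu-common": "workspace:*" },
};
```

A test built on this would fail with the offending names listed, which is more useful in review than a bare boolean.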

B1 — kaval: the binary and its client shipped · #1301

Goal: the daemon becomes a thing of its own — packages/pty-host → kaval, packages/pty-tui → kaval-tui, both with bin entries, runnable as a pair on a machine where kolu has never been installed. Zero changes to packages/server beyond mechanical import renames. Shipped in #1301.

Deliverables:

Traps: no import-time throws anywhere; no top-level await; neither kaval nor surface-daemon may import from packages/server. (The first attempt’s socket-collision trap dissolves by construction: kaval’s default path differs from the in-process kolu path.) The package boundary IS the spine/soul line now: nothing kaval-specific enters @kolu/surface-daemon (kaval’s choices arrive as parameters), and nothing supervisor-side enters it either — the whole-package staleKey hash forbids it, standing; the supervisor half gets its own un-hashed package in B2.

Verify: the coverage ledger passes — every contract procedure and stream exercised against a real spawned daemon over the socket. On a clean box (a pu box is ideal): run kaval, spawn + drive + detach + reattach a shell with kaval-tui — kolu nowhere in the picture (the human smoke over what the e2e suite automates). A server-only change leaves KAVAL_BUILD_ID bit-identical across two nix builds; touching daemonMain.ts or anything in surface-daemon flips it (that code runs in the daemon — correct, not a regression).

B2 — the door, with survival off shipped · #1310

Goal: flip the topology — the server becomes a client of a daemon it spawns — while keeping user-facing semantics byte-identical to today: the boot policy is always recycle (connect-if-survivor → kill → waitForPidGone → spawn fresh → connect). No survivors exist, so no survival hazard can open: no orphans, no skew older than one boot, nothing for a restart to pray over. Every production deploy now exercises kill → wait-for-real-exit → respawn — the exact race #1034 lost — with zero sessions at stake.
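
The "wait-for-real-exit" step in the recycle is the part #1034 got wrong, so a sketch may help. This is not the package's waitForPidGone: the signature, the injected `isAlive` probe and `sleep`, and the fixed ceiling are assumptions. In Node the probe is commonly `process.kill(pid, 0)`, which throws an error with code ESRCH once no such process exists; it is injected here so the sketch stays runtime-agnostic.

```typescript
type Probe = (pid: number) => boolean;
type Sleep = (ms: number) => Promise<void>;

// Poll until the process is gone or the ceiling is reached.
// Resolves true when the pid is confirmed gone, false on timeout.
async function waitForPidGone(
  pid: number,
  isAlive: Probe,
  sleep: Sleep,
  intervalMs = 50,
  ceilingMs = 5000,
): Promise<boolean> {
  let waited = 0;
  while (isAlive(pid)) {
    if (waited >= ceilingMs) return false;
    await sleep(intervalMs);
    waited += intervalMs;
  }
  return true;
}

// Fake probe for illustration: reports "alive" for a fixed number of polls.
function aliveFor(polls: number): Probe {
  let remaining = polls;
  return () => remaining-- > 0;
}

const noSleep: Sleep = async () => {};
const exitsSoon = waitForPidGone(4242, aliveFor(2), noSleep);
const neverExits = waitForPidGone(4242, () => true, noSleep, 50, 100);
```

Only a `true` result licenses spawning the replacement; a `false` result means the old holder may still own the socket and the gate.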

Deliverables:

Traps: no mode flag — the socket is the path (a KOLU_DAEMON_MODE toggle is a feature flag in a trench coat); do not implement adoption (assert terminal.list() is empty after the recycle); do not widen any shared schema; restore-card-on-deploy behavior must be exactly today’s. The surface-daemon constraint, supervisor half: endpoint.ts’s state machine, waitForPidGone, restart.ts’s composed sequence, and the survivable-spawn mechanism are spine and live in @kolu/surface-daemon-supervisor from birth (odu serve’s CLI reuses them at S2) — the package boundary enforces what was previously a review-vigilance rule. The line through the driver: the incantation is spine, the values are soul — the INVOCATION_ID gate, the systemd-run/detached fork, and the unique-unit-name discipline ship in the package; everything kaval-supplied (the binary, the dev-flag filter, the --setenv values, the paths, the unit prefix) stays inside localDriver.ts in packages/server, the one file that is soul — it cannot drift into the spine without crossing a package edge the dependency allowlist refuses.
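
"The incantation is spine, the values are soul" can be pictured as a pure planning function that takes every kaval-specific value as a parameter. The gate condition (`!fromSource` and `INVOCATION_ID` set), systemd-run with `--user`, `--setenv`, and unique unit names are from this note; the `DaemonSpawnConfig` fields, the `planSpawn` helper, and the example values are hypothetical.

```typescript
// Soul: values supplied by the caller (field names are assumptions).
interface DaemonSpawnConfig {
  binary: string;
  args: string[];
  unitPrefix: string;
  setenv: Record<string, string>;
  fromSource: boolean;
}

type SpawnPlan =
  | { kind: "systemd-run"; argv: string[] }
  | { kind: "detached"; argv: string[] };

// Spine: the gate and the command shape, knowing nothing about kaval.
function planSpawn(
  cfg: DaemonSpawnConfig,
  env: Record<string, string | undefined>,
  uniqueSuffix: string,
): SpawnPlan {
  if (!cfg.fromSource && env.INVOCATION_ID) {
    const setenv = Object.keys(cfg.setenv).map((key) => `--setenv=${key}=${cfg.setenv[key]}`);
    const unit = `--unit=${cfg.unitPrefix}-${uniqueSuffix}`;
    return {
      kind: "systemd-run",
      argv: ["systemd-run", "--user", unit].concat(setenv, [cfg.binary], cfg.args),
    };
  }
  // Everywhere else (macOS included): a detached child the caller unrefs.
  return { kind: "detached", argv: [cfg.binary].concat(cfg.args) };
}

const cfg: DaemonSpawnConfig = {
  binary: "/nix/store/example/bin/kaval",
  args: ["--socket", "/run/user/1000/kaval.sock"],
  unitPrefix: "kaval",
  setenv: { XDG_RUNTIME_DIR: "/run/user/1000" },
  fromSource: false,
};
const underSystemd = planSpawn(cfg, { INVOCATION_ID: "abc" }, "1");
const plainShell = planSpawn(cfg, {}, "1");
```

Returning a plan rather than spawning directly keeps the gate testable without systemd, and keeps every path and flag value on the caller's side of the package edge.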

Verify: e2e green on both platforms (the darwin lane’s presence is itself a check — its absence was once silent); kaval-tui works against the daemon with no flags; pkill-ing kaval mid-session shows the honest dead state; a staged deploy shows the recycle in the journal.

B3 — survival: four PRs, not one · complete — B3.1–B3.4 shipped (currency · #1353)

Goal: terminals survive a kolu update. The first build did it as one ~1800-line PR ( #1326 , closed) — too big to review (two blocking data-loss bugs survived its own gauntlet), the spine grew mid-implementation, and a misframed edge case (partial survival) dragged a race-sensitive autosave swamp into the diff that wasn’t even needed (see the crux). The redo is a shallow chain of four PRs — one refactor, then one capability each.

[Figure: the B3 chain. B3.1 · refactor ✓ #1330 (carve the seams, spine + server) → B3.2 · supervised restart ✓ #1337 (capture→drain→recycle · F1 receptacle) → B3.3 · adoption ✓ #1344 (terminals survive a deploy) → B3.4 · currency nudge ✓ #1353 (CI-gated). Edge labels: reuse capture+restore · click target · reachability.]
The B3 chain. One pure refactor (blue) carves the seams; then supervised-restart — fully CI-testable, and it finishes B2's deferred Restart-kaval button — lands BEFORE adoption (the staged-prod headline) so adoption reuses the proven capture+restore plumbing; currency is last, gated on a CI proof that the build-id reaches the server. adoptOrEnsure ⊥ serializeRestart (no call-edge) is what licenses splitting the spine across B3.2/B3.3.
PR | Kind | Ships (user-visible) | Spine symbol — frozen up front | Hazard killed by construction | Dep
B3.1 #1330 | refactor | none | spine: extract the ensure() helpers (liveServingHolder · killLiveHolder · spawnConnectHold); server: extract killHalfWiredPty (the F2 reap receptacle) + dedup the snapshot shape into one SessionSnapshot type (producer + autosave) | byte-identical (endpoint.test.ts is the gate); every extraction has a live consumer — the F1 receptacle (setSavedSessionFromSnapshot) ships in B3.2 beside its consumer, not as bare future-API here |
B3.2 #1337 | feature | session-preserving restart of kaval — on a running daemon (pick up a new build · user-initiated) and a dead/degraded one (recover); finishes B2’s deferred “Restart kaval” button | restarting + serializeRestart + emit-guard; introduces setSavedSessionFromSnapshot (the F1 receptacle) beside its restart-capture consumer | F1 (the empty→null guard + its autosave-cancel test land here, with the consumer) · F3 · F4; fully CI/e2e-testable (recycle→fresh, restore from the empty canvas) | B3.1
B3.3 #1344 | feature | live PTYs — process + scrollback + running agent — survive a deploy that didn’t change kaval’s source (staleKey unchanged — the common case) | adoptOrEnsure — reconcile before wiring | #1275 whole-record adopt + F2; a non-survivor is an exited shell → dropped (kolu’s normal handleExit), never restore-carded | B3.1–2
B3.4 #1353 | feature | amber ⬆ update pending when kaval is a build behind → one-click recycle | none (reads existing identity.staleKey) | #1034 over-prompting (keyed on closure hash only); CI gate: build-id reaches the server | B3.2–3
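
The B3.2 spine symbols (one composed sequence, coalesced across concurrent callers) can be sketched in a few lines. The step names capture, drain, recycle, and reattach and the name serializeRestart come from this note; the signatures, the generic snapshot type, and the way the in-flight promise is cleared are assumptions, not the shipped restart.ts.

```typescript
// The three caller-supplied steps; all are required by the type.
interface RestartSteps<S> {
  capture(): Promise<S>;
  drain(): Promise<void>;
  reattach(snapshot: S): Promise<void>;
}

// One composed sequence: capture -> drain -> recycle -> reattach.
// The snapshot is taken before anything is killed.
async function composedRestart<S>(
  steps: RestartSteps<S>,
  recycle: () => Promise<void>,
): Promise<void> {
  const snapshot = await steps.capture();
  await steps.drain();
  await recycle();
  await steps.reattach(snapshot);
}

// Coalesce concurrent triggers onto one in-flight restart.
function serializeRestart(run: () => Promise<void>): () => Promise<void> {
  let inFlight: Promise<void> | null = null;
  return () => {
    if (inFlight !== null) return inFlight;
    const started = run();
    inFlight = started;
    const clear = () => {
      if (inFlight === started) inFlight = null;
    };
    // Clear on success or failure so a later trigger starts a fresh restart.
    started.then(clear, clear);
    return started;
  };
}
```

Because the forced path and the user path call the same function, a restart button clicked during a skew-triggered restart simply joins the one already running.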

Why B3.4 stands alone: it’s a distinct capability (the nudge + currency derivation) with its own hazard (#1034 over-prompting — never nudge when a deploy left kaval’s staleKey unchanged) and its own Nix gate, and it depends on both B3.2 (the restart it fires) and B3.3 (reachability — a build-behind daemon only exists once adoption keeps one alive; under always-recycle there’s nothing to nudge about). Its executable brief is B3.4 — currency nudge, below.

The crux — the partial case dissolves; #1326’s swamp was a conflation. Every #1326 bug and the session.ts doubling came from one mistake: treating an adopt-case non-survivor like a restart-case one. They’re different:

Case | A non-survivor is… | Right move
B3.2 restart — daemon killed, all PTYs die | a terminal you still want (you didn’t close it) | restore it (re-spawn from the captured session) — on the empty canvas, so no live survivors, no autosave race
B3.3 adoption — daemon survived, a PTY is gone | an exited shell (its process ended in the restart window) | drop it — exactly what kolu’s handleExit already does when a shell exits with the server up

So adoption’s “partial” case is trivial — adopt the live, drop the exited — with no restore card, no recycle, no autosave-durability machinery. #1326 mis-applied the restart-case restore card to the adopt case, which forced the pendingRestoreCard / union / session.restored cluster that burned four codex rounds for a problem that doesn’t exist. (If kolu ever keeps exited terminals around instead of closing them, revisit — that’s a separate feature.)
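
"Adopt the live, drop the exited" fits in one pure function. The sketch below follows the shape this note gives for reconcile.ts (live list × saved session → {adopt, adoptOrphans}); the `SavedTerminal` fields other than `id`, the extra `dropped` output, and the function body are assumptions for illustration.

```typescript
// Persisted record; fields other than id are hypothetical.
interface SavedTerminal {
  id: string;
  cwd: string;
  split?: string;
}

interface Reconciled {
  adopt: SavedTerminal[];   // live PTY with a saved record: adopt the whole record
  adoptOrphans: string[];   // live PTY with no saved record: adopt from the live snapshot
  dropped: string[];        // saved record with no live PTY: an exited shell
}

// Pure join on id between what the daemon still holds and what was saved.
function reconcile(liveIds: string[], saved: SavedTerminal[]): Reconciled {
  const live = new Set(liveIds);
  const savedIds = new Set(saved.map((terminal) => terminal.id));
  return {
    adopt: saved.filter((terminal) => live.has(terminal.id)),
    adoptOrphans: liveIds.filter((id) => !savedIds.has(id)),
    dropped: saved.filter((terminal) => !live.has(terminal.id)).map((terminal) => terminal.id),
  };
}

const result = reconcile(
  ["t1", "t3"],
  [
    { id: "t1", cwd: "/src", split: "left" },
    { id: "t2", cwd: "/tmp" },
  ],
);
```

Here t1 is adopted with its whole record (split included), t3 is adopted as an orphan rather than re-spawned, and t2 is dropped with no restore card.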

What upstreams to the spine libraries

@kolu/surface-daemon / @kolu/surface-daemon-supervisor are our libraries with our consumers (kolu today; odu serve next). We change them freely — no backwards-compat tax, and one consumer today is never a reason to keep mechanism in kolu. Program-agnostic mechanism upstreams; only kolu’s session/terminal policy stays soul. The split is the same line every PR draws: the spine adopts / recycles / serializes a connection; kolu reconciles that connection’s contents.

Upstreams → @kolu/surface-daemon-supervisor (mechanism) | Stays in kolu-server (soul — session/terminal policy)
B3.3 adoptOrEnsure() — adopt a live, handshake-compatible survivor; recycle only an absent / dead / genuinely-skewed one. F4: a live survivor is killed only on a typed DaemonContractSkewError the soul’s connect raises (the one failure that proves incompatibility); a non-skew connect failure (transport dial / unreadable handshake) is retried and, if it persists, the survivor is left up (reported degraded) — never killed — so a daemon we merely cannot reach right now keeps its live PTYs. Generic: it adopts-or-recycles via the injected connect, branching only on the soul’s typed skew marker, knowing nothing of PTYs | reconcile.ts: which daemon PTYs map to which saved terminals; adoptTerminal (a sibling of spawnAndWire): whole-record adoption
B3.2 restarting state + serializeRestart (coalesce concurrent triggers) + the emit-guard (hold restarting across the inner recycle) | setSavedSessionFromSnapshot (introduced here, beside its consumer — moved out of B3.1 as bare future-API) + the daemon.restart RPC
B3.1 ensure() helper extraction · already there from B2: composed restart, waitForPidGone, the driver, per-transition identity reporting | killHalfWiredPty (the reap receptacle) · SessionSnapshot (one snapshot type spanning producer + autosave)
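
The adoptOrEnsure decision described in the table (kill only on typed skew, retry anything else, then leave the survivor up) is sketched below in a deliberately simplified, synchronous form. The error name DaemonContractSkewError and the three outcomes follow this note; the function signature, the retry count, the marker check by `name`, and the outcome labels are assumptions, and the real function would be asynchronous.

```typescript
type Outcome = "adopted" | "recycled" | "left-degraded";

// Build the typed skew marker. Matching on `name` keeps the sketch
// independent of how Error subclasses compile on older targets.
function contractSkewError(message: string): Error {
  const err = new Error(message);
  err.name = "DaemonContractSkewError";
  return err;
}

function isContractSkew(err: unknown): boolean {
  return err instanceof Error && err.name === "DaemonContractSkewError";
}

// Generic: knows nothing of PTYs, only the injected connect and recycle.
function adoptOrEnsure(
  survivorLive: boolean,
  connect: () => void,
  recycle: () => void,
  retries = 2,
): Outcome {
  if (!survivorLive) {
    recycle();
    return "recycled";
  }
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      connect();
      return "adopted";
    } catch (err) {
      if (isContractSkew(err)) {
        // The one failure that proves incompatibility: safe to kill.
        recycle();
        return "recycled";
      }
      // Any other failure: retry, and never kill a daemon we merely cannot reach.
    }
  }
  return "left-degraded";
}
```

The asymmetry is the point: a transport error says nothing about the daemon's PTYs, so only a proven skew is allowed to cost the user their sessions.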

Every PR also: names its spine symbol up front and ships it with its consumer (#1326’s spine doubled mid-flight); carries its Nix wiring day-1 — default.nix fileset + ci::flake-check + zero-kolu-* dep-allowlist, B3.3 asserting KAVAL_BUILD_ID bit-identical (no forced restart) and B3.4 the build-id-reaches-server CI gate (#1326’s “stranded nix” was a missing assertion, not missing wiring); and maps its slice of #1034 / #1275 / F1–F4 to a type- or test-level fact. No “follow-up / degrades gracefully / flagged for review.”

The F-hazards — the data-loss modes #1326’s gauntlet missed, each now a falsifiable type/test fact: F1 an empty capture preserves the saved session (no empty→null erase before the kill); F2 killHalfWiredPty reaps a half-wired PTY (the shared reap receptacle B3.1 carved, B3.3 consumes); F3 the warming-window guard refuses terminal creates while the daemon comes up; F4 holdRestarting recovers honest state on a capture/drain failure. B3.2 killed F1 · F3 · F4; B3.3 owns only F2.
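
F1 is the easiest of the four to show as a type-level fact. The sketch below treats setSavedSessionFromSnapshot as a pure function from (saved, captured) to the next saved session; the real receptacle presumably writes to a store, so the signature and the `SessionSnapshot` fields here are assumptions.

```typescript
// One snapshot type for producer and autosave; the field is hypothetical.
interface SessionSnapshot {
  terminals: { id: string }[];
}

// F1: an empty capture must not erase the saved session before the kill.
function setSavedSessionFromSnapshot(
  saved: SessionSnapshot | null,
  captured: SessionSnapshot,
): SessionSnapshot | null {
  if (captured.terminals.length === 0) return saved;
  return captured;
}

const previouslySaved: SessionSnapshot = { terminals: [{ id: "t1" }, { id: "t2" }] };
const afterEmptyCapture = setSavedSessionFromSnapshot(previouslySaved, { terminals: [] });
const afterLiveCapture = setSavedSessionFromSnapshot(previouslySaved, { terminals: [{ id: "t9" }] });
```

An empty capture typically means the daemon was already dead or degraded when the restart fired, which is exactly when the previously saved session is the only copy left.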

B3.4 — currency nudge shipped · #1353

Goal: an amber ⬆ update pending affordance on the rail’s kaval column when the adopted daemon is a build behind the kaval the freshly-deployed server would spawn — one click fires B3.2’s session-preserving restart to pick it up. This is the last B3 PR, and it lights the one rail state A2 wired divergence-capable but could never fire: in A2 the pty column couldn’t diverge from itself in-process, and under B2’s always-recycle a survivor older than one boot was precluded. The nudge is reachable only because B3.3 adoption replaced always-recycle — a wire-compatible survivor is now adopted and kept alive, so it can be a build behind. It is the deliberate opposite of B3.3’s contract-skew recycle: skew is forced (no choice), currency is “you don’t have to, but a restart would gain the new kaval.”

The comparison — two already-baked nix facts the code names but never compares (buildId.ts: “a read-site derivation (staleKey !== currentBuildId()) that phase B adds”):

Operand | What | Where it is today
reported | the adopted daemon’s own staleKey | already on the wire: daemonStatus.identity.staleKey, already rendered shortId-form in the rail’s kaval column
expected | the server’s kaval build-id (the build it would spawn) | process.env.KAVAL_BUILD_ID in the kolu-server process (one ${kavalBuildId} nix --sets onto both the koluBin wrapper and the kaval bin), read via kaval’s currentBuildId(); no server code reads it yet

expected !== reported, derived at the read site (the rail), never stored, never frozen onto a connection. Because kavalBuildId is a content-hash of kaval’s daemon source closure only (the three roots kaval · terminal-protocol · surface-daemon, kept == the daemon’s reachable closure by buildId.closure.test.ts), a server-/client-only deploy leaves it bit-identical and the nudge stays silent — the #1034 over-prompting fix, by construction.
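The read-site derivation described above can be sketched as a pure function over the two operands. The names Currency and deriveCurrency are hypothetical, and the "unknown" result for a missing operand is an assumption of this sketch (the note does not say what the rail shows before an identity arrives).

```typescript
// Hypothetical names: a sketch of the comparison, not the rail's actual code.
type Currency = "current" | "update-pending" | "unknown";

/**
 * Derive the nudge state from the two build facts at the moment of rendering.
 * Nothing is stored: the caller passes the values it sees right now.
 *   reported: the adopted daemon's staleKey, as sent on the wire
 *   expected: the build-id of the kaval this server would spawn
 */
function deriveCurrency(
  reported: string | undefined,
  expected: string | undefined,
): Currency {
  if (!reported || !expected) return "unknown"; // assumption: stay silent without both facts
  return expected !== reported ? "update-pending" : "current";
}

// A server-only deploy leaves both hashes identical, so no nudge:
console.log(deriveCurrency("9f3c1a", "9f3c1a")); // "current"
// A deploy that changed kaval's source closure moves only the expected side:
console.log(deriveCurrency("9f3c1a", "b71e04")); // "update-pending"
```

Because the function takes both operands as arguments and keeps no state, the value can never be frozen onto a connection: a reconnect to a different daemon simply feeds in a different reported value.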

Deliverables:

Traps:

Verify: the unit truth table + the read-site echo; the over-prompting falsifiable test (server-only change → no nudge); the VM tier — adopt.nix asserts no nudge on a no-op redeploy, the build-skew check asserts the nudge fires and a confirmed restart recovers on the new build, each red under a deliberate mutation. The staged-prod gate below — a redeploy that flips kaval’s staleKey → amber nudge → confirmed restart → recovery honest — is B3.4’s acceptance signal; CI structurally cannot redeploy over a live daemon. Dep: B3.2 (the restart it fires) · B3.3 (reachability — adoption is what keeps a build-behind daemon alive to nudge about).

The gate — the second deploy (not a PR)

There is one kind of deploy — you ship the whole kolu closure and the server restarts; the daemon’s fate is keyed only on whether that deploy moved kaval’s staleKey. CI structurally cannot exercise “redeploy over a live daemon”; the acceptance signal is a staged prod checklist, a planned step: deploy B3 → open terminals + a running agent + a split → redeploy with kaval’s staleKey unchanged (no nudge; survivors adopted; the reattach: counts reconcile; the split still nested; the agent still running) → redeploy that flips kaval’s staleKey (amber nudge → confirmed restart → recovery honest; a deliberately failed respawn leaves the restorable session).

The #1034 postmortem — production failure and hard constraints

The first R4c-UI build ( #1034 ) shipped the build-mismatch “update pending” nudge + a Restart local PTY daemon command. On production (Linux/systemd, 2026-05-30) the nudge fired correctly on a deploy; clicking restart then destroyed a live 20-terminal session and could not bring the daemon back. From the journal: killAll drained all 20 terminals first; the old daemon (13.5h old, 25G RAM on a thrashing box) took ~2min to exit; the respawn timed out at 30s; the user was left with zero terminals, a dead daemon handle, and an empty canvas indistinguishable from “you have no terminals.” No second daemon ever ran — the failure was the respawn losing the race against the slow old-daemon exit.

Hard constraints (binding on the redo)

  1. Restart must be recoverable, never kill-then-pray. Snapshot the session first; after respawn, auto-reattach/offer restore; if respawn fails, a loud recoverable degraded state with the session preserved — never an empty canvas.
  2. Wait for the old daemon to fully exit — the single-instance lock fights the restart. Wait on actual process exit (kill(pid,0) → ESRCH) before spawning, with generous load-aware ceilings.
  3. Timeouts must fit a loaded production box, not an idle dev one (20 heavy PTYs + a tsx cold-start under swap ≫ 30s).
  4. Never lie about state — an explicit connecting/degraded UI, never silent emptiness.
  5. A persistent PTY status indicator, always visible — connected · connecting · degraded/dead · update-pending (became A2’s rail).
  6. Key staleness on pty-host, not the whole binary — hash the @kolu/pty-host source closure so the nudge fires only when a restart actually gains something (became A2’s staleKey; B1 re-roots the closure at the daemon entry).
  7. Back up the session before a destructive restart (Export/Import session shipped via #1046 ).
  8. Fix the preferences storm ( #1041 , fixed in #1050 — coalesced writes).
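Constraints 2 and 3 above describe a concrete mechanism: probe the old pid with signal 0 until the probe fails with ESRCH, under a ceiling sized for a loaded box. The following is a minimal sketch of that idea, assuming Node.js. It borrows the name waitForPidGone from the plan, but the signature, the polling interval, and the injectable probe are assumptions of this example, not the shipped helper.

```typescript
/** True while `pid` still exists. Signal 0 checks existence without delivering a signal. */
function isPidAlive(pid: number): boolean {
  try {
    process.kill(pid, 0);
    return true;
  } catch (err) {
    const code = (err as NodeJS.ErrnoException).code;
    if (code === "ESRCH") return false; // no such process: it has fully exited
    if (code === "EPERM") return true; // process exists but belongs to another user
    throw err;
  }
}

interface ExitWait {
  gone: boolean; // false means the ceiling was reached with the process still alive
  waitedMs: number;
}

/**
 * Poll until `pid` is gone or `ceilingMs` elapses. The caller chooses the ceiling,
 * so it can be scaled to the load on the box. `probe` is injectable for tests.
 */
async function waitForPidGone(
  pid: number,
  opts: { ceilingMs: number; pollMs?: number; probe?: (pid: number) => boolean },
): Promise<ExitWait> {
  const probe = opts.probe ?? isPidAlive;
  const pollMs = opts.pollMs ?? 250;
  const start = Date.now();
  for (;;) {
    if (!probe(pid)) return { gone: true, waitedMs: Date.now() - start };
    const waitedMs = Date.now() - start;
    if (waitedMs >= opts.ceilingMs) return { gone: false, waitedMs };
    await new Promise((resolve) => setTimeout(resolve, pollMs));
  }
}
```

A caller would spawn the replacement only on `gone: true`, and treat `gone: false` as the loud degraded state of constraint 1 with the snapshot still intact. The ceiling is passed in deliberately: the journal above records an exit that took about two minutes, so any fixed value picked on an idle dev box would repeat the original failure.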

Execution lessons (process, not product)

From building + salvaging #1034, re-confirmed during #1275: typecheck every package before commit (a clean nix build is not proof CI passes, #1049 ); a new workspace package has a checklist (default.nix explicit fileset, staleKey-closure decision, lockfile, vitest wiring) — the fileset miss broke CI once already; never git stash in the worktree; grep for unresolved conflict markers before committing a merge; never defer with a someday-issue — gaps go into the plan and get done in the PR; when the terminal contradicts itself, confirm via an independent path before “fixing” a phantom. Wire-shape rules: additive wire fields are optional + no contract bump (a required field force-restarts a surviving older daemon just to add a diagnostic); the identity surfaced to users is the navigable one (commit SHA), with the source hash kept as the staleness key only. And the #1275-specific one: CI-green is necessary, not sufficient — the second deploy is the acceptance signal; budget the staged-prod loop as a planned step with the diagnostic logging built in advance.

Design notes that carry forward

The survivor is kaval only — node-pty fds + the @xterm/headless mirror + the raw VT taps + a unix socket + its own process entry. A kolu-server restart re-runs the providers against the surviving PTYs, so detection is never stale while the PTYs persist. Honest cost (accepted): metadata is no longer “warm” across a restart — a brief re-detection pass, trading warm-on-reconnect for freshness-on-deploy. Only a pty-host contract change (rare) forces terminal loss; a provider change (frequent) restarts the cheap layer with PTYs untouched. This corrects #1031, which daemonized a survivor that held the providers.

The cgroup mechanism (spike-verified). On Linux/systemd the daemon spawns via systemd-run --user (gated on INVOCATION_ID), landing in its own transient cgroup — a plain detached/setsid child does not survive on cgroup-v2 (KillMode=control-group walks cgroup membership, not the session — the #1031 Linux failure). macOS’s detached spawn already survives launchd. Caveats folded into B2: linger must be on; absolute daemon path (minimal unit PATH); per-spawn unique unit names (a dead unit can linger loaded). Single-instance two ways: the unit name plus the atomic pid-gate.
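The spawn decision above can be sketched as a function that returns the command to run, without running it. This is an illustration under stated assumptions: planDaemonSpawn, the SpawnPlan type, and the kaval-<uuid>.service unit naming are hypothetical, and the use of systemd-run's --collect flag (which unloads a transient unit after it finishes, even on failure) is this sketch's guess at how to handle lingering dead units, not a confirmed detail of kaval.

```typescript
import { randomUUID } from "node:crypto";
import { isAbsolute } from "node:path";

type SpawnPlan =
  | { kind: "systemd-run"; command: "systemd-run"; args: string[]; unit: string }
  | { kind: "detached"; command: string; args: string[] };

/**
 * Choose how to launch the daemon so it outlives the server.
 * Under a systemd-managed parent (INVOCATION_ID is set) a detached child would
 * stay in the parent's cgroup, so the daemon is placed in its own transient unit.
 */
function planDaemonSpawn(
  daemonPath: string,
  env: Record<string, string | undefined>,
  platform: string,
): SpawnPlan {
  // Transient units get a minimal PATH, so a bare command name may not resolve.
  if (!isAbsolute(daemonPath)) {
    throw new Error(`daemon path must be absolute: ${daemonPath}`);
  }
  if (platform === "linux" && env.INVOCATION_ID) {
    // A fresh name per spawn avoids colliding with a dead unit that is still loaded.
    const unit = `kaval-${randomUUID()}.service`;
    const args = ["--user", `--unit=${unit}`, "--collect", "--", daemonPath];
    return { kind: "systemd-run", command: "systemd-run", args, unit };
  }
  // Elsewhere (for example macOS) a plain detached spawn is used.
  return { kind: "detached", command: daemonPath, args: [] };
}

// Usage: inspect the plan, then hand command + args to child_process.spawn.
const plan = planDaemonSpawn("/opt/kolu/bin/kaval", process.env, process.platform);
console.log(plan.kind, plan.command, plan.args.join(" "));
```

Keeping the decision separate from the spawn call makes the gating rules testable on any platform; the linger requirement is a host configuration matter and is not something this function can check.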

tmux/dtach considered and rejected: they only keep a PTY alive — no OSC-parsed taps, no headless snapshot for lazy-attach, no home for the provider DAG. You’d still build pty-host’s streaming layer next to tmux and inherit its session model on top.

History