R-4 — kaval, the standalone PTY daemon
The redo plan for R-4 — local PTY daemonization, reframed: the daemon is kaval, a standalone program in the drishti/odu tradition, and kolu is its first client. One rule (package boundary = process boundary = staleKey hash), a dumb fully-specified wire, and four PRs — the spawn-policy inversion, the kaval binary + client, the door with survival off, then survival — each complete w.r.t. the hazards it opens. Carries the #1034 postmortem and its hard constraints.
The R-4 plan of record — local PTY daemonization + @kolu/pty-host. Split out of the parent remote-terminals plan on 2026-05-30; the parent keeps R-1/R-1.5/R-1.6 (shipped) and R-2/R-3 (post-R-4), and now also owns the multi-host direction (1 local + N ssh-remote pty-hosts, host switching in the ChromeBar) that this plan must not foreclose.
Current state: A1 (#1055), A2 (#1063), B0 (#1292), B1 (#1301), B2 (#1310), B3.1 (#1330), B3.2 (#1337), B3.3 (#1344), and B3.4 (#1353) are on master. The first Phase B build (#1275) shipped the full feature as one 40-commit PR, was verified live in production, and was then deliberately discarded: functional, but the architecture was discovered in review rather than designed. This note is the redo plan — same user-facing functionality, a designed architecture, and a PR split sized so each PR can be executed end-to-end by a single agent session.
Reframed 2026-06-12: the daemon is not a kolu module that happens to run in its own process — it is kaval (Tamil kāval — watch, guard), a standalone program in the tradition of drishti and odu, with kolu as its first and biggest client. The split is now four PRs: B0 (the spawn-policy inversion) → B1 (the kaval binary + client) → B2 (the door) → B3 (survival). B0 shipped in #1292 (2026-06-12) — the wire is fully specified and the package carries zero kolu-* deps; B1 shipped in #1301 (2026-06-12) (the rename to kaval + kaval-tui, the daemon binary, @kolu/surface-daemon, full e2e); B2 — the door — shipped in #1310 (2026-06-12) (the topology flip, @kolu/surface-daemon-supervisor, the honest degraded state + the live KAVAL rail column/dialog); B3 — survival — shipped as a four-PR chain (B3.1, the seam-carving refactor, shipped in #1330; B3.2, the supervised session-preserving restart, shipped in #1337; B3.3, adoption — terminals survive a deploy that didn’t change kaval’s source — shipped in #1344; B3.4, the currency nudge, shipped in #1353, completing the chain) after a one-big-PR attempt (#1326) was closed (too big; two blocking data-loss bugs survived its own review).
The same week, surface-daemon (#1294) named the daemon-lifecycle machinery that B1/B2 build as a shared spine with odu serve as its second tenant — and B1 now ships the spine’s daemon half as @kolu/surface-daemon from the get-go (the review-isolation decision below), plus full e2e coverage for the kaval + kaval-tui pair; the supervisor half is born as its own package — @kolu/surface-daemon-supervisor — in B2 itself (revised 2026-06-12 on B2 planning; S1 shrinks to the handshake move).
Companions: the srv · pty ChromeBar rail (A2’s deliverable, shipped) and kolu-tui (Phases 0–2 shipped — the CLI that dials the pty-host unix socket, today served by the in-process server, after B2 by the daemon; renamed kaval-tui in B1, when it gains a bin of its own).
The architecture — one rule, then a module map
The one rule everything follows from: the package boundary IS the process boundary IS the staleKey hash set. kaval (today @kolu/pty-host; renamed in B1) contains exactly the code that executes inside the daemon — the PTY primitive, the wire contract, the taps, the socket serving, and the process entry (pid-gate acquisition, the daemon’s own root/rcDir, exit handling). The staleKey is “the nix hash of the package dirs that run in the daemon” — kaval whole, plus its daemon-side workspace roots (terminal-protocol today; surface-daemon’s daemon half after B1), each hashed whole with zero file-level exceptions — and buildId.closure.test.ts walks the import graph from kaval’s two entries (index.ts, the embedded surface, + the daemon bin.ts) — the closure test now answers the exact question the staleKey asks: what would a restart gain? Everything that supervises from outside lives outside the hash — correctly, because changing the supervisor never changes what a restart would gain: the spine (endpoint states, waitForPidGone, the composed restart, the survivable-spawn mechanism) in @kolu/surface-daemon-supervisor (born B2, deliberately not a staleKey root), the soul (localDriver’s kaval parameter bundle, B3’s reconcile) in packages/server. And after B0, so does all spawn policy — it crosses the wire as data, so changing kolu’s shell arcana never forces a daemon restart either.
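
The closure rule can be illustrated with a small reachability check. This is a minimal sketch, not the contents of buildId.closure.test.ts: the `ImportGraph` shape, the helper names, and the file paths in the demo are all hypothetical, and a real test would derive the graph by parsing imports rather than from a literal.

```typescript
// Hypothetical shape: each hashed file maps to the files it imports.
type ImportGraph = Record<string, string[]>;

/** Walk the import graph from the entries and return every reachable file. */
function reachableFrom(graph: ImportGraph, entries: string[]): Set<string> {
  const seen = new Set<string>();
  const stack = [...entries];
  while (stack.length > 0) {
    const file = stack.pop()!;
    if (seen.has(file)) continue;
    seen.add(file);
    for (const dep of graph[file] ?? []) stack.push(dep);
  }
  return seen;
}

/** Hashed files that are neither reachable from an entry nor a test file. */
function closureViolations(
  graph: ImportGraph,
  entries: string[],
  hashed: string[],
): string[] {
  const reachable = reachableFrom(graph, entries);
  return hashed.filter((f) => !reachable.has(f) && !f.endsWith(".test.ts"));
}

// Demo with made-up paths: one stray helper is hashed but never imported.
const graph: ImportGraph = {
  "kaval/index.ts": ["kaval/surface.ts"],
  "kaval/bin.ts": ["kaval/daemonMain.ts", "kaval/surface.ts"],
  "kaval/daemonMain.ts": [],
  "kaval/surface.ts": [],
};
const hashed = [
  ...Object.keys(graph),
  "kaval/surface.test.ts",
  "kaval/strayHelper.ts",
];
const violations = closureViolations(
  graph,
  ["kaval/index.ts", "kaval/bin.ts"],
  hashed,
);
console.log(violations);
```

The design point is that the check is "reachable or test", so a shared non-test helper that no entry imports fails the guard even though it is harmless at runtime.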
kaval — a program of its own; kolu, its first client
Three insights (2026-06-12) reframe what the daemon is, without changing what Phase B builds:
- The graduation pattern already exists. drishti grew out of the remote-process-monitor example; odu grew out of mini-ci; both are surface agents that graduated from kolu’s monorepo once their domain stabilized. kaval — Tamil kāval, watch/guard: the thing that stands watch over your terminals — is the third graduate, and the most natural, since ptyHostSurface already is a defineSurface() declaration. It has been a surface app all along, just trapped inside kolu-server’s process. B1 makes it graduation-ready (own bin, own socket, zero kolu deps); actual extraction to its own repo waits until the domain stops churning, exactly like the other two.
- The layering is: dumb-but-durable kaval ← kolu the session brain ← kolu the web UI. kaval holds fds, mirrors screens, serves taps — and nothing else. kolu-server remains a substantive middle tier, deliberately: the #1031 postmortem is binding (daemonizing the provider DAG served stale detection on every deploy), so session persistence, reconciliation, the provider DAG, and all spawn policy stay kolu’s domain, re-run fresh against surviving PTYs. kolu is not a thin client — but it is a client, one of several (kaval-tui today, an MCP face later).
- Spawn policy crosses the wire as data, never as code. Today cleanEnv/koluIdentityEnv/prepareShellInit are baked into the daemon’s spawn handler — kolu’s sensor system (the OSC 7/2/633 hooks powering cwd tracking, foreground detection, and agent awareness) implanted host-side. B0 inverts this: kaval exposes host facts (system.info → {shell, home, platform, rcDir}), and spawn takes the full specification {argv, env, initFiles} — the daemon writes the rcfiles it is handed and asks no questions. TERM_PROGRAM=kolu is asserted by the tier that renders pixels, not the tier that holds fds — tmux doesn’t claim to be your terminal emulator either. A bare kaval client spawns a plain shell; the tap channels just stay quiet. The inversion is also what makes a remote kaval possible: shell-init content for a host that isn’t kolu’s own machine is computable from system.info, and the rcfiles land daemon-side where kolu’s hands can’t reach.
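
A minimal sketch of what "policy as data" could look like from the client side. The field names follow the wire shapes quoted in this note; the field types, the `composeSpawn` helper, the rcfile content, and the assumption that the host shell is bash (hence `--rcfile`) are illustrative only and not taken from the kolu source.

```typescript
// Wire shapes as described in this note; the concrete field types are assumptions.
interface SystemInfo {
  shell: string;
  home: string;
  platform: string;
  rcDir: string;
}
interface InitFile {
  name: string;
  content: string;
}
interface SpawnInput {
  id: string;
  argv: string[];
  cwd: string;
  env: Record<string, string>;
  initFiles: InitFile[];
  scrollback?: number;
}

/**
 * Hypothetical client-side composition: every kolu-specific decision is made
 * here, from host facts, and shipped to the daemon as plain data.
 * Assumes a bash shell so that `--rcfile` is meaningful.
 */
function composeSpawn(
  id: string,
  cwd: string,
  info: SystemInfo,
  baseEnv: Record<string, string>,
): SpawnInput {
  const rcName = `${id}.rc`;
  return {
    id,
    cwd,
    argv: [info.shell, "--rcfile", `${info.rcDir}/${rcName}`],
    env: { ...baseEnv, HOME: info.home, TERM_PROGRAM: "kolu" },
    // Placeholder content; the real shell-init payload lives in kolu's tier.
    initFiles: [{ name: rcName, content: "# hypothetical shell-init content\n" }],
  };
}

const input = composeSpawn(
  "t1",
  "/tmp",
  { shell: "/bin/bash", home: "/home/u", platform: "linux", rcDir: "/run/kaval/rc" },
  { PATH: "/usr/bin" },
);
console.log(JSON.stringify(input, null, 2));
```

Because the daemon only ever sees the resulting object, a change to the composition logic touches no daemon-side file and so cannot alter the staleKey.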
The remaining axes, most stable → most volatile (the boundaries that own them):
| Axis | Rate | Boundary that owns it |
|---|---|---|
| Everything in the daemon process | rarely | kaval (today @kolu/pty-host) — hashed whole, alongside its daemon-side roots (terminal-protocol; surface-daemon’s daemon half post-B1); the entry travels with the package, so R-2’s remote daemon gets single-instance arbitration for free. |
| Wire contract + compat + identity | per wire change | ptyHostSurface + PTY_HOST_CONTRACT_VERSION + PtyHostIdentity, in-package (unchanged from A1/A2; bumped to 3.0 in B0 ✅ for the fully-specified spawn + system.info). |
| Spawn policy — env basis, identity vars, shell-init rcfiles | per shell/OS quirk | Client-side: kolu-pty consumed by packages/server; crosses the wire as data (spawn {argv, env, initFiles}), never as code. kaval stays host-agnostic. |
| Identity values: staleKey vs navigableCommit | key: per wire change · commit: per deploy | buildId.ts (unchanged). Currency is derived at the read site (staleKey !== currentBuildId()), never stored, never frozen onto a connection. |
| Endpoint status (the one health owner) | per endpoint-model change | @kolu/surface-daemon-supervisor’s endpoint.ts (born B2) — a single-host endpoint per spec (keyed implicitly by its spec.hostId), emitting {state, identity, startedAt} on every transition via onStatus(hostId, …). The per-hostId map lives server-side — daemonStatus.ts’s Map<hostId, DaemonStatus> (a map of one, today: local); everything else — buildInfo, rail, degraded canvas — derives by subscription. |
| Survivable-spawn mechanics | per platform quirk | @kolu/surface-daemon-supervisor’s default DaemonDriver (survivableSpawnDriver(DaemonSpawnConfig)) — the fromSource gate (!fromSource && INVOCATION_ID set → systemd-run --user with per-spawn unique unit names; otherwise, macOS included → detached+unref); waitForPidGone (ESRCH poll, load-aware ceiling); composes the gate’s gatePid/isHolderLive. |
| kaval’s reach values | per kolu packaging change | localDriver.ts in packages/server — the parameter bundle: the kaval binary from the kolu closure, dev-flag exec-arg filter, unit prefix, socket/gate paths, the --setenv set (shrunk to daemon-operational vars (XDG_RUNTIME_DIR) — PTY env arrives per-spawn on the wire after B0). R-2 adds sshDriver as a sibling (HostSession reach + provisionAgent, shipping the same kaval closure). |
| Recovery sequence | per recovery policy | restart.ts — one composed capture → drain → recycle → reattach, where recycle is endpoint.ensure() (kill → waitForPidGone → spawn → connect); the three caller steps (RestartSteps) are non-optional in the type, with save folded into capture. The forced and user paths are the same function. (B3.2 ✅ #1337 added serializeRestart — coalesces concurrent callers onto one in-flight recycle — and the restarting state, held across the whole sequence by endpoint.holdRestarting; ENDPOINT_STATES is now connecting/connected/restarting/degraded/dead.) |
| Reconciliation + adoption | per persisted field | reconcile.ts (a pure function: daemon.list() × savedSession → {adopt, adoptOrphans}) + adoptTerminal as a sibling of the existing spawnAndWire in local.ts (adopt and spawn converge on one post-wire path). Adoption consumes the whole SavedTerminal record as a unit — never field-by-field reconstruction — and a schema-level round-trip test iterates the schema’s keys. |
| Health → UI projection | per UX tweak | A per-host status collection on the surface; rail + DegradedCanvas + restart dialog subscribe. User-facing name: “kaval” — it’s a thing of its own, and the name says so. |
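
The "one health owner, everything else subscribes" row and the "currency is derived at the read site" row can be sketched together. This is an illustration under assumptions: `StatusOwner`, `isStale`, and the minimal `DaemonIdentity` shape are hypothetical stand-ins, not the actual endpoint.ts or daemonStatus.ts code.

```typescript
type EndpointState = "connecting" | "connected" | "restarting" | "degraded" | "dead";

// Assumed minimal identity; the real PtyHostIdentity carries more fields.
interface DaemonIdentity {
  staleKey: string;
}
interface DaemonStatus {
  state: EndpointState;
  identity: DaemonIdentity | null;
  startedAt: number | null;
}
type StatusListener = (hostId: string, status: DaemonStatus) => void;

/** Hypothetical single owner: a per-host map that emits on every transition. */
class StatusOwner {
  private readonly byHost = new Map<string, DaemonStatus>();
  private readonly listeners = new Set<StatusListener>();

  subscribe(listener: StatusListener): () => void {
    this.listeners.add(listener);
    return () => {
      this.listeners.delete(listener);
    };
  }

  transition(hostId: string, status: DaemonStatus): void {
    this.byHost.set(hostId, status);
    this.listeners.forEach((listener) => listener(hostId, status));
  }

  get(hostId: string): DaemonStatus | undefined {
    return this.byHost.get(hostId);
  }
}

/** Currency is computed when read, from the latest status, and never stored. */
function isStale(status: DaemonStatus | undefined, currentBuildId: string): boolean {
  if (!status || !status.identity) return false;
  return status.identity.staleKey !== currentBuildId;
}

const owner = new StatusOwner();
const seen: string[] = [];
owner.subscribe((hostId, status) => {
  seen.push(`${hostId}:${status.state}`);
});
owner.transition("local", { state: "connecting", identity: null, startedAt: null });
owner.transition("local", { state: "connected", identity: { staleKey: "abc" }, startedAt: 1 });
console.log(seen, isStale(owner.get("local"), "abc"), isStale(owner.get("local"), "def"));
```

Keying the map by hostId costs nothing with one host and keeps the shape ready for more than one.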
The daemon-package question is settled in two halves (revised 2026-06-12, after #1294). surface-daemon names the shared spine — atomic pid-gate, the daemonMain skeleton (gate → serve → SIGTERM teardown), the system.version/build-id handshake fragment, and the endpoint supervisor (state machine · spawn/waitForPidGone drivers · composed restart) — and identifies the second tenant: not R-2’s ssh driver (still a sibling behind the endpoint concept, not a consumer of the local driver’s code) but odu serve, odu-runner’s long-lived CI coordinator, which needs the identical four pieces. The split: the daemon half is born as @kolu/surface-daemon in B1 itself; the supervisor half is born as its own separate @kolu/surface-daemon-supervisor package in B2 itself (the decision on shape — a package, never a /supervisor subpath — stands; what fell on B2 planning, 2026-06-12, was the deferral that had it gestate in server/src/ptyHost/ until an S1 extraction). Creating the daemon package from the get-go buys two things — review isolation (the mechanism is reviewed once, as a self-contained package; the rest of B1 stays renames plus a ~20-line composition) and one home for the gate’s file format (acquirePidGate for the daemon, plus the gatePid/isHolderLive primitives B2’s supervisor composes where it lives — so no supervisor reader sits in the daemon-hashed package). The usual one-consumer objection is weak here: the parameter surface (scope key, socket path, the router, lifetime forever | idleTimeout) is already designed against both consumers in the surface-daemon note, and a private workspace package is cheap to re-cut when odu serve arrives.
The staleKey constraint shapes where, never when: the daemon package is hashed whole into kaval’s key, as a third root beside terminal-protocol (the closure test’s existing multi-root pattern) — correct because everything in it is part of the one daemon binary a restart loads (the serve half in the daemon process, and since P2.5 the durable stdio front frontDaemonOverStdio in the per-link proxy reached from bin.ts’s --stdio dispatch). A supervisor file in the daemon package would flip kaval’s key on every supervisor-only edit (A2’s over-prompting failure, reborn), so the daemon package carries a standing invariant — only daemon-binary code (serve + front) may live here — and the supervisor gets its own package, where the package boundary is the hash boundary with no subdir glob to mis-scope. The original sequencing deferred that package to S1; on B2 planning it failed its own audit. “Wait for the API to settle” is a backwards-compat argument, and a private workspace package with one downstream — edited in the same PR — has no compat obligation. “Wait for the second consumer” lost to the very precedent it sat beside: the daemon package was born in B1 with one consumer. And the quantified cost is ~five scaffolding files plus one default.nix fileset line, with zero hash surface — the supervisor package is deliberately not a staleKey root, so there is no closure-test or build-id wiring to get wrong. What early birth buys is the same review isolation that won in B1, plus the spine/soul trap enforced by the package boundary from the first commit instead of by reviewer vigilance — the #1275 mid-PR-extraction lesson applied in advance. S1 shrinks to one move: the handshake fragment into @kolu/surface.
Multi-host readiness (the shapes, not the feature)
The parent note owns the direction: 1 local + N ssh-remote pty-hosts, host switching in the ChromeBar, reattach to all of them. Phase B ships exactly one endpoint but must be host-count-agnostic in its shapes: the endpoint map is keyed by hostId; daemon status is a per-host collection (not a singleton cell); adoption and the saved session join on id today, shaped to become (host, id) in R-2; the TerminalLocation discriminator from R-1 already sits at the single getTerminalBackendFor dispatch seam, ready for the persisted location field R-2 threads onto records; and the daemon-side code is one hashed package that runs identically on a remote host. None of this costs more than the singleton version; retrofitting any of it later costs another #1275.
The kaval reframing collapses R-2’s provisioning story into existing machinery: a remote pty-host is the same kaval closure shipped by @kolu/surface-nix-host — nix copy, realise, run — exactly how odu provisions odu-runner onto lane hosts and drishti its agent. B0’s system.info is what makes spawn policy computable for a host that isn’t kolu’s own machine; B0’s initFiles is what lets the rcfiles land on a disk kolu’s hands can’t reach.
What this makes impossible by construction
The production failures — four from #1034, four more paid for during #1275 — each become a failing test at the phase that owns the concept:
| Production failure | Killed at | How |
|---|---|---|
| Mis-scoped staleness key (wire change didn’t flip it) | A2 ✅ | Closure-scoped key + import-walk guard (now re-rooted at the daemon entry in B1). |
| Over-prompting (key nudged on every deploy) | A2 ✅ | Server-only change leaves the key bit-identical — a falsifiable test. |
| Data-loss restart (kill-then-pray, #1034) | B3.2 ✅ | One composed restart, snapshot-before-kill is the capture step (setSavedSessionFromSnapshot); the drain fires no terminals:dirty, so no autosave can clobber the capture. |
| Empty-canvas lie (dead daemon → “no terminals”) | B2 | Honest dead/degraded state ships with the door, before any survival promise exists. |
| Lossy adoption (#1275: splits un-nested, then agent-resume lost and autosaved, poisoning cold restores) | B3.3 | Adoption consumes the whole persisted record; a schema-key-iterating round-trip test closes the class; a non-survivor is an exited shell, dropped as handleExit already does — never a hidden, autosave-clobbered restore card. |
| Lazy adopt-on-spawn (#1275: orphan shells respawned into the daemon — duplicated terminals) | B3.3 | Eager boot-time reconciliation is the only adopt path; a live PTY with no saved record (a create that raced the debounced autosave) is adopted from the live snapshot, never re-spawned (so no duplicate), and never reaped (so it survives the redeploy); exited shells are dropped (not respawned). |
| Identity gone stale after restart (#1275: one-shot read, then manual republish at enumerated call sites) | B2 | One status owner emitting on every transition; everything derives by subscription. |
| Contract-skew crash-loop masked by launchd as an “App updated” loop (the zest field report on #1275 ) | B2 | Version checked over the socket on every connect, never an import-time throw; skew at boot → controlled recycle (B2) / composed restart (B3.2). |
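
The adoption rows describe reconciliation as a pure join between the daemon's live list and the saved session. A minimal sketch of that join follows, assuming simplified `LivePty` and `SavedTerminal` shapes; the function body is an illustration of the rule, not the reconcile.ts implementation.

```typescript
// Simplified, assumed shapes; the real records carry many more fields.
interface LivePty {
  id: string;
}
interface SavedTerminal {
  id: string;
  [field: string]: unknown;
}
interface Reconciled {
  /** Live PTYs with a saved record: adopted with the whole record as a unit. */
  adopt: { live: LivePty; saved: SavedTerminal }[];
  /** Live PTYs with no saved record: adopted from the live snapshot, never respawned. */
  adoptOrphans: LivePty[];
}

/** Pure join on id. Saved records with no live PTY are non-survivors and are dropped. */
function reconcile(live: LivePty[], saved: SavedTerminal[]): Reconciled {
  const savedById = new Map(saved.map((s) => [s.id, s] as const));
  const out: Reconciled = { adopt: [], adoptOrphans: [] };
  for (const pty of live) {
    const record = savedById.get(pty.id);
    if (record) {
      out.adopt.push({ live: pty, saved: record });
    } else {
      out.adoptOrphans.push(pty);
    }
  }
  return out;
}

const result = reconcile(
  [{ id: "a" }, { id: "b" }],
  [{ id: "a", cwd: "/work" }, { id: "c", cwd: "/gone" }],
);
console.log(JSON.stringify(result));
```

In the demo, "a" is adopted with its full record, "b" is an orphan adopted from the live side, and "c" appears in neither output because its shell did not survive.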
The PRs — each an executable brief
No feature flags — each PR ships complete to master and leaves kolu working. The hazard rule, refined from the first attempt: each PR must be complete w.r.t. the hazards it opens — B0 opens none (an in-process refactor with byte-identical behavior), and the door can open with the survival promise off, which empties its hazard set by policy. B0–B2 shipped as single PRs; B3 is itself a four-PR chain (below), each self-sufficient, after the one-big-PR attempt (#1326) was closed.
B0 — the inversion: dumb wire, client policy · shipped
Goal: flip who owns spawn policy. The pty-host’s wire becomes fully specified — the daemon spawns exactly what it is told and asks no questions — and every kolu-ism moves to (or stays in) kolu’s tier. In-process only; zero user-visible change; zero new processes. This is the contract change, made at the one moment it is free: no daemon exists yet, no survivors, both consumers in-repo. Shipped in #1292 (merged 2026-06-12) — contract bumped to 3.0; every Verify item below held (CI green on both platforms; the closure test’s tightened allowlist now enforces the dependency diet). The review gauntlet’s one real find, folded in: init files are cleaned up on partial-write and on spawn failure, not just on PTY exit.
Deliverables:
- The contract change (one PTY_HOST_CONTRACT_VERSION bump; the server moves in lockstep in the same PR, and an old kolu-tui binary refuses politely via the existing handshake):
  - spawn input becomes {id, argv, cwd, env, initFiles: [{name, content}], scrollback?} — the daemon writes initFiles into its own rcDir before spawn and removes them when the PTY exits; env and argv arrive whole; nothing is derived from the daemon’s process.env.
  - system.info → {shell, home, platform, rcDir} — host facts, read once per connection, so a client can compute shell-init content for a host that isn’t its own machine (the R-2 enabler); rcDir is the host-side dir the client points its initFiles/argv/env paths at.
- Policy relocation: the server’s spawn call sites compose cleanEnv() + koluIdentityEnv() + prepareShellInit() (against system.info) into the wire input. kolu-pty’s ~330 lines of shell arcana move tiers untouched — kolu-pty becomes a packages/server dependency only.
- Dependency diet: kolu-pty, kolu-common, kolu-shared drop out of packages/pty-host — PTY ids are opaque strings at the wire (kolu keeps TerminalIdSchema at its own boundary); scrollback is a spawn input with an in-package default; Logger becomes a structural in-package type. What remains is graduation-compatible: @kolu/surface, @kolu/terminal-protocol, node-pty, @xterm/*.
- Socket path: the app-name in getPtyHostSocketPath() parameterized (no hardcoded “kolu”), still defaulting to today’s path until B1.
Traps: behavior parity IS the deliverable — golden-test that the assembled rcfile content reaching the PTY is byte-identical before/after the inversion; no policy may remain daemon-side (a test walks the package’s import graph and fails on any kolu-pty/kolu-common/kolu-shared hit); the staleKey flips on this deploy (a wire change — correct, not a regression).
Verify: e2e green — titles, cwd tracking, agent detection all ride the relocated OSC hooks, so the existing features are the test; packages/pty-host/package.json shows zero kolu-* workspace deps; a server-only change after this PR still leaves the staleKey bit-identical.
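
The review find noted for B0 (init files cleaned up on partial write, not only on PTY exit) can be sketched as follows. This is an assumption-laden illustration: the helper names are hypothetical, and the rule that a file name must be a bare basename is added here only to give the demo a way to fail partway through.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

interface InitFile {
  name: string;
  content: string;
}

/** Remove previously written init files; missing files are ignored. */
function removeInitFiles(paths: string[]): void {
  for (const p of paths) fs.rmSync(p, { force: true });
}

/**
 * Write every init file into rcDir. If any write fails, the files already
 * written by this call are removed before the error propagates.
 */
function writeInitFiles(rcDir: string, files: InitFile[]): string[] {
  const written: string[] = [];
  try {
    for (const file of files) {
      // Assumed validation (not from the note): names must not escape rcDir.
      if (path.basename(file.name) !== file.name) {
        throw new Error(`invalid init file name: ${file.name}`);
      }
      const target = path.join(rcDir, file.name);
      fs.writeFileSync(target, file.content, { mode: 0o600 });
      written.push(target);
    }
    return written;
  } catch (err) {
    removeInitFiles(written);
    throw err;
  }
}

// Demo in a throwaway directory.
const rcDir = fs.mkdtempSync(path.join(os.tmpdir(), "kaval-rc-"));
const kept = writeInitFiles(rcDir, [{ name: "a.rc", content: "# a\n" }]);
let failed = false;
try {
  writeInitFiles(rcDir, [
    { name: "ok.rc", content: "# ok\n" },
    { name: "../escape", content: "# bad\n" },
  ]);
} catch {
  failed = true;
}
console.log(failed, fs.readdirSync(rcDir));
```

A spawn handler would call `removeInitFiles` again with the returned paths when the spawn itself fails or when the PTY exits, which covers the other two cleanup points the note lists.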
B1 — kaval: the binary and its client · shipped · #1301
Goal: the daemon becomes a thing of its own — packages/pty-host → kaval, packages/pty-tui → kaval-tui, both with bin entries, runnable as a pair on a machine where kolu has never been installed. Zero changes to packages/server beyond mechanical import renames. Shipped in #1301.
Deliverables:
- @kolu/surface-daemon — the spine’s daemon half, a new package from the get-go (the two-halves decision above).
  - pidGate.ts: the atomic gate, both sides — acquire (write pid to a temp file, link(2) it into place; on EEXIST read the gate, liveness-probe kill(pid,0), exit 0 if a live daemon holds it, unlink-and-retry if stale) and read (which B2’s supervisor imports from here).
  - daemonMain.ts: the skeleton — gate → serve → on SIGTERM close the socket, release the gate, exit; one structured boot log; parameterized over gate path, socket path, the surface router, and lifetime (forever | idleTimeout(ms) — kaval picks forever, odu serve will pick the other). Program-agnostic: kaval’s choices arrive as parameters, never as inline branches, and the gate-race unit tests live here with the mechanism. The full export surface — with kaval’s and odu serve’s compositions side by side — is sketched in surface-daemon.
- packages/kaval/src/daemonMain.ts — a thin composition (~20 lines): the skeleton with kaval’s parameters — gate + socket under $XDG_RUNTIME_DIR/kaval/ for a standalone run (a --socket flag overrides — kolu-server passes a per-instance kaval-<port>/ path since #1313, so two servers never share a gate), the daemon’s own root + rcDir from an in-package helper (do not import the server’s koluRoot), servePtyHost’s router, lifetime forever. (B0 already removed the daemon’s env role — there is no env-application step.)
- The rename: package kaval, bin kaval; env vars KAVAL_BUILD_ID/KAVAL_COMMIT_HASH (staleKey semantics unchanged in spirit: the nix hash of the daemon-side package dirs); nix wiring per the new-package checklist, twice — the kaval rename and the new surface-daemon package (default.nix fileset, lockfile, vitest).
- kaval-tui: its own bin, dialing kaval’s socket by default, --socket flag to override (until B2, reaching a running kolu’s in-process terminals needs the explicit kolu path — it’s a standalone tool now, and that’s the point).
- kolu-tui is dropped entirely — no compat bin, no alias; the website’s /tui page (website/src/pages/tui.astro) is rewritten around kaval + kaval-tui, presenting them as the standalone pair they now are — later replaced outright by the kaval-centric /kaval page (website/src/pages/kaval.astro, with a logo and Tamil etymology) in #1314.
- buildId.closure.test.ts re-rooted at the daemon entry, now walking three roots — kaval, terminal-protocol, surface-daemon — every file hashed and reachable-or-test, zero exceptions. The server importing surface-daemon is fine (B2 composes the gate’s gatePid/isHolderLive primitives where it lives); supervisor code living in surface-daemon is not — it lands in its own package, @kolu/surface-daemon-supervisor, born in B2.
- e2e, full coverage — B1’s promise is “kaval is a real program,” and the proof is end-to-end, against a real spawned kaval process over its unix socket:
  - The corpus over both links. The contract lifecycle corpus (spawn → list → snapshot-first attach → write/resize → screen state/text → the taps → kill/killAll, today inline in inProcessPtyHost.test.ts) becomes an exported suite factory instantiated twice — over the identity link (today’s fast path, unchanged) and over the spawned daemon’s socket. One corpus, two links: the daemon can never drift from in-process behavior unnoticed. Closure-guard constraint: the corpus and helpers live in .test.ts files — a shared non-test helper in a hashed package is unreachable-from-entry and fails the closure test (a lesson already paid for, in B0’s review).
  - The coverage ledger. A meta-test iterates ptyHostSurface’s procedure and stream keys and fails if the socket-link corpus left any unexercised — “full coverage” is mechanical, the same philosophy as the closure test and B3’s schema-key round-trip.
  - Daemon-only scenarios, the part no in-process test can reach: boot + handshake (system.version compatible, identity populated, system.info sane); the gate-race choreography with real processes (A acquires → B exits 0 while A lives → SIGKILL A → C steals the stale gate); double-bind hits the socket’s already-served refusal as the second tripwire; SIGTERM teardown (socket unlinked, gate released, clean exit); initFiles across a real process boundary (the rcfile’s effect visible in getScreenText, the file removed on kill); SIGKILL mid-attach → the client’s stream errors rather than hangs; restart on the same socket serves fresh with an empty list (B1 makes no survival promise — pin the honest behavior); kaval-tui list + attach against the real daemon.
  - Hygiene: per-test mkdtemp’d XDG_RUNTIME_DIR; teardown kills every spawned daemon; timeouts sized for a loaded CI box, not an idle dev one (#1034 lesson 3). Runs in the ordinary vitest lane on both CI platforms — the darwin lane’s presence is itself a check.
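
The acquire side of the gate, as described in the pidGate.ts bullet, can be sketched with Node's `fs.linkSync`, which fails with `EEXIST` when the target already exists. This is a hedged illustration and not the @kolu/surface-daemon code: `acquireGateSketch`, the result strings, the retry bound, and the injectable liveness probe are all hypothetical.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

type GateResult = "acquired" | "held-by-live-daemon";

/** kill(pid, 0) probe: EPERM means the process exists but is not ours. */
function isAlive(pid: number): boolean {
  try {
    process.kill(pid, 0);
    return true;
  } catch (err) {
    return (err as { code?: string }).code === "EPERM";
  }
}

/** Hypothetical acquire: temp file, then an atomic hard link into place. */
function acquireGateSketch(
  gatePath: string,
  pid: number = process.pid,
  alive: (pid: number) => boolean = isAlive,
): GateResult {
  const tmp = `${gatePath}.${pid}.tmp`;
  fs.writeFileSync(tmp, String(pid));
  try {
    for (let attempt = 0; attempt < 5; attempt++) {
      try {
        fs.linkSync(tmp, gatePath); // fails with EEXIST if a gate is present
        return "acquired";
      } catch (err) {
        if ((err as { code?: string }).code !== "EEXIST") throw err;
      }
      let holder: number;
      try {
        holder = Number(fs.readFileSync(gatePath, "utf8").trim());
      } catch {
        continue; // the gate vanished between link and read; try again
      }
      if (Number.isInteger(holder) && holder > 0 && alive(holder)) {
        return "held-by-live-daemon";
      }
      fs.rmSync(gatePath, { force: true }); // stale gate: unlink and retry
    }
    throw new Error("could not acquire gate after several attempts");
  } finally {
    fs.rmSync(tmp, { force: true });
  }
}

const gateDir = fs.mkdtempSync(path.join(os.tmpdir(), "kaval-gate-"));
const gate = path.join(gateDir, "kaval.gate");
const first = acquireGateSketch(gate);
const second = acquireGateSketch(gate);
// Simulate a dead holder by injecting a probe that always reports "not alive".
const stolen = acquireGateSketch(gate, process.pid, () => false);
console.log(first, second, stolen);
```

In a real daemon the "held-by-live-daemon" outcome would map to the exit-0 path the bullet describes; the sketch returns a value instead so it can be exercised in a single process.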
Traps: no import-time throws anywhere; no top-level await; neither kaval nor surface-daemon may import from packages/server. (The first attempt’s socket-collision trap dissolves by construction: kaval’s default path differs from the in-process kolu path.) The package boundary IS the spine/soul line now: nothing kaval-specific enters @kolu/surface-daemon (kaval’s choices arrive as parameters), and nothing supervisor-side enters it either — the whole-package staleKey hash forbids it, standing; the supervisor half gets its own un-hashed package in B2.
Verify: the coverage ledger passes — every contract procedure and stream exercised against a real spawned daemon over the socket. On a clean box (a pu box is ideal): run kaval, spawn + drive + detach + reattach a shell with kaval-tui — kolu nowhere in the picture (the human smoke over what the e2e suite automates). A server-only change leaves KAVAL_BUILD_ID bit-identical across two nix builds; touching daemonMain.ts or anything in surface-daemon flips it (that code runs in the daemon — correct, not a regression).
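
One way to make "every contract procedure exercised" mechanical is a recording wrapper around the client. The sketch below is a guess at the idea, not the actual meta-test: `withLedger`, `unexercised`, and the three-procedure fake client are hypothetical.

```typescript
type Procedures = Record<string, (...args: unknown[]) => unknown>;

/** Wrap a client so that every procedure looked up by name is recorded. */
function withLedger<T extends Procedures>(
  client: T,
): { client: T; exercised: Set<string> } {
  const exercised = new Set<string>();
  const wrapped = new Proxy(client, {
    get(target, prop, receiver) {
      if (typeof prop === "string") exercised.add(prop);
      return Reflect.get(target, prop, receiver);
    },
  });
  return { client: wrapped, exercised };
}

/** Contract keys the corpus never touched, sorted for stable output. */
function unexercised(contract: Procedures, exercised: Set<string>): string[] {
  return Object.keys(contract)
    .filter((key) => !exercised.has(key))
    .sort();
}

// Demo with a made-up three-procedure contract; the corpus skips `kill`.
const fake = {
  spawn: () => "spawned",
  list: () => [] as string[],
  kill: () => undefined,
};
const ledger = withLedger(fake);
ledger.client.spawn();
ledger.client.list();
const missing = unexercised(fake, ledger.exercised);
console.log(missing);
```

A meta-test built this way fails as soon as a procedure is added to the contract without a matching corpus step, which is the property the Verify line relies on.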
B2 — the door, with survival off · shipped · #1310
Goal: flip the topology — the server becomes a client of a daemon it spawns — while keeping user-facing semantics byte-identical to today: the boot policy is always recycle (connect-if-survivor → kill → waitForPidGone → spawn fresh → connect). No survivors exist, so no survival hazard can open: no orphans, no skew older than one boot, nothing for a restart to pray over. Every production deploy now exercises kill → wait-for-real-exit → respawn — the exact race #1034 lost — with zero sessions at stake.
Deliverables:
- @kolu/surface-daemon-supervisor — the spine’s supervisor half, a new package from the get-go (revised 2026-06-12; the two-halves decision above). endpoint.ts (state machine connecting | connected | degraded | dead; a single-host endpoint per spec that emits {state, identity, startedAt} on every transition — the per-hostId map is server-side), waitForPidGone.ts (poll kill(pid,0) → ESRCH; load-aware ceiling, ~120s default), restart.ts (the composed sequence — in this PR invoked only as the boot recycle, with capture/drain/reattach in their degenerate forms; the type already requires all three steps), and the survivable-spawn mechanism as the package’s default DaemonDriver (survivableSpawnDriver) — the fromSource gate (a built binary with INVOCATION_ID set → systemd-run --user; a from-source/dev caller, or macOS → plain detached+unref), per-spawn unique unit names, absolute-binary-path discipline — parameterized over DaemonSpawnConfig ({binPath, args, env, unitPrefix, fromSource}): host-platform volatility, not kolu volatility. Generic over the daemon client/identity types <C, I>; the driver and the connect client arrive via EndpointSpec; composes the gate’s gatePid/isHolderLive primitives from @kolu/surface-daemon (a one-directional workspace:* edge). Zero kolu-* deps, pinned by a dependency allowlist like surface-daemon’s own — and not a staleKey root: the supervisor never runs in the daemon, so it adds zero hash surface. Ships with a README in the surface-daemon mold: what’s mechanism vs soul, the public API table, and a usage example showing kolu-server’s composition (with odu serve’s projected one beside it).
- packages/server/src/ptyHost/ — the soul, now a parameter bundle: localDriver.ts supplies the values the spawn mechanism takes — the kaval binary resolved from the kolu closure, the dev-flag exec-arg filter, the kaval- unit prefix, the socket/gate paths, and the --setenv forwarding set (shrunk to daemon-operational vars like XDG_RUNTIME_DIR, because PTY env arrives per-spawn on the wire since B0: the #1275 env-forwarding bug class is gone by construction) — plus the composition wiring driver and surface client into the supervisor package’s machinery.
- Composition root: index.ts boots explicitly — parse argv → ensureLocalEndpoint() → construct LocalTerminalBackend with the injected client → listen. server/src/ptyHost.ts’s import-time construction is deleted; no module-global client, no top-level await.
- The flip: LocalTerminalBackend’s client is the endpoint’s socket-backed client; the server stops serving its pty-host socket (today’s index.ts:397-407) — kaval serves its own; kaval-tui’s default socket now reaches kolu’s terminals with no flags.
- Contract handshake on every connect (isContractVersionCompatible against system.version); skew at boot → recycle (the boot policy anyway). Never an import-time version throw.
- Honest minimal state: a per-host daemon status on the surface; the rail’s KAVAL column lights up (uptime from the daemon’s startedAt beside SRV’s, green/red dot); a mid-session daemon death presents as an explicit degraded state — visibly distinct from “you have no terminals”.
- e2e: per-worker XDG_RUNTIME_DIR socket isolation; hooks teardown kills spawned daemons.
Traps: no mode flag — the socket is the path (a KOLU_DAEMON_MODE toggle is a feature flag in a trench coat); do not implement adoption (assert terminal.list() is empty after the recycle); do not widen any shared schema; restore-card-on-deploy behavior must be exactly today’s. The surface-daemon constraint, supervisor half: endpoint.ts’s state machine, waitForPidGone, restart.ts’s composed sequence, and the survivable-spawn mechanism are spine and live in @kolu/surface-daemon-supervisor from birth (odu serve’s CLI reuses them at S2) — the package boundary enforces what was previously a review-vigilance rule. The line through the driver: the incantation is spine, the values are soul — the INVOCATION_ID gate, the systemd-run/detached fork, and the unique-unit-name discipline ship in the package; everything kaval-supplied (the binary, the dev-flag filter, the --setenv values, the paths, the unit prefix) stays inside localDriver.ts in packages/server, the one file that is soul — it cannot drift into the spine without crossing a package edge the dependency allowlist refuses.
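
The "incantation is spine, values are soul" line can be made concrete with a pure planning function that takes the values and returns the command to run. This is a sketch under assumptions: `planSpawn`, the `SpawnPlan` union, and the demo paths are hypothetical; `--user`, `--unit=`, and `--setenv=` are standard systemd-run options, but the exact argument list the real driver builds is not given in this note.

```typescript
// Config fields as named in this note; the plan type is hypothetical.
interface DaemonSpawnConfig {
  binPath: string;
  args: string[];
  env: Record<string, string>;
  unitPrefix: string;
  fromSource: boolean;
}
interface SpawnPlan {
  kind: "systemd-run" | "detached";
  command: string;
  args: string[];
}

/**
 * The gate described above: a built binary running under systemd
 * (INVOCATION_ID present) goes through systemd-run --user; everything
 * else, macOS included, is a plain detached child.
 */
function planSpawn(
  cfg: DaemonSpawnConfig,
  invocationId: string | undefined,
  uniqueSuffix: string,
): SpawnPlan {
  if (cfg.fromSource || !invocationId) {
    return { kind: "detached", command: cfg.binPath, args: [...cfg.args] };
  }
  const setenv = Object.keys(cfg.env).map((key) => `--setenv=${key}=${cfg.env[key]}`);
  return {
    kind: "systemd-run",
    command: "systemd-run",
    args: [
      "--user",
      `--unit=${cfg.unitPrefix}${uniqueSuffix}`, // unique per spawn
      ...setenv,
      cfg.binPath, // absolute path, per the discipline above
      ...cfg.args,
    ],
  };
}

// Hypothetical values, standing in for what localDriver.ts would supply.
const cfg: DaemonSpawnConfig = {
  binPath: "/nix/store/example-kaval/bin/kaval",
  args: ["--socket", "/run/user/1000/kaval-7681/kaval.sock"],
  env: { XDG_RUNTIME_DIR: "/run/user/1000" },
  unitPrefix: "kaval-",
  fromSource: false,
};
const underSystemd = planSpawn(cfg, "0123abcd", "42");
const devRun = planSpawn({ ...cfg, fromSource: true }, "0123abcd", "43");
console.log(underSystemd.args.join(" "), devRun.kind);
```

Keeping the decision pure means the gate can be unit-tested without spawning anything, while the actual `spawn` call stays a thin wrapper around the returned plan.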
Verify: e2e green on both platforms (the darwin lane’s presence is itself a check — its absence was once silent); kaval-tui works against the daemon with no flags; pkill-ing kaval mid-session shows the honest dead state; a staged deploy shows the recycle in the journal.
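
The kill → wait-for-real-exit step hinges on treating only `ESRCH` as "gone". A minimal sketch of that poll follows, assuming an injectable probe for testability; `isPidGone` and `waitForPidGoneSketch` are hypothetical names, and the fixed interval stands in for whatever load-aware ceiling logic the real waitForPidGone.ts uses.

```typescript
/** Throws like process.kill(pid, 0) does when the probe fails. */
type Probe = (pid: number) => void;

const defaultProbe: Probe = (pid) => {
  process.kill(pid, 0); // signal 0: existence check only, nothing is delivered
};

/** Only ESRCH means the process is gone; EPERM means it exists but is not ours. */
function isPidGone(pid: number, probe: Probe = defaultProbe): boolean {
  try {
    probe(pid);
    return false;
  } catch (err) {
    return (err as { code?: string }).code === "ESRCH";
  }
}

/** Poll until the pid is gone or the ceiling passes; resolves true if it exited. */
function waitForPidGoneSketch(
  pid: number,
  ceilingMs: number,
  intervalMs: number = 50,
  probe: Probe = defaultProbe,
): Promise<boolean> {
  const deadline = Date.now() + ceilingMs;
  return new Promise<boolean>((resolve) => {
    const tick = (): void => {
      if (isPidGone(pid, probe)) {
        resolve(true);
      } else if (Date.now() >= deadline) {
        resolve(false);
      } else {
        setTimeout(tick, intervalMs);
      }
    };
    tick();
  });
}

// Demo: our own pid is alive; a fake probe simulates a process that has exited.
const selfGone = isPidGone(process.pid);
const fakeGone = isPidGone(12345, () => {
  throw Object.assign(new Error("no such process"), { code: "ESRCH" });
});
const fakeForeign = isPidGone(12345, () => {
  throw Object.assign(new Error("not permitted"), { code: "EPERM" });
});
console.log(selfGone, fakeGone, fakeForeign);
```

A caller would send the kill signal first and spawn the replacement only after `waitForPidGoneSketch` resolves true, so the new daemon never races the old one for the gate or the socket.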
B3 — survival: four PRs, not one complete — B3.1–B3.4 shipped (currency · #1353 )
Goal: terminals survive a kolu update. The first build did it as one ~1800-line PR ( #1326 , closed) — too big to review (two blocking data-loss bugs survived its own gauntlet), the spine grew mid-implementation, and a misframed edge case (partial survival) dragged a race-sensitive autosave swamp into the diff that wasn’t even needed (see the crux). The redo is a shallow chain of four PRs — one refactor, then one capability each.
| PR | Kind | Ships (user-visible) | Spine symbol — frozen up front | Hazard killed by construction | Dep |
|---|---|---|---|---|---|
| B3.1 #1330 | refactor | none | spine: extract the ensure() helpers (liveServingHolder · killLiveHolder · spawnConnectHold); server: extract killHalfWiredPty (the F2 reap receptacle) + dedup the snapshot shape into one SessionSnapshot type (producer + autosave) | byte-identical (endpoint.test.ts is the gate); every extraction has a live consumer — the F1 receptacle (setSavedSessionFromSnapshot) ships in B3.2 beside its consumer, not as bare future-API here | — |
| B3.2 #1337 | feature | session-preserving restart of kaval — on a running daemon (pick up a new build · user-initiated) and a dead/degraded one (recover); finishes B2’s deferred “Restart kaval” button | restarting + serializeRestart + emit-guard; introduces setSavedSessionFromSnapshot (the F1 receptacle) beside its restart-capture consumer | F1 (the empty→null guard + its autosave-cancel test land here, with the consumer) · F3 · F4; fully CI/e2e-testable (recycle→fresh, restore from the empty canvas) | B3.1 |
| B3.3 #1344 | feature | live PTYs — process + scrollback + running agent — survive a deploy that didn’t change kaval’s source (staleKey unchanged — the common case) | adoptOrEnsure — reconcile before wiring | #1275 whole-record adopt + F2; a non-survivor is an exited shell → dropped (kolu’s normal handleExit), never restore-carded | B3.1–2 |
| B3.4 #1353 | feature | amber ⬆ update pending when kaval is a build behind → one-click recycle | none (reads existing identity.staleKey) | #1034 over-prompting (keyed on closure hash only); CI gate: build-id reaches the server | B3.2–3 |
Why B3.4 stands alone: it’s a distinct capability (the nudge + currency derivation) with its own hazard (#1034 over-prompting — never nudge when a deploy left kaval’s staleKey unchanged) and its own Nix gate, and it depends on both B3.2 (the restart it fires) and B3.3 (reachability — a build-behind daemon only exists once adoption keeps one alive; under always-recycle there’s nothing to nudge about). Its executable brief is B3.4 — currency nudge, below.
The crux — the partial case dissolves; #1326’s swamp was a conflation. Every #1326 bug and the session.ts doubling came from one mistake: treating an adopt-case non-survivor like a restart-case one. They’re different:
| Case | A non-survivor is… | Right move |
|---|---|---|
| B3.2 restart — daemon killed, all PTYs die | a terminal you still want (you didn’t close it) | restore it (re-spawn from the captured session) — on the empty canvas, so no live survivors, no autosave race |
| B3.3 adoption — daemon survived, a PTY is gone | an exited shell (its process ended in the restart window) | drop it — exactly what kolu’s handleExit already does when a shell exits with the server up |
So adoption’s “partial” case is trivial — adopt the live, drop the exited — with no restore card, no recycle, no autosave-durability machinery. #1326 mis-applied the restart-case restore card to the adopt case, which forced the pendingRestoreCard / union / session.restored cluster that burned four codex rounds for a problem that doesn’t exist. (If kolu ever keeps exited terminals around instead of closing them, revisit — that’s a separate feature.)
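The adopt-case rule above (adopt the live, drop the exited) can be sketched as a pure partition. This is a minimal illustration only: `SavedTerminal`, `AdoptionPlan`, and `planAdoption` are hypothetical names, and the real reconcile.ts is not shown in this plan.

```typescript
// Hypothetical shapes for illustration; the real saved-terminal record is richer.
interface SavedTerminal {
  id: string;
  cwd: string;
}

interface AdoptionPlan {
  adopt: SavedTerminal[]; // saved terminals whose PTY is still live in the daemon
  drop: SavedTerminal[]; // saved terminals whose shell exited in the restart window
}

// Partition the saved terminals by whether the surviving daemon still holds
// their PTY. Nothing here restores, recycles, or touches autosave.
function planAdoption(
  saved: readonly SavedTerminal[],
  livePtyIds: ReadonlySet<string>,
): AdoptionPlan {
  const adopt: SavedTerminal[] = [];
  const drop: SavedTerminal[] = [];
  for (const terminal of saved) {
    if (livePtyIds.has(terminal.id)) adopt.push(terminal);
    else drop.push(terminal);
  }
  return { adopt, drop };
}
```

The dropped set would be handed to the same path a normal shell exit takes, so there is no restore card in this branch.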
What upstreams to the spine libraries
@kolu/surface-daemon / @kolu/surface-daemon-supervisor are our libraries with our consumers (kolu today; odu serve next). We change them freely — no backwards-compat tax, and one consumer today is never a reason to keep mechanism in kolu. Program-agnostic mechanism upstreams; only kolu’s session/terminal policy stays soul. The split is the same line every PR draws: the spine adopts / recycles / serializes a connection; kolu reconciles that connection’s contents.
| Upstreams → @kolu/surface-daemon-supervisor (mechanism) | Stays in kolu-server (soul — session/terminal policy) |
|---|---|
| B3.3 adoptOrEnsure() — adopt a live, handshake-compatible survivor; recycle only an absent / dead / genuinely-skewed one. F4: a live survivor is killed only on a typed DaemonContractSkewError the soul’s connect raises (the one failure that proves incompatibility); a non-skew connect failure (transport dial / unreadable handshake) is retried and, if it persists, the survivor is left up (reported degraded) — never killed — so a daemon we merely cannot reach right now keeps its live PTYs. Generic: it adopts-or-recycles via the injected connect, branching only on the soul’s typed skew marker, knowing nothing of PTYs | reconcile.ts — which daemon PTYs map to which saved terminals; adoptTerminal (a sibling of spawnAndWire) whole-record adoption |
| B3.2 restarting state + serializeRestart (coalesce concurrent triggers) + the emit-guard (hold restarting across the inner recycle) | setSavedSessionFromSnapshot (introduced here, beside its consumer — moved out of B3.1 as bare future-API) + the daemon.restart RPC |
| B3.1 ensure() helper extraction · already there from B2: composed restart, waitForPidGone, the driver, per-transition identity reporting | killHalfWiredPty (the reap receptacle) · SessionSnapshot (one snapshot type spanning producer + autosave) |
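The adoptOrEnsure row above can be sketched as follows. This is an illustration under stated assumptions, not the shipped spine: the `AdoptDeps` shape, the `retries` knob, and the result union are hypothetical, and only the branching rule (kill on typed skew only, leave an unreachable survivor up) comes from the table.

```typescript
// The typed marker the soul's connect raises on a contract mismatch.
class DaemonContractSkewError extends Error {}

type AdoptResult<C> =
  | { kind: "adopted"; client: C }
  | { kind: "recycled"; client: C }
  | { kind: "degraded"; error: unknown };

// Hypothetical injection points; the spine knows nothing of PTYs.
interface AdoptDeps<C> {
  survivorIsLive(): boolean;
  connect(): Promise<C>; // throws DaemonContractSkewError on skew
  recycle(): Promise<C>; // kill (if any) + spawn + connect
  retries?: number; // extra connect attempts for non-skew failures
}

async function adoptOrEnsure<C>(deps: AdoptDeps<C>): Promise<AdoptResult<C>> {
  // Absent or dead survivor: nothing to adopt, so recycle.
  if (!deps.survivorIsLive()) {
    return { kind: "recycled", client: await deps.recycle() };
  }
  const attempts = (deps.retries ?? 2) + 1;
  let lastError: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return { kind: "adopted", client: await deps.connect() };
    } catch (err) {
      // Only proven incompatibility justifies killing a live survivor.
      if (err instanceof DaemonContractSkewError) {
        return { kind: "recycled", client: await deps.recycle() };
      }
      lastError = err; // transport dial / unreadable handshake: retry
    }
  }
  // Persistent non-skew failure: leave the survivor up and report degraded.
  return { kind: "degraded", error: lastError };
}
```

The design choice worth noticing is that the `degraded` branch never calls `recycle`, so a daemon that is merely unreachable keeps its live PTYs.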
Every PR also: names its spine symbol up front and ships it with its consumer (#1326’s spine doubled mid-flight); carries its Nix wiring day-1 — default.nix fileset + ci::flake-check + zero-kolu-* dep-allowlist, B3.3 asserting KAVAL_BUILD_ID bit-identical (no forced restart) and B3.4 the build-id-reaches-server CI gate (#1326’s “stranded nix” was a missing assertion, not missing wiring); and maps its slice of #1034 / #1275 / F1–F4 to a type- or test-level fact. No “follow-up / degrades gracefully / flagged for review.”
The F-hazards — the data-loss modes #1326’s gauntlet missed, each now a falsifiable type/test fact: F1 an empty capture preserves the saved session (no empty→null erase before the kill); F2 killHalfWiredPty reaps a half-wired PTY (the shared reap receptacle B3.1 carved, B3.3 consumes); F3 the warming-window guard refuses terminal creates while the daemon comes up; F4 holdRestarting recovers honest state on a capture/drain failure. B3.2 killed F1 · F3 · F4; B3.3 owns only F2.
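The B3.2 row of the upstream table names serializeRestart as the piece that coalesces concurrent triggers. A minimal sketch of that coalescing, assuming the recycle is an async function and the wrapper is a plain closure (the real signature is not given in this plan):

```typescript
// Wrap an async recycle so that overlapping triggers (a double-click, two
// clients) share one in-flight run instead of starting a second one.
function serializeRestart<T>(recycle: () => Promise<T>): () => Promise<T> {
  let inFlight: Promise<T> | null = null;
  return () => {
    if (inFlight !== null) return inFlight;
    const started = recycle().finally(() => {
      // Clear the slot on success or failure so a later trigger runs afresh.
      inFlight = null;
    });
    inFlight = started;
    return started;
  };
}
```

Callers that arrive mid-restart receive the same promise, so they observe the same outcome, including a failure.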
B3.4 — currency nudge shipped · #1353
Goal: an amber ⬆ update pending affordance on the rail’s kaval column when the adopted daemon is a build behind the kaval the freshly-deployed server would spawn — one click fires B3.2’s session-preserving restart to pick it up. This is the last B3 PR, and it lights the one rail state A2 wired divergence-capable but could never fire: in A2 the pty column couldn’t diverge from itself in-process, and under B2’s always-recycle a survivor older than one boot was precluded. The nudge is reachable only because B3.3 adoption replaced always-recycle — a wire-compatible survivor is now adopted and kept alive, so it can be a build behind. It is the deliberate opposite of B3.3’s contract-skew recycle: skew is forced (no choice), currency is “you don’t have to, but a restart would gain the new kaval.”
The comparison — two already-baked nix facts the code names but never compares (buildId.ts: “a read-site derivation (staleKey !== currentBuildId()) that phase B adds”):
| Operand | What | Where it is today |
|---|---|---|
| reported | the adopted daemon’s own staleKey | already on the wire — daemonStatus.identity.staleKey, already rendered shortId-form in the rail’s kaval column |
| expected | the server’s kaval build-id — the build it would spawn | process.env.KAVAL_BUILD_ID in the kolu-server process (one ${kavalBuildId} nix --sets onto both the koluBin wrapper and the kaval bin), read via kaval’s currentBuildId() — no server code reads it yet |
expected !== reported, derived at the read site (the rail), never stored, never frozen onto a connection. Because kavalBuildId is a content-hash of kaval’s daemon source closure only (the three roots kaval · terminal-protocol · surface-daemon, kept == the daemon’s reachable closure by buildId.closure.test.ts), a server-/client-only deploy leaves it bit-identical and the nudge stays silent — the #1034 over-prompting fix, by construction.
Deliverables:
- Surface the server’s expected kaval build-id. Add a server read of its own KAVAL_BUILD_ID (kaval’s currentBuildId(), in-process — the koluBin wrapper bakes it, so no daemon round-trip) and emit it as an additive optional expectedKaval field — the build the server would spawn — on the surface-app buildInfo cell. It lives on KoluBuildInfo (kolu’s extension of surface-app’s base BuildInfo), so it changes no @kolu/surface-app API drishti consumes — no surface-mirror PR. The reported operand is untouched by this: it already rides daemonStatus.identity (a per-host daemon fact); expectedKaval is the one server fact (per deploy), so the two read distinctly and the join generalizes to R-2’s N hosts (one expected, N reported) for free. (While here, retire the now-unread buildInfo.ptyHost relay — its server patch (server/src/surface.ts) and schema field — a dead duplicate of daemonStatus.identity the rail already abandoned; an in-scope cleanup, not load-bearing for the nudge.) (Considered and rejected: a server-folded updatePending boolean mirroring B3.3’s adopted — it puts the comparison server-side, but the plan mandates a read-site derivation and the existing client-vs-server clientStale overlay is the precedent.)
- The read-site derivation — a pure kavalStale(expected, reported, state): surface-app’s clientIsStale pure-fn + clean-ref-guard shape, plus a state === "connected" gate (clientIsStale has no state gate — client-vs-server is always comparable; a daemon’s identity is present only once connected). Fires only when state === "connected" and both ids are non-empty and they differ. Keys on staleKey alone — never navigableCommit (the git ref moves every deploy: #1034 in a new costume). It is orthogonal to DaemonState — not a new row in DAEMON_STATE_PRESENTATION (a build-behind daemon is honestly connected/ok) but a second axis, exactly like the client column’s clientStale overlay.
- The rail nudge. An amber ⬆ update pending affordance on the kaval column (IdentityRail.tsx), mirroring the client column’s <Show when={stale()}><StaleBadge/></Show> (the warning token — text-warning/border-warning); one click → restartDaemon() (B3.2’s existing action), disabled while restartInFlight and riding B3.2’s already-shipped inline confirm (RestartKavalButton). KavalInfoDialog gains the matching “running X · expected Y · restart to update” line.
- The CI gate — “the build-id reaches the server”, two tiers:
  - Unit (rides pnpm test:unit): the server’s expected-build-id read echoes the baked env, including the off-nix "" case (parallel to buildId.test.ts); and kavalStale’s truth table — equal→silent, differ→nudge, either-empty→silent (the off-nix "" guard, the isCleanRef/DEV_COMMIT analog), non-connected→silent.
  - End-to-end (rides ci::home-manager, the adoption-test lane — no new recipe): the existing adopt.nix positive path (a together-built redeploy that adopts) gains an assertion of no update-pending — expected == reported, the no-op-deploy-no-nudge proof, the exact failure mode #1034 was; and a new build-skew check (a sibling of skew.nix) deploys a koluNew whose kaval staleKey differs from the surviving old daemon’s but whose contract version does not → the survivor is adopted (not recycled) → update-pending → restart → recovery on the new build. Two execution notes the sibling must not inherit blindly from skew.nix: (1) the skew seam is not contractVersionOverride’s postPatch sed of a source constant — KAVAL_BUILD_ID is a nix-injected value, so a kavalBuildIdOverride arg substitutes koluNew’s ${kavalBuildId} let-binding (default null = the real hash, so prod is untouched), making koluNew a coherent newer build whose expected==its-own-spawn while the default-built old survivor reports the real hash — a genuine build-behind skew with no source diff; (2) the verify predicate inverts skew.nix’s — assert the gate pid is unchanged (adopted, not recycled) and update-pending is observed, where skew.nix asserts the gate pid changed. Observation channel (the headless VM has no browser, so it observes the derivation’s inputs, not the rendered chip): assert the survivor’s reported identity.staleKey against the server’s expected KAVAL_BUILD_ID — read both off the surface RPC the adoption tests already drive (buildInfo/daemonStatus) or directly (the kolu unit’s env vs kaval-tui); the build-skew case asserts they differ, then converge after the restart. Each green-on-correct, red-under-mutation, per the adoption tests’ discipline.
Traps:
- #1034 over-prompting is the whole hazard: key on the closure-hash staleKey, never the per-deploy navigableCommit/commit; buildId.closure.test.ts keeps staleKey from moving on out-of-package changes, but B3.4 adds the direct falsifiable proof (server-only change → bit-identical → silent) the plan demands — #1326’s “stranded nix” was a missing assertion, not missing wiring.
- The off-nix "" guard is mandatory: dev/test bake no KAVAL_BUILD_ID, so both ids are "" — never "" !== "hash" firing in a half-baked env, never "" !== "" meaning anything.
- Correct the now-stale rationale. common/src/surface.ts’s build-identity comment still says kaval’s identity is display-only and “folding kaval’s commit into staleness buys little today” because always-recycle “precludes a kaval skew older than one boot” — B3.3 adoption invalidated that premise. B3.4 rewrites it: kaval’s staleKey is a staleness input now, but as a separate “update pending” nudge on the kaval column — not folded into the client-vs-server isStale/≠ srv signal (which stays the clean-ref commit comparison). Left stale, the new derivation reads as dead-by-design. (While editing IdentityRail.tsx, fix its JSDoc too — it still says the column reads buildInfo.ptyHost, but the code reads daemonStatus.identity.)
- Orthogonal overlay, not a state: no “update-pending” row in DAEMON_STATE_PRESENTATION and no wire endpoint-state the supervisor never emits — a build-behind daemon is connected. Source the reported staleKey from daemonStatus.identity, not the dead buildInfo.ptyHost axis.
- Additive-optional, no contract bump: the expectedKaval field is server↔client on koluSurface and must be optional — it does not touch PTY_HOST_CONTRACT_VERSION (the daemon↔server contract), so it force-restarts no surviving daemon (the very hazard B3.3 closed).
- Spine: none. B3.4 adds zero supervisor mechanism — it reuses B3.2’s daemon.restart and B3.3’s adopted survivor unchanged; the comparison is kolu’s soul (build/session policy), the spine/soul line every B3 PR draws. (The daemon never receives KAVAL_BUILD_ID via spawn env — localDriver deliberately doesn’t forward it; the daemon reads its own wrapper, the server its koluBin wrapper, both --set from one ${kavalBuildId}. That shared nix value is what makes a together-built deploy’s expected == reported, and what the CI gate pins.)
Verify: the unit truth table + the read-site echo; the over-prompting falsifiable test (server-only change → no nudge); the VM tier — adopt.nix asserts no nudge on a no-op redeploy, the build-skew check asserts the nudge fires and a confirmed restart recovers on the new build, each red under a deliberate mutation. The staged-prod gate below — a redeploy that flips kaval’s staleKey → amber nudge → confirmed restart → recovery honest — is B3.4’s acceptance signal; CI structurally cannot redeploy over a live daemon. Dep: B3.2 (the restart it fires) · B3.3 (reachability — adoption is what keeps a build-behind daemon alive to nudge about).
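The read-site derivation and its truth table can be sketched as one pure function. The `DaemonState` union below is an assumption pieced together from the states this plan names (the real enum derives from ENDPOINT_STATES); the function name and argument order follow the deliverable.

```typescript
// Assumed state names; the shipped enum lives in the supervisor package.
type DaemonState = "connecting" | "connected" | "restarting" | "degraded" | "dead";

// True only when the daemon is connected, both build-ids are known, and they
// differ. Keys on the closure-hash staleKey, never on a per-deploy commit.
function kavalStale(
  expected: string | undefined,
  reported: string | undefined,
  state: DaemonState,
): boolean {
  if (state !== "connected") return false; // identity is only present once connected
  if (!expected || !reported) return false; // off-nix "" (or missing) never nudges
  return expected !== reported;
}

// The truth table from the unit tier, as data a test could iterate over.
const truthTable: Array<[string, string, DaemonState, boolean]> = [
  ["abc", "abc", "connected", false], // equal: silent
  ["abc", "def", "connected", true], // differ: nudge
  ["", "def", "connected", false], // either empty: silent
  ["abc", "", "connected", false],
  ["", "", "connected", false],
  ["abc", "def", "degraded", false], // non-connected: silent
];
```

Because the result is derived at the read site, nothing is stored on the connection, and a server-only deploy that leaves both ids bit-identical stays silent.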
The gate — the second deploy (not a PR)
There is one kind of deploy — you ship the whole kolu closure and the server restarts; the daemon’s fate is keyed only on whether that deploy moved kaval’s staleKey. CI structurally cannot exercise “redeploy over a live daemon”; the acceptance signal is a staged prod checklist, a planned step: deploy B3 → open terminals + a running agent + a split → redeploy with kaval’s staleKey unchanged (no nudge; survivors adopted; the reattach: counts reconcile; the split still nested; the agent still running) → redeploy that flips kaval’s staleKey (amber nudge → confirmed restart → recovery honest; a deliberately failed respawn leaves the restorable session).
The #1034 postmortem — production failure and hard constraints
The first R4c-UI build (#1034) shipped the build-mismatch “update pending” nudge + a Restart local PTY daemon command. On production (Linux/systemd, 2026-05-30) the nudge fired correctly on a deploy; clicking restart then destroyed a live 20-terminal session and could not bring the daemon back. From the journal: killAll drained all 20 terminals first; the old daemon (13.5h old, 25G RAM on a thrashing box) took ~2min to exit; the respawn timed out at 30s; the user was left with zero terminals, a dead daemon handle, and an empty canvas indistinguishable from “you have no terminals.” No second daemon ever ran — the failure was the respawn losing the race against the slow old-daemon exit.
Hard constraints (binding on the redo)
- Restart must be recoverable, never kill-then-pray. Snapshot the session first; after respawn, auto-reattach/offer restore; if respawn fails, a loud recoverable degraded state with the session preserved — never an empty canvas.
- Wait for the old daemon to fully exit — the single-instance lock fights the restart. Wait on actual process exit (kill(pid,0) → ESRCH) before spawning, with generous load-aware ceilings.
- Timeouts must fit a loaded production box, not an idle dev one (20 heavy PTYs + a tsx cold-start under swap ≫ 30s).
- Never lie about state — an explicit connecting/degraded UI, never silent emptiness.
- A persistent PTY status indicator, always visible — connected · connecting · degraded/dead · update-pending (became A2’s rail).
- Key staleness on pty-host, not the whole binary — hash the @kolu/pty-host source closure so the nudge fires only when a restart actually gains something (became A2’s staleKey; B1 re-roots the closure at the daemon entry).
- Back up the session before a destructive restart (Export/Import session shipped via #1046).
- Fix the preferences storm ( #1041 , fixed in #1050 — coalesced writes).
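The "wait on actual process exit" constraint can be sketched with Node's signal-0 probe. This is a hedged illustration, not the package's waitForPidGone: the option names, the injectable `probe`, and the boolean return are assumptions made so the sketch is testable; only the kill(pid,0) → ESRCH check comes from the constraint.

```typescript
type PidProbe = (pid: number) => boolean; // true while the process still exists

// Signal 0 performs the existence/permission check without delivering a signal.
function isPidAlive(pid: number): boolean {
  try {
    process.kill(pid, 0);
    return true;
  } catch (err) {
    const code = (err as { code?: string }).code;
    if (code === "ESRCH") return false; // no such process: it has exited
    if (code === "EPERM") return true; // exists, but owned by another user
    throw err;
  }
}

// Poll until the pid is gone or the ceiling passes.
// Resolves true when the process exited, false on timeout.
async function waitForPidGone(
  pid: number,
  opts: { timeoutMs: number; intervalMs?: number; probe?: PidProbe },
): Promise<boolean> {
  const { timeoutMs, intervalMs = 100, probe = isPidAlive } = opts;
  const deadline = Date.now() + timeoutMs;
  for (;;) {
    if (!probe(pid)) return true;
    if (Date.now() >= deadline) return false;
    await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
  }
}
```

A caller would pick `timeoutMs` for a loaded production box (the postmortem saw a roughly two-minute exit), and treat a `false` result as a degraded state to surface rather than a cue to spawn anyway.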
Execution lessons (process, not product)
From building + salvaging #1034, re-confirmed during #1275: typecheck every package before commit (a clean nix build is not proof CI passes, #1049); a new workspace package has a checklist (default.nix explicit fileset, staleKey-closure decision, lockfile, vitest wiring) — the fileset miss broke CI once already; never git stash in the worktree; grep for unresolved conflict markers before committing a merge; never defer with a someday-issue — gaps go into the plan and get done in the PR; when the terminal contradicts itself, confirm via an independent path before “fixing” a phantom. Wire-shape rules: additive wire fields are optional + no contract bump (a required field force-restarts a surviving older daemon just to add a diagnostic); the identity surfaced to users is the navigable one (commit SHA), with the source hash kept as the staleness key only. And the #1275-specific one: CI-green is necessary, not sufficient — the second deploy is the acceptance signal; budget the staged-prod loop as a planned step with the diagnostic logging built in advance.
Design notes that carry forward
The survivor is kaval only — node-pty fds + the @xterm/headless mirror + the raw VT taps + a unix socket + its own process entry. A kolu-server restart re-runs the providers against the surviving PTYs, so detection is never stale while the PTYs persist. Honest cost (accepted): metadata is no longer “warm” across a restart — a brief re-detection pass, trading warm-on-reconnect for freshness-on-deploy. Only a pty-host contract change (rare) forces terminal loss; a provider change (frequent) restarts the cheap layer with PTYs untouched. This corrects #1031, which daemonized a survivor that held the providers.
The cgroup mechanism (spike-verified). On Linux/systemd the daemon spawns via systemd-run --user (gated on INVOCATION_ID), landing in its own transient cgroup — a plain detached/setsid child does not survive on cgroup-v2 (KillMode=control-group walks cgroup membership, not the session — the #1031 Linux failure). macOS’s detached spawn already survives launchd. Caveats folded into B2: linger must be on; absolute daemon path (minimal unit PATH); per-spawn unique unit names (a dead unit can linger loaded). Single-instance two ways: the unit name plus the atomic pid-gate.
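The caveats above (absolute daemon path, per-spawn unique unit names, values passed with --setenv) can be sketched as a pure argv planner. Everything about the shape is hypothetical: `planSurvivableSpawn`, its input fields, and the random suffix scheme are illustrative, and the caller is assumed to do the actual spawn. Only the `systemd-run --user`, `--unit=`, and `--setenv=` flags are taken as given.

```typescript
import { randomBytes } from "node:crypto";
import { isAbsolute } from "node:path";

interface SpawnPlan {
  command: string;
  args: string[];
  detached: boolean; // true when the caller should use a detached child instead
}

// Hypothetical parameter bundle (the "soul" values a driver would supply).
interface SurvivableSpawnInput {
  binary: string; // must be absolute: the transient unit has a minimal PATH
  args: string[];
  env: Record<string, string>;
  unitPrefix: string;
  useSystemd: boolean; // the outcome of the spawn-strategy gate
  suffix?: string; // injectable for tests; random otherwise
}

function planSurvivableSpawn(input: SurvivableSpawnInput): SpawnPlan {
  if (!isAbsolute(input.binary)) {
    throw new Error(`daemon path must be absolute: ${input.binary}`);
  }
  if (!input.useSystemd) {
    // Detached path: the caller passes env to its own spawn call.
    return { command: input.binary, args: [...input.args], detached: true };
  }
  // A fresh unit name per spawn, since a dead unit can linger loaded.
  const unit = `${input.unitPrefix}-${input.suffix ?? randomBytes(4).toString("hex")}`;
  const setenv = Object.entries(input.env).map(([k, v]) => `--setenv=${k}=${v}`);
  return {
    command: "systemd-run",
    args: ["--user", `--unit=${unit}`, ...setenv, "--", input.binary, ...input.args],
    detached: false,
  };
}
```

Keeping the planner pure means the unit-name and path rules can be unit-tested without a systemd session.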
tmux/dtach considered and rejected: they only keep a PTY alive — no OSC-parsed taps, no headless snapshot for lazy-attach, no home for the provider DAG. You’d still build pty-host’s streaming layer next to tmux and inherit its session model on top.
History
- R4a #1023 + R4b #1028 shipped clean as pure in-process refactors.
- Two prototypes seeded the design: #994 (remote) proved “providers in the agent”, re-provisioned fresh per version; #1010 was a PTY-only local daemon — the thin-survivor model. The redo is their correct combination.
- R4c first attempt #1031 dropped after a macOS prod failure: it daemonized pty-host plus the provider DAG, so the survivor served stale detection on every deploy — and staleness keyed on the inert pkgVersion constant, so outdated was always false and a 20-hour-old daemon was reused forever.
- R4c redo #1034 — the postmortem above; shipped + salvaged, then discarded by the fresh approach.
- The fresh-approach redo (2026-05-30/31): #1055 (A1) landed the in-process foundation and removed every #1034 daemon artifact; #1059 unified the link family; #1063 (A2) landed identity + the closure-hash guard + the rail. kolu-tui Phases 0–2 ( #1073 / #1084 / #1255 ) then battle-tested the socket transport in production.
- Phase B first build #1275 (2026-06-11) — the whole feature as one 40-commit PR: functional, verified live across staged prod deploys, and discarded. What it proved: the user-facing spec works end-to-end. What it taught (each now a row in the impossible-by-construction table): the @kolu/pty-host-daemon package was extracted mid-PR after the review gauntlet; reattach correctness arrived as three sequential prod fixes (parentId, contract-on-adopt, lastAgentCommand); identity freshness was patched three times; the staleKey hash needed three lockstep file-level exception sites. The redo above designs each of those in.
- This plan, first cut (2026-06-11): the three-PR redo — B1 (daemon binary, inert) → B2 (the door, survival off) → B3 (survival) → the staged-prod gate.
- The kaval reframing (2026-06-12): the daemon is kaval (Tamil kāval — watch, guard), a standalone program in the drishti/odu graduate tradition, with kolu as its first client — dumb-but-durable kaval ← kolu the session brain ← kolu the web UI. A coupling audit found four real ties to kolu: no bin entry, spawn policy baked daemon-side (
cleanEnv/koluIdentityEnv/prepareShellInit), the hardcoded socket app-name, and kolu’s build-id env vars. The split becomes four PRs: B0 inverts spawn policy onto a fully-specified wire (spawn {argv, env, initFiles}+system.infohost facts) while everything is still in-process — the contract change made at the one moment it is free; B1 ships the renamed binary + kaval-tui; B2/B3 unchanged in substance. The inversion also deletes B2’s env-forwarding hazard class by construction and makes R-2’s remote host asurface-nix-hostdeployment of the same kaval closure. Plan PR: #1291 . - B0 shipped
#1292 (2026-06-12): the inversion landed as briefed — contract 3.0, all policy recomposed server-side (
buildTerminalSpawnInputcomposescleanEnv· identity env · a now-pureprepareShellInitagainstsystem.info), zerokolu-*workspace deps with the closure test’s allowlist as the permanent guard, parity tests in place of a golden file. The gauntlet’s one real find: init-file cleanup on partial-write and spawn failure. One brief-vs-shipped delta worth recording:system.infois read once per process and cached, not per connection — revisit when B2 makes connections real. - The spine named
#1294 (2026-06-12): surface-daemon deduplicates kaval B1/B2’s lifecycle machinery against
odu serve’s identical needs, before either hand-rolls a second copy. Phase split unchanged — the spine is born in B1/B2 as planned; what changed is layout discipline (the spine/soul constraints now in the B1/B2 traps) and the extraction’s destination + trigger (post-B, S1, into@kolu/surface(-daemon)). - B1 shipped
#1301 (2026-06-12): the rename landed as briefed —
pty-host→kaval(withbin.ts+ a ~20-linedaemonMain.tscomposition),pty-tui→kaval-tui(default socket appkaval,--pty-host-socket→--socket), identity envKAVAL_BUILD_ID/KAVAL_COMMIT_HASH.@kolu/surface-daemonborn as the daemon half (acquirePidGate+ the gate’sgatePid/isHolderLivefile-format primitives + thedaemonMainskeleton), hashed as a third staleKey root; the closure test re-rooted at two entries (index.ts+bin.ts). Full e2e: the contract corpus over both links, a coverage ledger walkingptyHostSurface’s keys, and the real-process daemon scenarios. A live smoke confirmed the standalone pair (kaval + kaval-tui, kolu nowhere) before CI. Review gauntlet: lens consensus (Logger dedup into the spine, the supervisor readerreadPidGateremoved in favour of daemon primitives the supervisor composes,daemonExitCodeonto the spine type,ptyHostSrc→kavalSrccross-refs); codex (xhigh) found 6 — a blocking Home-Manager-VM-smoke miss, a gate-dir-privacy check before honouring a pid, the shipped wrapper switched tonode --import tsxso SIGTERM reaches the daemon, plustry/finallyteardown and rename sweeps. - B1 re-scoped (2026-06-12, same day, on plan review): two additions.
@kolu/surface-daemonis created in B1 itself, daemon half only (the gate + thedaemonMainskeleton) — for review isolation (the mechanism reviewed once, as a package; the rest of B1 stays renames + a ~20-line composition) and one home for the gate’s file format; the supervisor half follows at S1, kept out until then by the whole-package staleKey invariant. And B1 gains full e2e coverage: the contract corpus instantiated over a real spawned daemon’s socket, a coverage ledger walking the contract’s keys, and the daemon-only lifecycle scenarios (gate races with real processes, SIGTERM/SIGKILL honesty, initFiles across the process boundary, kaval-tui against the real daemon). - S1 packaging decided (2026-06-12, on review of this plan): the supervisor half extracts at S1 into a separate
@kolu/surface-daemon-supervisorpackage, not a/supervisorsubpath of the daemon package. The daemon and supervisor halves are on different volatility axes by construction (the staleKey must not move on a supervisor edit), so the package boundary is the honest module boundary — and it is the hash boundary, with nofileFiltersubdir glob to mis-scope the key. Shared types cross a one-directionalworkspace:*edge. The rejected “matches@kolu/surface’s server/client split” and “graduates as a unit” arguments don’t hold (the first is circular; surface-daemon is spine that stays in-repo). Full reasoning in surface-daemon. - B2 re-scoped (2026-06-12, on B2 planning):
@kolu/surface-daemon-supervisoris born in B2 itself —endpoint.ts/waitForPidGone.ts/restart.tsstart life in the package; onlylocalDriver.ts(soul) and the composition stay inpackages/server. The S1 deferral failed a three-front audit: “wait for the API to settle” is a backwards-compat argument with no compat obligation behind it (one downstream, edited in the same PR); “wait for the second consumer” lost to the precedent that birthed the daemon package in B1 with one consumer; and the quantified risk is ~five scaffolding files plus onedefault.nixfileset line, zero hash surface (the package is deliberately un-hashed). Benefit: the spine/soul trap is enforced by the package boundary instead of reviewer vigilance — the #1275 mid-PR-extraction lesson applied in advance. S1 shrinks to one move: the handshake fragment into@kolu/surface. The same audit then redrew the driver line: the survivable-spawn mechanism (theINVOCATION_IDgate, the systemd-run/detached fork, unique-unit-name discipline) is host-platform volatility, not kolu volatility, so it ships in the package as the defaultDaemonDriver(survivableSpawnDriver) —localDriver.tsshrinks to kaval’s parameter bundle (binary · dev-flag filter ·--setenvvalues · paths · unit prefix), further cutting B2’spackages/serverdiff surface. - B2 shipped
#1310 (2026-06-12): the topology flip landed as planned — kolu-server stops running the pty-host in-process and is now a client of a spawned kaval daemon over its unix socket (a stable
makeForwardingClientProxy over the live endpoint keepsLocalTerminalBackenduntouched),@kolu/surface-daemon-supervisorwas born in B2 (generic over<C, I>, zerokolu-*deps, not a staleKey root), and the honest degraded state surfaced. Three things implementation/the gauntlet added beyond the brief: (1) per-instance socket isolation —KOLU_KAVAL_SOCKEToverrides the app-name-derived path so parallel e2e workers (and two checkouts on one box) never share a daemon (codex caught the missing isolation as blocking) — this B2-shipped env-override form was later superseded: #1313 made per-port keying the default and demotedKOLU_KAVAL_SOCKETto a whole-path override (see History below); (2) the spawn-strategy gate isfromSource, not rawINVOCATION_ID— every systemd-session shell setsINVOCATION_ID, so dev/e2e (whereKOLU_KAVAL_BINis unset) force detached spawn, and only production with the wrapped binary takessystemd-run --user; (3)ENDPOINT_STATESlives in a zero-dependpointStates.tsleaf re-exported by the package, so browser-sharedcommon/surface.tsderives theDaemonStatusSchemaenum without pulling node deps. Also shipped: a live KAVAL rail column (dot · uptime · the daemon’s commit/hash, read from the server-internaldaemonStatuscollection keyed by hostId — not the pre-B2 in-processbuildInfo.ptyHostaxis) that opens an info dialog showing identity + how tokaval-tui list/attach/snapshot, and the kolu version bump 1.0.0 → 1.1.0 (kaval is a major change; next release 1.2.0). Lesson logged (e2e): a longXDG_RUNTIME_DIRscratch path made the bracketed-paste clipboard test flake on linux — the wrapped path triggered bash 5’s active-region redraw (\x1b[7m/\x1b[27mper char), garbling the buffer read; the fix is a short per-worker runtime dir, not a paste-timing tweak. CI: 26/26 nodes green both platforms. - Post-B2 follow-ups
#1313 ·
#1323 ·
#1320 (2026-06-12): three PRs landed on master after B2, all material to B3’s substrate.
#1313 fixed a production incident where a second kolu-server recycled the first’s daemon — both resolved kaval to one shared per-user socket (
$XDG_RUNTIME_DIR/kaval/pty-host.sock), and the always-recycle boot policy made a newcomer SIGTERM the incumbent. The fix namespaces kaval’s socket+gate per kolu-server instance by listen port (kaval-<port>/), so two servers can’t share a gate by construction; the barekaval/namespace stays for a standalone daemon, and a flag-lesskaval-tuinowdiscoverPtyHostSockets()the running daemon (there is no longer one well-known path). #1323 hardened that discovery after a retrospective/be-reviewfound a real security gap: on the shared/tmpfallback another local user could plant a name-spoofed socket and be dialed. The connecting side now re-checks the serving-side boundary — owner-only namespace dir (lstat: real dir,uid===getuid(), no group/other bits) and a real socket inode (lstat().isSocket(), symlinks rejected) — and derives the path grammar fromgetRuntimeSocketPath’s forward output so discovery can’t drift from construction. #1320 dropped thekaval-build-idIFD (the build-id is now pure Nix —builtins.hashString "sha256" "${kavalSrc}", norunCommand+readFile), fixing the eval-blockingnix flake checkfailure ( #1317 ) and adding aci::flake-checknode; the runtime still just readsKAVAL_BUILD_IDand the closure test still pins the hashed file set, so the staleKey guarantee is unchanged. - /tui → /kaval
#1314 (2026-06-12): the website’s standalone-pair page moved from
/tui (tui.astro) to a kaval-centric /kaval (kaval.astro) with a logo and the Tamil etymology.
- B3.2 shipped
#1337 (2026-06-13): the supervised, session-preserving restart — B2’s deferred “Restart kaval” button, finished. Spine (
@kolu/surface-daemon-supervisor): a transient restarting state; endpoint.holdRestarting (the emit-guard: it coerces the recycle’s two transient transitions, the old connection’s degraded close and the fresh daemon’s connecting, to one honest restarting, while the terminal states connected/dead pass through to end the hold; and on a capture/drain failure before the recycle it recovers the honest current state so the rail can’t stick at restarting); and serializeRestart, which coalesces concurrent triggers (a double-click, two clients) onto one in-flight recycle. Soul (packages/server): setSavedSessionFromSnapshot (the F1 receptacle, landed here beside its consumer per B3.1’s deferral), with an unconditional cancelPendingAutosave() before persisting, so the capture survives the kill even when the session cell dedups the write and skips its own onWrite cancel; ptyHost/restartLocal.ts filling the composed steps (capture = snapshot+persist before the kill, drain = killAllTerminals, reattach = a no-op: nothing survives a daemon kill, so restore is the existing card on the empty canvas, no autosave race); and the daemon.restart RPC. The #1326 swamp stayed gone: the restart-case non-survivor is restored on the empty canvas, never adopted, so the pendingRestoreCard/union cluster never appeared. UI: one useDaemonRestart action behind the Restart button on both the kaval rail dialog (running/degraded) and the DegradedCanvas (dead/degraded); the rail dot pulses restarting throughout, and a neutral “warming” canvas replaces the empty-state welcome while the daemon comes up. Tests: supervisor unit (one restarting across the recycle then connected, coalescing, dead-on-failed-recycle), a session-unit autosave-cancel race test, and an @kaval-restart e2e round-trip (kill → degraded → Restart → recover → restore card → restore → terminal back).
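The two spine behaviours described above (the emit-guard and the coalescing of concurrent triggers) can be sketched roughly as follows. This is a minimal illustration, not the shipped code: the state union, the helper name coerceWhileHeld, and both signatures are assumptions made for the example, and the real endpoint.holdRestarting also covers the capture/drain-failure recovery, which is omitted here.

```typescript
// Hypothetical state names mirroring the note's vocabulary; the real
// ENDPOINT_STATES enum may contain more or differently named members.
type EndpointState =
  | "connecting"
  | "connected"
  | "degraded"
  | "restarting"
  | "dead";

// Emit-guard sketch: while a restart is held, transient transitions collapse
// into "restarting"; the terminal states pass through and end the hold.
function coerceWhileHeld(
  held: boolean,
  next: EndpointState,
): { emit: EndpointState; held: boolean } {
  if (!held) return { emit: next, held: false };
  if (next === "connected" || next === "dead") {
    return { emit: next, held: false };
  }
  return { emit: "restarting", held: true };
}

// Coalescing sketch: concurrent triggers share one in-flight recycle promise.
function serializeRestart<T>(recycle: () => Promise<T>): () => Promise<T> {
  let inFlight: Promise<T> | null = null;
  return () => {
    if (inFlight !== null) return inFlight;
    const started = recycle().finally(() => {
      // Once the recycle settles, the next trigger starts a fresh one.
      inFlight = null;
    });
    inFlight = started;
    return started;
  };
}
```

In this sketch a double-click receives the same promise twice, so the recycle body runs once per in-flight window rather than once per trigger.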
Refined in review (codex xhigh, 3 rounds, each a real bug the gauntlet caught): (F1) an empty capture now preserves the existing saved session rather than clearing it: a restart from a dead boot (registry empty, but a prior run’s session on disk) would otherwise route the empty snapshot through saveSession’s empty→null and erase the only restore data before the recycle, the exact kill-then-pray data loss #1034 is about; (F3) a warming-window guard: useTerminalCrud.handleCreate/handleCreateSubTerminal refuse while the daemon is warming and the canvas shows the neutral warming surface (not EmptyState’s enabled Restore/new-terminal affordances), so a Cmd+T/restore in the post-drain window can’t spawn into the daemon the recycle is about to kill (or a momentarily-stale current connection); (F4) the holdRestarting capture/drain-failure recovery above. The lens debate deduped the daemon-state presentation into one DAEMON_STATE_PRESENTATION table + a shared restartInFlight/isWarming predicate; /simplify collapsed the four-way canvas gate to a <Switch>. CI 26/26 green on both platforms.
- B3.1 shipped
#1330 (2026-06-13): the first survival-chain PR, a pure byte-identical refactor.
ensure() split into liveServingHolder/killLiveHolder/spawnConnectHold (the gate is the existing endpoint.test.ts, 16 green); killHalfWiredPty lifted from spawnAndWire’s one inline catch (the F2 reap receptacle B3.3 shares); the inline {terminals, activeTerminalId} snapshot shape deduped into one exported SessionSnapshot type spanning the producer (snapshotSession) and both autosave consumers. Scope refined in review: the F1 receptacle moved B3.1 → B3.2. An early cut shipped setSavedSessionFromSnapshot here, ahead of any consumer; codex (xhigh) flagged a net-new exported helper with no production caller as future-scaffolding. The principle that settled it: the spine helpers and killHalfWiredPty are extractions of code that already runs (live consumers), so they are pure refactor; a consumerless guard is bare future-API, so it lands in B3.2 beside its restart-capture consumer, with a real autosave-cancel test. The lens debate (1 round) completed the SessionSnapshot dedup down to the producer; /simplify then inlined the single-caller mapper the cut had orphaned, leaving the nominal type as the session change’s only net addition. Every remaining change has a live consumer: a true byte-identical refactor. CI 30/30 green on both platforms.
- Plan refresh (2026-06-13, post-B3.2 currency pass, this PR): a read-through audit against the merged #1337 code corrected the bookkeeping B3.2 left slightly stale. The B3.2 hazard cell now reads F1 · F3 · F4 (was F4 · F5 · F6; F5/F6 were defined nowhere, and F3, the warming-window guard, was real but unlisted); the B3.2 soul row drops “the currency comparison vs the expected build” (that derivation ships in B3.4, not #1337); a one-line F1–F4 legend now defines the hazard family the note references; the closure-test line names both entry roots (
index.ts + the daemon bin.ts, matching B1); and B3.3’s adopt path is anchored to the real symbols (adoptTerminal beside the existing spawnAndWire, reconcile in terminalBackend). No B3.3 design change: the brief audited execution-ready. Companion notes resynced: the chrome-bar rail (now three columns srv · client · kaval since B2), the parent remote-terminals R-4 status, and surface-daemon’s supervisor-package contents (B3.2’s serializeRestart + restarting).
- B3.3 shipped
#1344 (2026-06-13): adoption — live PTYs (process · scrollback · running agent) survive a kolu redeploy that didn’t change kaval’s source. Spine (
@kolu/surface-daemon-supervisor): adoptOrEnsure() adopts a live, handshake-compatible survivor (connect, never kill), else recycles an absent / dead / skewed one (the deliberate opposite of spawnConnectHold’s skew→dead, which is a fresh spawn’s genuine boot failure). It returns whether it adopted, and spawnConnectHold’s connection-holding tail was factored into a shared holdConnection so an adopted daemon reports connected (with the survivor’s older startedAt, the uptime that did not reset) identically to a fresh spawn. A survivor connect is retried (bounded) before the endpoint concludes skew and recycles: a one-off transport/handshake-read hiccup against a healthy survivor must not cost it its live PTYs; only a connect that fails every attempt is treated as genuine skew. Soul (packages/server): reconcile.ts (a pure daemon.list() × savedSession → {adopt, adoptOrphans}; a saved terminal with no live PTY is an exited shell, in neither list, and is dropped exactly as handleExit drops a shell that exits with the server up); adoptTerminal, a sibling of spawnAndWire that wires an already-alive PTY from the whole record (adoptedMeta, never field-by-field) and re-runs the provider DAG, reaping a wiring failure through the shared killHalfWiredPty. The live daemon snapshot is the authority for cwd/foreground (kaval’s cwd/title taps don’t replay a snapshot on subscribe, so a cd while kolu-server was down would otherwise stick to the stale saved value and be re-persisted over the live truth). The boot reattach adopts every live PTY, never reaping: a live PTY with no saved record (a create that raced the 500 ms-debounced autosave, the common redeploy window) is adopted from the live snapshot (orphanMeta), since killing it merely because the debounced session lagged the daemon would break the headline survival guarantee; it carries no saved id, so re-adopting (not re-spawning) keeps #1275’s duplicate terminals impossible.
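As a rough illustration of the pure reconcile step described above, the following sketch partitions the live PTY list against the saved session. Only the {adopt, adoptOrphans} result shape comes from this note; the record types, the field names, and the choice of id as the matching key are assumptions for the example.

```typescript
// Hypothetical record shapes; the real daemon.list() entries and saved
// terminal records carry more fields than shown here.
interface LivePty {
  id: string;
  cwd: string;
}

interface SavedTerminal {
  id: string;
  cwd: string;
  label?: string;
}

interface Adoption {
  saved: SavedTerminal;
  live: LivePty;
}

// Pure function: no I/O, no daemon calls, so it can be unit-tested directly.
function reconcile(
  live: LivePty[],
  saved: SavedTerminal[],
): { adopt: Adoption[]; adoptOrphans: LivePty[] } {
  const savedById = new Map(saved.map((t) => [t.id, t] as const));
  const adopt: Adoption[] = [];
  const adoptOrphans: LivePty[] = [];
  for (const pty of live) {
    const record = savedById.get(pty.id);
    if (record) {
      // Carry the whole saved record forward, but let the live cwd win.
      adopt.push({ saved: { ...record, cwd: pty.cwd }, live: pty });
    } else {
      // Live PTY with no saved record: adopt it rather than reap it.
      adoptOrphans.push(pty);
    }
  }
  // Saved terminals with no live PTY appear in neither list (exited shells).
  return { adopt, adoptOrphans };
}
```

Spreading the whole record and overriding a single field is one way to keep a newly added persisted field from being silently dropped, which is the failure the schema-key round-trip test guards against.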
The boot reattach then converges the saved session to exactly the adopted set (exited shells drop, no stale restore card; an all-exited survivor clears it). A failure to list the survivor’s PTYs is a fail-closed condition: the boot recycles the adopted daemon rather than leaving it connected with PTYs kolu never registered (invisible live terminals behind a stale restore card). UI: a one-shot “N reattached” toast keyed on a new optional DaemonStatus.adopted (kolu’s soul, not the spine’s EndpointStatus; additive, no contract bump). #1275 closed by construction: whole-record adoption + a schema-key round-trip test (an exhaustive sentinel fails CI if a new persisted field is added but not threaded). The #1326 swamp stayed gone: an adopt-case non-survivor is a dropped exited shell, never restore-carded. No kaval-hashed root touched → KAVAL_BUILD_ID bit-identical, no forced restart. Tests: reconcile unit, the round-trip (incl. live-cwd-wins + orphan-adoption), and adoptOrEnsure integration (adopt / retry-then-adopt / recycle-on-skew / fresh) against a real spawned daemon; full redeploy-survival is the staged-prod gate (CI structurally can’t redeploy over a live daemon). Two brief-vs-shipped deltas: the boot calls adoptOrEnsure() directly (it returns whether it adopted) rather than threading through restart(), since the boot doesn’t kill and restart()’s kill-based shape didn’t fit; and reconcile returns {adopt, adoptOrphans} (the brief’s restoreCard was subsumed: the no-survivor / fresh-daemon path never reaches reconcile, so the existing client restore-card path handles it unchanged). CI gauntlet + both-platform CI: see the PR.
- Plan refresh (2026-06-14, post-B3.3 currency pass, this PR): with B3.3 on master, a read-through audit against the merged code made the one remaining PR execution-ready. B3.4 gained a full executable brief (§ B3.4 — currency nudge), matching B0–B2’s Goal/Deliverables/Traps/Verify depth; it was previously only a four-PR-table row + the stands-alone paragraph.
The audit grounded every primitive in the shipped code and resolved the one open design question the prior text left hand-wavy — where the comparison’s two operands meet: the reported id (
daemonStatus.identity.staleKey) already reaches the browser, but the expected id (the server’s own KAVAL_BUILD_ID, baked onto the koluBin wrapper by the same ${kavalBuildId} the kaval bin gets) is read by no server code today. The brief’s call: surface it as an additive optional expectedKaval on buildInfo, replacing the vestigial buildInfo.ptyHost relay (the rail already reads the daemon’s reported id from daemonStatus, never that axis), and derive expected !== reported at the read site. A server-folded updatePending boolean (mirroring B3.3’s adopted) was considered and rejected against the plan’s read-site mandate and the existing clientStale precedent. Three execution hazards the audit surfaced and the brief now pins: the nudge must key on staleKey, never navigableCommit (the git ref moves every deploy, which would be #1034 reborn); the off-nix "" build-id must guard silent on both sides (the isCleanRef/DEV_COMMIT analog); and common/src/surface.ts’s build-identity comment is now stale. It still calls kaval’s identity display-only because always-recycle “precludes a kaval skew older than one boot,” the premise B3.3 adoption removed, so B3.4 must rewrite it (kaval’s staleKey is a staleness input now, as its own nudge, not folded into the ≠ srv signal). The CI gate is specified two-tier: a unit truth-table + expected-id-echo, and an end-to-end VM tier riding ci::home-manager: adopt.nix asserting no nudge on a no-op redeploy (the #1034 failure mode) plus a build-skew sibling of skew.nix via a new test-only kavalBuildIdOverride nix seam (the nix-value analog of contractVersionOverride: a ${kavalBuildId} substitution, not a source sed, with the headless VM observing the comparison’s two operands, not a rendered chip). One correction folded in: the daemon gets its build-id from its own nix wrapper, not via localDriver’s spawn --setenv (an inaccuracy the B2 brief implied). Companion resync: the parent remote-terminals R-4 status (B3.3 moved on-master; B3.4 the lone remainder).
No B3.4 design change: the brief audited execution-ready.
- B3.4 shipped
#1353 (2026-06-14): the currency nudge — the final B3 PR, completing the survival chain (B3.1–B3.4). Built to the brief. Server surfaces
buildInfo.expectedKaval (its own currentBuildId(), the build it would spawn) and retires the never-read buildInfo.ptyHost relay; client derives kavalStale(expected, reported, state) at the read site (extracted to a pure kavalCurrency.ts so its truth table is unit-tested without mounting the daemonStatus subscription, following the canvasModeResolver precedent) and renders an amber ⬆ update chip on the kaval column. The stale common/surface.ts rationale + the IdentityRail JSDoc were corrected as the brief required. Three as-built deltas from the brief: (1) the rail chip is a passive indicator that routes the destructive recycle through the existing KavalInfoDialog (where RestartKavalButton’s inline confirm + the running-vs-expected detail live), because a cramped rail inline-confirm would blow up the glanceable strip, and the column is already a button (no nesting); (2) an adopt-time server currency diagnostic log (running=<X> expected=<Y>, raw values) is the VM gate’s observation channel (a headless guest can’t see the rendered chip) and a genuine operator breadcrumb (the comparison-with-guards stays solely in kavalStale; this logs facts); (3) currency.nix is the cheap skew check: kavalBuildIdOverride only rewrites the wrapper’s --set, so koluNew shares the kolu closure (no second full build, unlike skew.nix’s postPatch sed). The gauntlet (lens + codex consensus: no functional bugs; simplify clean; code-police one fact-check fix) + both-platform CI (28/28) + the pu-verified VM run (green-on-correct via ci::home-manager, red-under-mutation on a KVM box: a deliberate “server surfaces the wrong expected” bug made adoption-currency go red while the survivor was still adopted) all passed.
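A read-site derivation like the kavalStale function described above might look roughly like this. Only the argument order (expected, reported, state) and the empty-build-id guard come from this note; the state union and the rule that the nudge shows only while connected are assumptions made for the sketch, and the shipped truth table in kavalCurrency.ts may differ.

```typescript
// Hypothetical state names; the real daemon status enum may differ.
type DaemonState =
  | "connecting"
  | "connected"
  | "degraded"
  | "restarting"
  | "dead";

// Pure comparison of the build the server would spawn (expected) against the
// build the running daemon reports (reported). Both are staleKey-style ids,
// never git refs.
function kavalStale(
  expected: string | undefined,
  reported: string | undefined,
  state: DaemonState,
): boolean {
  // Assumption: only a connected daemon has a trustworthy reported id.
  if (state !== "connected") return false;
  // Off-nix builds carry an empty build-id; stay silent if either side lacks one.
  if (!expected || !reported) return false;
  return expected !== reported;
}
```

Because the function takes plain values, a truth table over it can be checked without mounting any subscription, which is the testing property the extraction was after.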
Dialog follow-ups (post-deploy refinements, same PR): the “update available” banner links the running + expected builds as clickable git commits (not the nix closure staleKeys, which aren’t GitHub-navigable) plus a path-scoped “what changed in kaval ↗” history link (commits/<expected>/packages/kaval; GitHub can’t path-filter a compare diff, but it can path-filter a commit history); and the dialog now shows the daemon’s unix socket path, surfaced as an additive optional DaemonStatus.socketPath (a server fact the client can’t construct, since it doesn’t know XDG_RUNTIME_DIR).
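The banner links described above could be assembled along these lines. This is a sketch under assumptions: the function name, its parameters, and the returned shape are hypothetical, and the repository base URL is passed in because this note does not state it. The commits/<expected>/packages/kaval path is the one named above.

```typescript
// Hypothetical helper: builds the three links the "update available" banner
// shows. repoUrl is the repository's GitHub base URL, supplied by the caller.
function kavalUpdateLinks(
  repoUrl: string,
  runningCommit: string,
  expectedCommit: string,
): { running: string; expected: string; history: string } {
  // Drop any trailing slashes so the joined paths stay well-formed.
  const base = repoUrl.replace(/\/+$/, "");
  return {
    running: `${base}/commit/${runningCommit}`,
    expected: `${base}/commit/${expectedCommit}`,
    // Path-scoped commit history: only commits touching packages/kaval.
    history: `${base}/commits/${expectedCommit}/packages/kaval`,
  };
}
```

The inputs here are git commits rather than staleKeys, matching the note's point that nix closure hashes are not navigable on GitHub.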