herdr vs. kolu — what to adopt

A study of ogulcancelik/herdr (cloned @ HEAD) read through kolu’s remote-terminals plan (#951 and the pty-daemon / kolu-tui / chrome-bar docs). Sibling to the Ghostex vs. kolu remote-terminals analysis. Every load-bearing claim below was fact-checked against both codebases: the herdr citations all verified, and three kolu-side claims were corrected (noted inline). Method: 13-agent workflow + adversarial critique.

Headline

herdr makes the same bet kolu chose: a first-party process owns every PTY, with thin clients attaching over a unix socket. That is the opposite of Ghostex (external mux owns PTY lifetime) — so herdr sits on kolu’s side of that line and is a shipped, battle-tested implementation of exactly the problems R-4 Phase B is about to build: survival across restart, transactional recovery, lazy snapshot-attach, even live fd handoff. Ghostex told us the seam is natural; herdr shows us the mechanism for the survivor we already decided to build. Two of kolu’s plans — kolu-tui and agents-orchestrate-kolu — are essentially herdr, already shipped. The highest-value output is two ideas with no current plan coverage: native agent-session resume, and multi-client resize arbitration (kolu has no arbiter today — resize is last-write-wins, so two differently-sized web clients already thrash; kolu-tui attach, now shipped, drives the PTY from its own terminal size and only sharpens the mismatch).

Both projects are AGPL-3.0-or-later, so herdr’s open-source code is license-compatible with kolu — porting is permitted under the AGPL’s terms, not blocked. The reason most recommendations are techniques and design rather than verbatim code is the stack gap (herdr is Rust; kolu is TypeScript/SolidJS), not licensing.

herdr is a ~107K-LOC Rust TUI agent multiplexer: one binary, a long-lived background server that owns every PTY (one ghostty VT emulator + one OS-thread PTY actor per pane), and thin clients that attach/detach over a unix socket. The hierarchy is workspaces → tabs → panes. An agent-awareness sidebar rolls each workspace up to its most urgent state (blocked / working / done / idle). A second socket exposes a JSON API so agents themselves can create panes, read output, and wait for state. It also ships named sessions, remote-over-SSH, 14+ agent integrations, and an experimental zero-downtime live handoff.

The architectural contrast

Figure: module correspondence. Solid edges mark a decision kolu already made that herdr validates, or a direct borrow. Dashed edges mark a gap or an explicit non-goal. herdr owns one long-lived server; kolu’s R-4 inverts the survivor to @kolu/pty-host while the provider DAG runs fresh in kolu-server.

Concern	herdr	kolu (built + planned)
Who owns PTY lifetime	First-party long-lived server owns every master fd; clients are stateless front-ends.	Same bet. R-4 makes `@kolu/pty-host` the thin survivor; the volatile provider DAG runs fresh in kolu-server.
Survive restart	Server outlives clients; full restart restores from a snapshot; `resume_agents_on_restore` respawns agents.	kolu-tui = client detach/reattach; Phase B = daemon survives `systemctl restart` via cgroup-escape + reattach-by-id. The #1034 hazard lives here.
Recovery on owner restart	Transactional: old owner stays alive and re-binds sockets until the new one acks; one bool gates who may signal children; injected-failure tested.	Phase B’s composed `captureSession → drainTerminals → respawn → finalize` with `waitForPidGone` — designed to never repeat the “kill-then-pray” loss.
Late / lazy attach	A live screen snapshot, never a byte replay: reset baseline → re-render the live emulator into one full frame.	Same: `ptyHost.ts` subscribes then serializes a `snapshot \| delta` union (~4KB).
Renderer	Server diffs a cell-grid → ANSI. No web terminal.	Raw VT → `xterm.js` in the browser; the headless mirror is for snapshot + taps only.

Architecture — what to adopt

A1 · Transactional handoff discipline → Phase B recovery

[high · direct] herdr’s restart is choreographed: the old owner stays alive and re-binds its sockets until the new one acks owned, and exactly one boolean (preserve_processes_on_drop) decides who may signal children, so a failed migration is a structural no-op and double-kill is impossible. This is tested with injected import failures. (herdr src/pane.rs:555 (flag), :689-702 (Drop); server/headless.rs:734-828; tests/live_handoff.rs:1279,:1359,:1442.)

Borrow the discipline, not the mechanism: make “who may kill these PTYs” a single structural flag (kolu’s killAndUnwatch ordering is already that shape) and write the injected-respawn-failure test RED first. One adaptation: herdr’s rollback resurrects the old process; kolu’s forced path kills the daemon, so kolu’s rollback analog is restore-from-snapshot (setSavedSession winning the autosave-cancel race), not resurrect.
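The single-flag discipline can be sketched in a few lines. This is a minimal illustration under stated assumptions, not kolu’s or herdr’s code: PtyOwner, handoff, and importIntoSuccessor are hypothetical names, and the import step is modeled as a synchronous call that throws on failure.

```typescript
// Sketch only: PtyOwner, handoff, and importIntoSuccessor are hypothetical names,
// not kolu's or herdr's actual API.
type Pid = number;

class PtyOwner {
  readonly name: string;
  private readonly children: Pid[];
  // The single structural flag: only an owner holding it may signal children.
  private mayKill = true;

  constructor(name: string, children: Pid[]) {
    this.name = name;
    this.children = children.slice();
  }

  /** Give up kill rights. Called only after the successor has acked ownership. */
  release(): void {
    this.mayKill = false;
  }

  /** Teardown helper: the pids this owner may signal (none after release). */
  pidsToKillOnDispose(): Pid[] {
    return this.mayKill ? this.children.slice() : [];
  }
}

/** Hand PTYs to a successor. If the import throws, the old owner keeps the flag. */
function handoff(
  old: PtyOwner,
  pids: Pid[],
  importIntoSuccessor: (pids: Pid[]) => PtyOwner, // assumed to throw on failure
): PtyOwner {
  try {
    const next = importIntoSuccessor(pids);
    old.release(); // flip the flag only after the successor exists
    return next;
  } catch {
    return old; // failed migration is a structural no-op
  }
}
```

The injected-failure test then reduces to passing an importIntoSuccessor that throws and asserting that the old owner is returned with its kill rights intact. At no point do two owners hold the flag at once, which is what rules out a double kill.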

A3 · Snapshot-on-attach — promote to an invariant test

[medium · direct] herdr brings a connecting client current from the live emulator screen, never by replaying history (reset baseline → one full frame). (herdr render_stream.rs:31-36; headless.rs:1639.) kolu does the same: attach() subscribes before it serializes, both synchronously, so each chunk lands in exactly one of snapshot/deltas. (verified: packages/pty-host/src/ptyHost.ts:501-509.) Add an invariant test (subscribe-during-burst sees no gap/overlap), and cap R-3 migration-only replay at a small bound like herdr’s 8KB/pane. The keyboard-protocol question (G3) was answered in #1255: @kolu/terminal-protocol’s SNAPSHOT_TTY_RESET is the reciprocal of SerializeAddon 0.14.x’s mode vocabulary (alt-screen, mouse, bracketed paste). Kitty keyboard is not serialized by 0.14.x, so audit on every xterm bump.
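The subscribe-then-serialize ordering, and the invariant a test would pin down, can be illustrated with a toy model. This is a sketch under assumptions: MirrorTerminal and AttachFrame are hypothetical names, and the "screen" is just accumulated text standing in for a real emulator’s serialized state.

```typescript
// Sketch only: MirrorTerminal stands in for a headless emulator; its "screen"
// is plain accumulated text, which a real serializer would not be.
type AttachFrame =
  | { kind: "snapshot"; data: string }
  | { kind: "delta"; data: string };

class MirrorTerminal {
  private screen = "";
  private subscribers = new Set<(chunk: string) => void>();

  /** PTY output arrives here: update the mirror, then fan out to subscribers. */
  write(chunk: string): void {
    this.screen += chunk;
    this.subscribers.forEach((fn) => fn(chunk));
  }

  /** Attach a client. Returns a detach function. */
  attach(onFrame: (frame: AttachFrame) => void): () => void {
    const sub = (chunk: string): void => onFrame({ kind: "delta", data: chunk });
    this.subscribers.add(sub); // 1. subscribe first
    onFrame({ kind: "snapshot", data: this.screen }); // 2. serialize in the same tick
    return () => {
      this.subscribers.delete(sub);
    };
  }
}
```

Because both steps run synchronously, no write can interleave between them: everything written before attach is in the snapshot, and everything after arrives as a delta. The invariant test is that snapshot plus deltas reproduces the full stream with no gap and no overlap.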

A7 · SCM_RIGHTS live fd-passing — explicit NON-GOAL

[drop · category-mismatch] herdr passes live master fds via sendmsg/recvmsg ancillary messages because its whole server restarts. (herdr server/handoff.rs:370-454,:26 MAX_FDS_PER_HANDOFF=64; handoff_runtime.rs:5-21.) kolu inverts this: the pty-host is the survivor and stays alive across a deploy, so there is no fd to move on the common path. node-pty exposes no master-fd handle. Fd-passing would matter only for upgrading the daemon binary itself, which the cgroup-survival design deliberately makes rare. Write it down as a named non-goal so a future contributor who reads this note doesn’t reintroduce the #1034 race for marginal benefit. (A6 remote-over-SSH: borrow herdr’s reattach-hint UX, not the transport. The earlier “C1 pool-key leak” R-2 blocker is stale: hostSession.ts:735 already keys on (host, binary) with .drv excluded, fixed in #1054.)

UX — what to adopt

U1 · Unified attention rollup with 'Done = finished-but-unseen'

[high · adapt] herdr’s highest-value UX idea: a 4th state Done = (Idle, seen=false) and a single pane_attention_priority (Blocked > Done-unseen > Working > Idle-seen) that feeds the sidebar dot, tab/workspace rollup, mobile summary, navigator, and the wait agent-status API. It is defined once, so it never drifts. (herdr aggregate.rs:66-74; api_helpers.rs:70-81.) For kolu: add a per-terminal seen bit keyed off canvas focus/visibility, derived in the live-metadata layer (not persisted, to avoid the autosave firehose), feeding the dock badge, a palette “jump to next unreviewed,” and later the kolu-tui list column.

Caveat: kolu’s existing unread attention bit (verified: useViewState.ts:14,:71; "unread" | "badge-only", keyed by terminal id, cleared in activate()) is a separate signal from agent-turn-finished. Keep them as distinct inputs to one rollup rather than overloading unread; conflating them regresses the badge.
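The priority order above (Blocked > Done-unseen > Working > Idle-seen) is small enough to state as code. The following is a minimal sketch with hypothetical type and function names; only the ordering and the Done = (Idle, seen=false) derivation come from the text.

```typescript
// Sketch only: type and function names are illustrative, not herdr's or kolu's.
type AgentState = "blocked" | "working" | "idle";
type Attention = "blocked" | "done" | "working" | "idle";

interface PaneStatus {
  state: AgentState;
  seen: boolean; // has the user looked at this pane since its last turn finished?
}

/** Done is derived, not stored: an idle pane the user has not seen yet. */
function attentionOf(pane: PaneStatus): Attention {
  if (pane.state === "idle" && !pane.seen) return "done";
  return pane.state;
}

// Blocked > Done-unseen > Working > Idle-seen
const PRIORITY: Record<Attention, number> = {
  blocked: 3,
  done: 2,
  working: 1,
  idle: 0,
};

/** Roll a group of panes up to its single most urgent attention state. */
function rollup(panes: PaneStatus[]): Attention {
  return panes
    .map(attentionOf)
    .reduce<Attention>((best, a) => (PRIORITY[a] > PRIORITY[best] ? a : best), "idle");
}
```

Every surface (badge, palette, list column) would call the same rollup, which is what keeps the definition from drifting. The unread-bytes signal is deliberately absent here. Per the caveat, it would enter as a separate input, not as an overload of seen.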

U2 · Agent-state model — 'blocked' equivalent now ships via screen scrape

[medium · adapt] herdr’s detection breadth (process-name + output heuristics + socket-API hooks across 14 agents) validates kolu’s agent-agnostic philosophy. The claude-code state enum is thinking | tool_use | waiting | awaiting_user | running_background. There is still no literal blocked, but kolu now produces the blocked-equivalent: awaiting_user fires via the #905 screen-scrape recovery, which recognizes AskUserQuestion and tool-permission prompts on the rendered screen (screen.ts) while the dialog is visible. (verified: packages/integrations/claude-code/src/schemas.ts:35-55.) Adopt working/idle/done-unseen now, and map awaiting_user into the rollup as Blocked. The hook side-channel herdr uses (PreToolUse/PermissionRequest → state, with defensive discipline: temp-file stdin, short timeout, swallow-all, exit 0) is exactly what #905 proposed; kolu shipped that signal recovery, though via the screen scrape rather than hooks.

Gaps the plans don’t cover

These are the highest leverage: herdr ideas with no current plan coverage. One is real and clean; one is a pre-existing condition the plans never arbitrate.

G1 · Native agent-session resume (claude --resume <id>)

[missed adoptable · high] herdr restores a finished agent by replaying its native session id through a strict data-not-shell-text argv: an allowlisted {source, agent, kind, value} ref validated for length caps, no control chars, and absolute paths, so a hostile id can’t shell-inject. (herdr agent_resume.rs:99-140, :160-169.) kolu’s restore launches the continue form (claude -c / codex resume --last via resumeAgentCommand), which resumes the most recent conversation in the cwd, not the terminal’s exact session. (verified: useSessionRestore.ts → resumeAgentCommand; agent-cli.ts:141-144.) No native session id is persisted: the claude-code integration’s live session-JSONL watcher surfaces a sessionId, but it never lands in ServerPersistedTerminalFieldsSchema (cwd/git/lastAgentCommand/lastActivityAt only), so two terminals sharing a cwd can resume the wrong conversation. The work: persist that native session id and offer claude --resume <id> with herdr’s injection-safe ref model. This upgrades a feature kolu ships in a weaker form.
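A data-not-shell-text ref can be sketched as a validator that only ever yields an argv array. Everything specific here is an assumption made for illustration: the "session_id" kind, the 128-character cap, the allowlist contents, and the extra rejection of values starting with "-" are not taken from herdr’s agent_resume.rs.

```typescript
// Sketch only: the kind name, length cap, and allowlist are assumptions,
// not herdr's actual validation rules.
interface ResumeRef {
  source: string; // where the id was observed, e.g. a session watcher
  agent: string;
  kind: string;
  value: string;
}

const MAX_VALUE_LENGTH = 128; // assumed cap
const ALLOWED = new Map<string, Set<string>>([["claude", new Set(["session_id"])]]);
const CONTROL_CHARS = /[\u0000-\u001f\u007f]/;

/** Returns a rejection reason, or null when the ref is acceptable. */
function validateResumeRef(ref: ResumeRef): string | null {
  const kinds = ALLOWED.get(ref.agent);
  if (kinds === undefined || !kinds.has(ref.kind)) return "agent/kind not allowlisted";
  if (ref.value.length === 0 || ref.value.length > MAX_VALUE_LENGTH) return "bad length";
  if (CONTROL_CHARS.test(ref.value)) return "control characters";
  // Extra assumption: refuse anything a CLI parser could read as an option.
  if (ref.value.startsWith("-")) return "looks like an option";
  return null;
}

/** Build an argv array. The id is one argument and is never spliced into shell text. */
function resumeArgv(ref: ResumeRef): string[] {
  const reason = validateResumeRef(ref);
  if (reason !== null) throw new Error(`rejected resume ref: ${reason}`);
  return ["claude", "--resume", ref.value];
}
```

The resulting array is meant for a spawn call that takes arguments as a list, so the id stays a single argv entry and shell metacharacters in it have nothing to act on. The validation is defense in depth on top of that.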

G2 · Multi-client geometry arbitration — no arbiter today

[pre-existing gap · high] Verified: kolu has no resize arbitration. ptyHost.ts:567 is last-write-wins (no per-client size, no foreground concept), the router forwards it directly, and each client drives resize from its own xterm grid via a ResizeObserver (Terminal.tsx:361). Because attach() fans out to multiple concurrent subscribers (channel.ts), two differently-sized clients already thrash today in the pure-web case (a desktop browser + a phone over --host 0.0.0.0), so this is not introduced by kolu-tui. kolu-tui (shipped in #1255) resizes the PTY to its own terminal’s grid on attach and on every local resize, making the mismatch sharper. The plan endorses the shared-PTY case (“feature, not bug”) and has kolu-tui issue SIGWINCH → resize, but never specs an arbiter. herdr solved exactly this with a single foreground_client_id whose size drives shared geometry. (herdr headless.rs:115,:492-525; effective = foreground/most-recent client.) Phase 2 shipped (#1255) with the policy documented as last-resize-wins (attach.ts); a true arbiter (foreground client / fit-to-smallest, plus the size-change tap attach.ts names) remains open.
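To make the two candidate policies concrete, here is a minimal sketch of a foreground-client arbiter next to a fit-to-smallest helper. All names are hypothetical, and two rules are assumptions made for this sketch: the first client to report becomes foreground, and on foreground detach the most recently attached remaining client takes over.

```typescript
// Sketch only: ResizeArbiter and its fallback rules are illustrative assumptions,
// not herdr's implementation and not a kolu API.
interface Size {
  cols: number;
  rows: number;
}

class ResizeArbiter {
  private sizes = new Map<string, Size>();
  private foreground: string | null = null;

  /** Record a client's size. Only the foreground client's size is effective. */
  report(clientId: string, size: Size): Size | null {
    this.sizes.set(clientId, size);
    if (this.foreground === null) this.foreground = clientId;
    return this.effective();
  }

  /** Promote a client (for example on user input) so its size drives the PTY. */
  focus(clientId: string): Size | null {
    if (this.sizes.has(clientId)) this.foreground = clientId;
    return this.effective();
  }

  /** Remove a client. If it was foreground, fall back to the newest remaining one. */
  detach(clientId: string): Size | null {
    this.sizes.delete(clientId);
    if (this.foreground === clientId) {
      this.foreground = Array.from(this.sizes.keys()).pop() ?? null;
    }
    return this.effective();
  }

  /** The size the shared PTY should have right now, or null with no clients. */
  effective(): Size | null {
    if (this.foreground === null) return null;
    return this.sizes.get(this.foreground) ?? null;
  }
}

/** Alternative policy: shrink to the smallest attached client. */
function fitToSmallest(sizes: Size[]): Size | null {
  if (sizes.length === 0) return null;
  return {
    cols: Math.min(...sizes.map((s) => s.cols)),
    rows: Math.min(...sizes.map((s) => s.rows)),
  };
}
```

Under the foreground policy, a background client’s ResizeObserver firing no longer changes PTY geometry, which is what stops the thrash. Fit-to-smallest instead guarantees every client can render the full grid, at the cost of wasting space on the larger one.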

G3 · Smaller gaps worth a look

[investigate]

Two-tier persistence + privacy: herdr splits structural session.json (always written, no bytes) from opt-in, deletable session-history.json (scrollback). kolu’s @xterm/headless snapshot is live-memory only, so a cold daemon restart loses all scrollback. Decide whether an opt-in scrollback-to-disk tier is wanted.
Keyboard-protocol / alt-screen preservation: herdr carries kitty-keyboard / bracketed-paste bytes across migration because the child won’t re-emit them. Resolved by #1255: @kolu/terminal-protocol (snapshotReset.ts, bracketedPaste.ts) enumerates and resets exactly the modes SerializeAddon emits. Remaining caveat: an xterm/serialize upgrade that starts serializing kitty-keyboard must extend the reset. (Pairs with A3.)
Control ⟂ data channel invariant: herdr keeps a reliable unbounded control channel separate from the droppable render channel. kolu drops the whole subscriber on overflow (channel.ts:120-130, ends the iterator with no overflow-vs-exit distinction). Make “control-plane events never ride the droppable substrate” explicit, and for R-3 emit a distinguishable overflow vs exit frame.
Retryable-error working-hold: herdr treats provider 5xx/overload as continued Working with a grace before flipping, so transient API failures don’t flicker the pane to done. kolu has no equivalent.
Non-interactive TTY-guard for kolu-tui: herdr hard-requires an interactive TTY for destructive prompts. kolu-tui runs agent-driven and in CI — attach (with its ~. escape) already hard-requires a TTY (shipped in Phase 2: non-tty fails loud, pointing at snapshot); define the equivalent contract for kill/spawn before Phase 3.
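Of the items above, the control ⟂ data point is the easiest to show in code. The following is a minimal sketch of a bounded subscriber queue whose terminal frame says why the stream ended. BoundedSubscriber and SubscriberFrame are hypothetical names, and this is not how channel.ts is written today.

```typescript
// Sketch only: a hypothetical bounded queue, not kolu's channel.ts.
type EndReason = "exit" | "overflow";
type SubscriberFrame =
  | { kind: "data"; chunk: string }
  | { kind: "end"; reason: EndReason };

class BoundedSubscriber {
  private readonly capacity: number;
  private queue: SubscriberFrame[] = [];
  private ended = false;

  constructor(capacity: number) {
    this.capacity = capacity;
  }

  /** Droppable path: data frames count against the capacity. */
  push(chunk: string): boolean {
    if (this.ended) return false;
    if (this.queue.length >= this.capacity) {
      this.end("overflow"); // slow consumer: close with a distinguishable reason
      return false;
    }
    this.queue.push({ kind: "data", chunk });
    return true;
  }

  /** Control path: the end frame is always delivered, even when the queue is full. */
  end(reason: EndReason): void {
    if (this.ended) return;
    this.ended = true;
    this.queue.push({ kind: "end", reason });
  }

  /** Hand all queued frames to the consumer and empty the queue. */
  drain(): SubscriberFrame[] {
    const out = this.queue;
    this.queue = [];
    return out;
  }
}
```

A client that sees end with reason "overflow" can reattach and take a fresh snapshot, while "exit" means the terminal is gone. The end frame bypasses the capacity check on purpose: it is the one control event here, and it must not be droppable.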

What to do next

Phase B, now (low risk): adopt the transactional handoff discipline (A1), the single-owner kill invariant (A2), the snapshot invariant test (A3), the two-axis honest-state + inline-recovery-hint (A4). Add the SCM_RIGHTS non-goal note (A7).
Close G2 (multi-client resize arbitration) — Phase 2 shipped with documented last-resize-wins; an arbiter (and the size-change tap attach.ts already names) is still open.
kolu-tui: A5’s socket path and G3’s attach TTY-guard shipped (#1084, #1255); carry the non-tty contract forward to Phase 3’s kill/spawn.
UX: ship the attention rollup with Done = unseen (U1), keeping unread-bytes distinct from turn-finished; fold the navigator into the palette (U3).
Investigate G1 (native --resume) — the clearest missed adoptable; builds on data kolu already has.
U2’s blocked signal: #905 shipped (screen-scrape awaiting_user) — remaining: map awaiting_user into the U1 rollup as Blocked. Later: R-2 reattach-hint UX (A6).

Net: herdr is the reference implementation for the survivor kolu already chose to build — most of its architecture validates R-4 rather than redirecting it, with one battle-tested checklist to harden Phase B (A1) and one explicit non-goal to write down (A7). The durable surprises are the two gaps: native session resume (shipped in weaker form) and multi-client resize arbitration (latent in an already-endorsed feature).