Kaval PTY-host heap OOM · the kolu Atlas

Production postmortem — pureintent, the always-on kolu.service. Latest crash 2026-06-19 14:26; the fifth with an identical signature since 2026-05-27. Surfaced by a “why did kolu restart?” investigation.

Where the leak lives, and why kaval ran blind: the home-manager diagnostics option reaches the server only; localDriver strips the heap-snapshot flag from NODE_OPTIONS before kaval — the one process that actually OOMs.

What crashed

Two processes, one cascade. The kaval PTY host — the daemon that holds the live PTY file descriptors for every terminal (kaval/src/bin.ts) — exhausted its V8 JavaScript heap and self-aborted. Losing its only PTY host, the kolu server then fail-fast-exited, and systemd restarted it.

Time (EDT)	Event	Evidence
14:25:49	kaval GC pinned at the ceiling: `Mark-Compact (reduce) 4083.2 → 4058.0 MB`, then `FATAL ERROR: Ineffective mark-compacts near heap limit — JavaScript heap out of memory`	kaval unit journal
14:25:49	kaval aborts: `node::OOMErrorHandler → abort → raise`	coredump stack, thread 329839
14:26:08	Coredump captured — 732 MB compressed (`Signal 6 / ABRT`); kaval socket closes	`coredumpctl info 329839`
14:26:09.04	kolu server sees `[@kolu/surface/links/stdio] outbound write error: read ECONNRESET`, then a storm of `pty-host tap subscription failed` / `terminal.spawn failed`	`kolu.service` journal
14:26:09.07	kolu `FATAL … uncaught exception` → `Error: write ECANCELED`	`kolu.service` journal
14:26:09	systemd: `Main process exited, status=1/FAILURE` → `Failed with result 'exit-code'` → `Scheduled restart job, restart counter is at 1`	`systemd[1274]`
14:26:09	New server (`serverId 1010bc7d`) up with a fresh kaval on the same socket — healthy	`systemctl --user show` (`NRestarts=1`)

This was not a kernel OOM-kill (signal is 6/ABRT, a userspace self-abort — not 9/SIGKILL; journalctl -k had zero OOM lines; the unit’s MemoryMax=infinity), not a deploy (status=1/FAILURE + scheduled restart is a crash, not a clean stop/start), and not disk-full (/ at 78 %, 190 GB free). The kolu exit is causally tied to the kaval death — the ECONNRESET/ECANCELED both originate on the pty-host link and fire in the same ~30 ms the kaval socket closes.
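The elimination above can be written down as a small decision function. This is only a sketch of the reasoning: the `ExitEvidence` record and `classifyExit` are hypothetical names, not anything in kolu, and the inputs are the three pieces of evidence the paragraph cites (terminating signal, kernel OOM lines, exit status).

```typescript
// Hypothetical evidence record; field names are illustrative, not kolu's.
type ExitEvidence = {
  signal: number | null;     // terminating signal reported by the coredump, if any
  kernelOomLines: number;    // number of OOM lines found in `journalctl -k`
  exitStatus: number | null; // main-process exit status reported by systemd
};

type Cause =
  | "kernel-oom-kill"
  | "userspace-abort"
  | "other-signal"
  | "crash-exit"
  | "clean-stop";

function classifyExit(e: ExitEvidence): Cause {
  // SIGKILL (9) together with kernel OOM lines is the OOM-killer's footprint.
  if (e.signal === 9 && e.kernelOomLines > 0) return "kernel-oom-kill";
  // SIGABRT (6) is raised by the process itself, here via V8's OOM handler.
  if (e.signal === 6) return "userspace-abort";
  // Any other signal is outside the cases this postmortem distinguishes.
  if (e.signal !== null) return "other-signal";
  // A non-zero status with no signal is a crash, not a deploy's clean stop.
  if (e.exitStatus !== null && e.exitStatus !== 0) return "crash-exit";
  return "clean-stop";
}
```

With the evidence from the timeline, kaval (signal 6, zero kernel OOM lines) classifies as a userspace abort and the kolu server (status 1, no signal) as a crash exit.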

Root cause: live terminals accumulate, each pinning a 50 K-line mirror

Reproduced and confirmed by driving createPtyHost in isolation on a clean box (naiveintent). kaval keeps, per live PTY, an @xterm/headless screen mirror sized at DEFAULT_SCROLLBACK = 50_000 lines (config.ts, passed at ptyHost/index.ts). The heap is linear in live-terminal count and flat under everything else:

Driver	Heap behaviour
1 terminal, unbounded `yes` output, 90 s	oscillates 14–64 MB — bounded (scrollback caps it)
attach + abort a subscription, tight loop	flat — bounded (abort cleanup works)
spawn + write + kill, 8 000×	flat — bounded (`teardown` is complete)
spawn terminals, never kill	linear — ~18 MB V8 heap/terminal at the production 50 K (+ ~44 MB external cell buffers)

Under a 1 GB old-space cap at the production 50 K scrollback, the host dies with the exact production signature — FATAL ERROR: Reached heap limit Allocation failed - JavaScript heap out of memory — at ~54 fully-scrolled terminals, heap climbing linearly 358 → 970 MB. The abort is an old-space heap event, so the driver is ~18 MB of V8 heap per terminal (the BufferLine / Object / typed-array wrappers); the cell payloads add ~44 MB of external ArrayBuffer memory each — real RSS pressure, but not counted by the heap limit that aborts. A one-terminal snapshot at 10 K scrollback shows ~10 000 each of ArrayBuffer / Uint32Array / xterm BufferLine — the scrollback grid, nothing else — scaling ~5× at the production 50 K.

So the operative answer: heap is proportional to live terminal count (× each terminal’s scrollback fill). A single busy terminal, or terminal/subscription churn, is bounded — activity alone doesn’t grow it. The default ~4 GB old-space ceiling is reached at a couple hundred fully-scrolled terminals (≈ 4 GB ÷ ~18 MB ≈ 220; fewer in practice, summed across partially-filled ones).
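As a back-of-the-envelope model of that proportionality, the sketch below sums a per-terminal cost weighted by how full each terminal's scrollback is. The ~18 MB constant is the measured figure from the repro; the function names and the idea of a 0..1 "fill" fraction are assumptions made for illustration, and the model deliberately ignores the fixed per-terminal floor and the process's baseline heap.

```typescript
// Measured V8 heap per fully-scrolled terminal at the 50 K scrollback (from the repro).
const HEAP_PER_FULL_TERMINAL_MB = 18;

// fills: one entry per live terminal, each the fraction (0..1) of scrollback in use.
function estimateHeapMb(fills: number[]): number {
  return fills.reduce((sum, fill) => {
    const clamped = Math.min(Math.max(fill, 0), 1); // tolerate out-of-range input
    return sum + clamped * HEAP_PER_FULL_TERMINAL_MB;
  }, 0);
}

// How many fully-scrolled terminals fit under a given old-space limit.
function fullTerminalsToLimit(limitMb: number): number {
  return Math.floor(limitMb / HEAP_PER_FULL_TERMINAL_MB);
}
```

For example, `estimateHeapMb([1, 1, 0.5])` gives 45 MB, and `fullTerminalsToLimit(4096)` gives 227, in line with the "couple hundred" estimate once baseline heap is subtracted.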

Why the count grows without bound: reconcile never reaps — a surviving kaval’s live PTYs are all adopted across every server restart (the “terminals survive a kolu update” guarantee), and a terminal is freed only when its child process exits or the user explicitly kills it (reconcile.ts). Long-lived shells and agents — across many worktrees, over days, ratcheted up by each crash-restart — accumulate, each pinning a 50 K-line mirror. Not a teardown bug (teardown is clean): an unbounded, never-reaped population × a large per-terminal retainer.

Recurrence is real — same signature on 2026-05-27, 06-13 (825 MB), 06-15 (632 MB), 06-16 (809 MB), 06-19 (732 MB), ~every 2–6 days (coredumpctl list). Red herrings now ruled out: a ~60-RPC attach burst 15 s pre-crash (a single WebSocket disconnect aborting in-flight RPCs); the bounded structures (exit tombstones ≤ 1024, per-subscriber queues ≤ 10 K, exit-waiters) — none leak.

Does this hurt kaval’s performance? — yes, before it ever crashes

This is the part that bites day-to-day, not just at the moment of death. kaval is the hot path for all terminal I/O — every byte in and out of every PTY crosses its event loop. As the heap creeps toward 4 GB, V8’s GC runs more often and for longer, ending in the “ineffective mark-compacts” the crash banner names: long, stop-the-world pauses that are the leak’s tail. While GC holds the loop, kaval can’t relay output, echo keystrokes, or deliver exit signals — so terminals feel progressively laggier the longer the server has been up, worst in the hours before the OOM, then snap back to crisp after the restart-induced fresh heap. So the leak has two costs: a hard crash every few days, and a soft “kolu gets sluggish over time” that a restart silently papers over. (The leak floor that @kolu/heap-diag is tuned for is ~10 MB/min.)
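One way to make the slow creep visible is to sample heap usage periodically and compare the growth rate against the ~10 MB/min figure quoted above. The sketch below is not @kolu/heap-diag's implementation; `HeapSample`, `growthMbPerMin`, and `looksLikeLeak` are hypothetical names, and treating the floor as a simple threshold is an assumption. Only `process.memoryUsage()` is a real Node API here.

```typescript
type HeapSample = { atMs: number; heapUsedBytes: number };

// Take one sample of the current V8 heap usage.
function takeSample(): HeapSample {
  return { atMs: Date.now(), heapUsedBytes: process.memoryUsage().heapUsed };
}

// Average growth in MB/min between the first and last sample.
function growthMbPerMin(samples: HeapSample[]): number {
  if (samples.length < 2) return 0;
  const first = samples[0];
  const last = samples[samples.length - 1];
  const minutes = (last.atMs - first.atMs) / 60_000;
  if (minutes <= 0) return 0;
  return (last.heapUsedBytes - first.heapUsedBytes) / (1024 * 1024) / minutes;
}

// Assumed threshold: the ~10 MB/min leak floor quoted in the text.
const LEAK_FLOOR_MB_PER_MIN = 10;

function looksLikeLeak(samples: HeapSample[]): boolean {
  return growthMbPerMin(samples) >= LEAK_FLOOR_MB_PER_MIN;
}
```

A two-point estimate like this is noisy around GC cycles; a real detector would likely smooth over more samples or compare post-GC minima.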

Why we can’t see it yet — the diagnostics gap

kolu already has heap diagnostics: set KOLU_DIAG_DIR and the server writes a baseline snapshot, logs subsystem sizes every 5 min, and arms --heapsnapshot-near-heap-limit=3 so V8 dumps a snapshot just before an OOM (the shared @kolu/heap-diag receptacle that #1427 extracted; see heap-diag). The home-manager module already exposes it as services.kolu.diagnostics.dir (module.nix).

But it instruments only the server (Node 22), the process that doesn’t crash. kaval (Node 24, the one that does) is deliberately excluded: localDriver strips the heap-snapshot flag from NODE_OPTIONS before spawning kaval.

The leak was named without it (the in-process repro on a clean box did the job), but the gap mattered: with kaval instrumented, prod would have shown the terminal-count curve climbing for days, and a near-limit snapshot would have confirmed the same scrollback grid in the real workload. The gap is now closed in #1427: localDriver forwards KOLU_DIAG_DIR and kaval’s wrapper arms its own near-limit snapshot under a kaval-private subdir, so the next approach to the (now bounded) ceiling dumps a snapshot instead of aborting silently.
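A minimal sketch of what forwarding diagnostics to the child could look like follows. It is not the code from #1427: `kavalDiagLaunch` and the `kaval` subdirectory name are hypothetical. `--heapsnapshot-near-heap-limit` is the flag named above and `--diagnostic-dir` is a real Node flag, but whether near-limit snapshots land in that directory should be verified against the Node version in use.

```typescript
import * as path from "node:path";

type DiagLaunch = { execArgv: string[]; env: Record<string, string> };

// Hypothetical helper: derive the child's extra flags and env from the parent's env.
function kavalDiagLaunch(parentEnv: Record<string, string | undefined>): DiagLaunch {
  const diagDir = parentEnv.KOLU_DIAG_DIR;
  // Diagnostics are opt-in: without a directory, add nothing.
  if (!diagDir) return { execArgv: [], env: {} };

  // Assumed layout: a kaval-private subdirectory under the shared diag dir.
  const kavalDir = path.posix.join(diagDir, "kaval");
  return {
    execArgv: [
      "--heapsnapshot-near-heap-limit=3",
      // Assumption: near-limit snapshots honor --diagnostic-dir.
      `--diagnostic-dir=${kavalDir}`,
    ],
    env: { KOLU_DIAG_DIR: kavalDir },
  };
}
```

Passing explicit `execArgv` to the child, rather than relying on an inherited NODE_OPTIONS, keeps the server's and kaval's flags independent, so scrubbing one cannot silently disable the other.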

How other terminals & multiplexers bound this

A sweep of the field (ghostty, tmux, zellij, kitty, wezterm, GNU screen, mosh, VTE, zmosh) is near-unanimous, and kaval is the outlier: a 50 K-line server-side mirror is 5–50× everyone else’s default:

System	Default history	Bounding	Reattach restores
kaval (today)	50 000 lines / PTY	none — no cap, no reap	full mirror, eagerly serialized
GNU screen	100 lines	line cap	viewport (RAM)
xterm.js / headless	1 000 rows (library default — kolu overrides to 50 K)	ring buffer	ANSI snapshot (`SerializeAddon`)
tmux	2 000 lines / pane	line cap + batch trim	viewport; scrollback lazily in copy-mode
kitty	2 000 in-RAM	overflow spills to a temp file → pager	n/a (emulator)
wezterm-mux	3 500 lines	line cap (uncompressed)	viewport; lazily by range (`GetLines` RPC)
zellij	10 000 lines	ring + serialize panes to disk	cold: from disk, not RAM
ghostty	10 MB (bytes)	byte cap + page-trim, ~12.5 B/cell	n/a (emulator)
zmosh	ghostty-vt byte cap	ring evict	serialize VT (scrollback + viewport)
mosh	0 (viewport only)	keeps no history at all	only the live screen, ever
VTE (GNOME)	“infinite”	LZ4 + AES disk ring, near-0 resident (hot pages only)	n/a (emulator)

Three lessons that reshape the fix:

Nobody holds deep history as live cell-objects — it’s a small ring (tmux 2 K) or pushed off the hot path to disk (zellij / kitty / VTE, the last LZ4-compressing its disk ring). kaval’s 50 K of live BufferLine objects is the anomaly. (In V8 the killer is object-header + GC pressure, not raw bytes — so deep history wants to be a compressed Buffer or a file, never live xterm lines.)
No multiplexer eagerly serializes the whole mirror on reattach — tmux / wezterm repaint the viewport and stream older lines lazily, by range, on scroll. kaval’s attach() → full-buffer serialize() is exactly the avoidable cost.
Cap by bytes, not lines — a wide blank line still costs, and agent streaming is the real-world OOM driver elsewhere too (tmux #4859 ≈ 48 GB, ghostty 37 GB), not deep interactive history.
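The third lesson, capping by bytes rather than lines, can be illustrated with a tiny history buffer that evicts its oldest lines once a byte budget is exceeded. This is a sketch under stated assumptions, not kaval's or any listed terminal's implementation; `ByteCappedHistory` is a hypothetical name and lines are modelled as raw byte arrays.

```typescript
// History buffer bounded by total bytes instead of line count.
class ByteCappedHistory {
  private lines: Uint8Array[] = [];
  private bytes = 0;

  constructor(private readonly maxBytes: number) {}

  push(line: Uint8Array): void {
    this.lines.push(line);
    this.bytes += line.byteLength;
    // Evict oldest lines until back under budget; the newest line is always
    // kept, even if it alone exceeds the budget.
    while (this.bytes > this.maxBytes && this.lines.length > 1) {
      const evicted = this.lines.shift()!;
      this.bytes -= evicted.byteLength;
    }
  }

  get lineCount(): number {
    return this.lines.length;
  }

  get byteCount(): number {
    return this.bytes;
  }
}
```

With a byte budget, one very wide line costs as much as many narrow ones, which a line cap cannot express. `Array.prototype.shift` is O(n); a production version would more likely use a circular buffer.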

The fix — a small hot mirror over an on-disk transcript log

The 50 K was never for scrolling. #416 bumped it from 10 K for PDF export (#413) — so a naive shrink would regress export. And the real fix is already designed: #417 · server-side transcript log, the on-disk source of truth that #416 explicitly called itself a bandaid for. This RCA promotes #417 from a features ticket to the memory fix — and corrects its non-goal #1 (“the ~4 KB attach snapshot is already optimal”): attach() serializes the whole buffer with no scrollback limit (ptyHost.ts:571), so that path is the cost, not a constant.

The plan of record — which is also just what the field does (small hot buffer + deep history off the hot path + lazy backfill):

Target shape. Every PTY byte feeds a small live mirror (viewport + just enough for metadata, scrape, cold repaint — capped by bytes) and appends to an on-disk transcript log (#417). Clients attach against the small mirror; deep scroll-back, PDF export, and search lazily backfill from the log — no live 50 K cell-grid retained per terminal, anywhere.

1. Keep the in-RAM mirror small & constant

Size the headless mirror to what the live jobs actually need — viewport + a small window for the metadata OSC handlers, device-query replies, screen-scrape tail, and cold-attach repaint — capped by bytes, not 50 K lines. Shells are never reaped — the survivability guarantee is untouched; a small mirror makes an idle terminal nearly free, which also dissolves the “should we reap idle terminals?” tension.

Measured, not extrapolated — re-running the accumulation repro under a fixed 512 MB old-space cap at three mirror sizes:

Mirror scrollback	V8 heap / terminal	Terms → OOM (512 MB cap)	→ at the 4 GB ceiling
50 K (today)	~16 MB	32	~256 terminals
10 K	~3.9 MB	130 (4.1×)	~1,050
2 K	~1.0 MB	≥396 — no OOM (≥12×)	~4,000

A ~2 K mirror turns a ~256-terminal ceiling into ~4,000 (and cuts external ArrayBuffer RSS ~15× too). The gain is sub-linear (a 25× line cut buys ~16×, not 25×) because a fixed ~0.5–1 MB/terminal floor (node-pty handle, Entry, channels, the Terminal instance) doesn’t shrink. Two honest bounds: the cap only helps terminals that exceed it (a 500-line terminal is unaffected, but the deep-scrollback agent terminals that do drive the OOM benefit fully), and it raises the ceiling rather than removing the linear-in-count growth (that’s what steps 2 and 3 plus reaping are for).
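The extrapolated column and the sub-linear gain follow directly from the measured per-terminal heap. The sketch below just redoes that arithmetic; the numbers are the ones in the table, while the names (`measured`, `terminalsAtCeiling`) and the 4096 MB reading of "4 GB" are assumptions.

```typescript
// Measured V8 heap per terminal at each mirror size (from the table above).
const measured = [
  { scrollback: 50_000, heapMbPerTerminal: 16 },
  { scrollback: 10_000, heapMbPerTerminal: 3.9 },
  { scrollback: 2_000, heapMbPerTerminal: 1.0 },
];

const CEILING_MB = 4096; // default ~4 GB old-space ceiling, taken as 4096 MB

// Terminals that fit under the ceiling, ignoring baseline heap.
function terminalsAtCeiling(heapMbPerTerminal: number): number {
  return Math.round(CEILING_MB / heapMbPerTerminal);
}

const ceilings = measured.map((m) => terminalsAtCeiling(m.heapMbPerTerminal));

// Line cut vs. heap gain between the 50 K and 2 K rows.
const lineCut = measured[0].scrollback / measured[2].scrollback;
const heapGain = measured[0].heapMbPerTerminal / measured[2].heapMbPerTerminal;
```

Here `ceilings` comes out as 256, 1050, and 4096, while `lineCut` is 25 against a `heapGain` of 16: the gap is the fixed per-terminal floor that scrollback size does not touch.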

2. Deep history → the on-disk transcript log (#417)

Append every PTY byte to a per-terminal log on disk — raw bytes, the honest replayable source (rendered/serialized state is lossy). PDF export (the reason 50 K exists), scrollback search, true session restore, and crash forensics all read from it; depth is bounded by a disk retention policy, not RAM. This is #417 as already specified — reuse it, don’t build a parallel store.
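To make the shape concrete, here is a minimal sketch of an append-only, per-terminal byte log with offset-based reads. It is not the #417 design: `TranscriptLog` is a hypothetical class, it uses synchronous `node:fs` calls for brevity, and it has no retention, rotation, or fsync policy.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Hypothetical append-only log of raw PTY bytes for one terminal.
class TranscriptLog {
  private size: number;

  constructor(private readonly file: string) {
    // Resume at the existing length so offsets stay stable across restarts.
    this.size = fs.existsSync(file) ? fs.statSync(file).size : 0;
  }

  // Append a chunk and return the byte offset at which it was written.
  append(chunk: Uint8Array): number {
    const offset = this.size;
    fs.appendFileSync(this.file, chunk);
    this.size += chunk.byteLength;
    return offset;
  }

  // Read up to `length` bytes starting at `start`; short reads near EOF are fine.
  readRange(start: number, length: number): Buffer {
    const fd = fs.openSync(this.file, "r");
    try {
      const buf = Buffer.alloc(Math.max(0, length));
      const n = fs.readSync(fd, buf, 0, buf.length, start);
      return buf.subarray(0, n);
    } finally {
      fs.closeSync(fd);
    }
  }

  get byteLength(): number {
    return this.size;
  }
}

// Usage: write two chunks to a temp file, then read the second back by offset.
const dir = fs.mkdtempSync(path.join(os.tmpdir(), "transcript-"));
const log = new TranscriptLog(path.join(dir, "term-1.log"));
log.append(Buffer.from("hello "));
const worldAt = log.append(Buffer.from("world"));
const tail = log.readRange(worldAt, 5).toString("utf8");
```

Storing raw bytes keeps the log replayable through any terminal emulator later, at the cost of needing a separate index if readers want to address it by line rather than by byte offset.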

3. Lazy backfill on deep scroll

Stop eager-serializing the mirror on attach. Repaint the viewport from the small mirror, then when a client scrolls past the hot window, fetch older ranges from the log and render them (the tmux / wezterm pattern). Cold reconnect becomes cheap and lossless.
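The routing decision behind lazy backfill is small: a requested range is served from the hot mirror where it overlaps the hot window and from the log otherwise. The sketch below shows only that split; `planFetch` and `RangeSource` are hypothetical names, ranges are half-open `[start, end)`, and it assumes some line-to-byte-offset index over the raw log exists, which the text does not specify.

```typescript
type RangeSource = { from: "mirror" | "log"; start: number; end: number };

// Lines are numbered 0 (oldest ever written) to totalLines - 1 (newest).
// The hot mirror holds only the newest `hotLines` of them.
function planFetch(
  start: number,
  end: number,
  totalLines: number,
  hotLines: number,
): RangeSource[] {
  const hotStart = Math.max(0, totalLines - hotLines);
  const s = Math.max(0, start);
  const e = Math.min(end, totalLines);
  const plan: RangeSource[] = [];

  // Portion older than the hot window comes from the on-disk log.
  const logEnd = Math.min(e, hotStart);
  if (s < logEnd) plan.push({ from: "log", start: s, end: logEnd });

  // Portion inside the hot window comes from the in-RAM mirror.
  const mirrorStart = Math.max(s, hotStart);
  if (mirrorStart < e) plan.push({ from: "mirror", start: mirrorStart, end: e });

  return plan;
}
```

With 10,000 lines written and a 2,000-line hot window, a viewport request for lines 9,000 to 9,050 touches only the mirror, while a scroll to 7,900 to 8,100 is split at line 8,000 between log and mirror.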

Now: interim, until #417 lands

#417 is a multi-PR effort; ship the observability net first so a future leak is diagnosable, not a mystery:

kaval-side diagnostics — un-scrub the snapshot flags (or pass explicit execArgv + diag dir via localKavalDriver) so prod shows the terms/heap curve and dumps a near-limit snapshot.
A soak-test regression guard — assert per-terminal heap stays proportional and bounded, so a scrollback-size or accumulation regression trips the test, not prod.
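The regression guard in the second item could be built around a slope check: spawn terminals in steps, record heap at each step, and fail if the fitted MB-per-terminal exceeds a budget. The sketch below covers only the assertion logic, with hypothetical names (`heapSlopeMbPerTerminal`, `assertPerTerminalBudget`); how the samples are collected (for example by driving createPtyHost and forcing GC) is left out and would be specific to the test harness.

```typescript
type Point = { terminals: number; heapMb: number };

// Least-squares slope: MB of heap added per additional live terminal.
function heapSlopeMbPerTerminal(points: Point[]): number {
  const n = points.length;
  if (n < 2) throw new Error("need at least two samples");
  const meanX = points.reduce((s, p) => s + p.terminals, 0) / n;
  const meanY = points.reduce((s, p) => s + p.heapMb, 0) / n;
  let num = 0;
  let den = 0;
  for (const p of points) {
    num += (p.terminals - meanX) * (p.heapMb - meanY);
    den += (p.terminals - meanX) ** 2;
  }
  if (den === 0) throw new Error("terminal counts must vary");
  return num / den;
}

// Throw if per-terminal heap growth exceeds the budget; return the slope otherwise.
function assertPerTerminalBudget(points: Point[], budgetMb: number): number {
  const slope = heapSlopeMbPerTerminal(points);
  if (slope > budgetMb) {
    throw new Error(`per-terminal heap ${slope.toFixed(2)} MB exceeds budget ${budgetMb} MB`);
  }
  return slope;
}
```

Fitting a slope rather than checking a single total keeps the guard insensitive to baseline heap, so it trips on a scrollback-size regression but not on an unrelated constant-size allocation.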

Current state & open questions

Fixed in #1427. The recurrence risk that drove this RCA is now bounded, not just identified. The change that matters: the server-side mirror shrank from 50 K to a 10 K DEFAULT_MIRROR_SCROLLBACK (decoupled from the client’s 50 K, which PDF export and interactive scrollback still need), which measured as a ~4× higher OOM ceiling. Alongside it ships an observability net: kaval-side @kolu/heap-diag logs the heap/terminal-count curve and arms a near-limit snapshot, so the next approach to the ceiling dumps a snapshot instead of aborting silently. (No explicit heap cap: with the mirror fixed, a cap would only give back the headroom the fix bought; see the note above.)

This raises the ceiling ~4×; the linear-in-count growth remains by design — removing it is the tracked follow-up, not a regression. What remains open:

The on-disk transcript log (#417) + lazy backfill — the real fix that removes the linear-in-count growth and lets the hot mirror shrink further (toward a byte-capped viewport window). #417 also carries its own retention policy (per-terminal disk-log size cap + a privacy off-switch — a legitimate switch, not a degradation knob).
The exact JS path of the kolu-side write ECANCELED — inferred (a floating promise on the pty-host stdio link), not pinned to a verified line. Low priority: the kolu exit is correct behaviour regardless.

Reproduced in-process on naiveintent (heap linear in live-terminal count; the production crash signature at the 50 K mirror). The plan landed in #1421; the fix — small mirror (10 K) + kaval diagnostics — shipped in #1427. The deeper follow-up that removes the linear-in-count growth is #417 (with #416 / #413 as the why-50K backstory).