Kaval PTY-host heap OOM
A recurring production crash: kaval's per-PTY 50 K-line scrollback mirror, multiplied by an unbounded, never-reaped population of live terminals, grows the V8 heap to its ~4 GB ceiling. kaval then self-aborts (SIGABRT) and takes the kolu server down via fail-fast. Reproduced in-process. This document covers the RCA, prior art (tmux/zellij/ghostty/kitty/mosh), and the plan of record: a small hot mirror over an on-disk transcript log (#417).
Production postmortem — pureintent, the always-on kolu.service. Latest crash 2026-06-19 14:26; the fifth with an identical signature since 2026-05-27. Surfaced by a “why did kolu restart?” investigation.
What crashed
Two processes, one cascade. The kaval PTY host — the daemon that holds the live PTY file descriptors for every terminal (kaval/src/bin.ts) — exhausted its V8 JavaScript heap and self-aborted. Losing its only PTY host, the kolu server then fail-fast-exited, and systemd restarted it.
| Time (EDT) | Event | Evidence |
|---|---|---|
| 14:25:49 | kaval GC pinned at the ceiling: Mark-Compact (reduce) 4083.2 → 4058.0 MB, then FATAL ERROR: Ineffective mark-compacts near heap limit - JavaScript heap out of memory | kaval unit journal |
| 14:25:49 | kaval aborts: node::OOMErrorHandler → abort → raise | coredump stack, thread 329839 |
| 14:26:08 | Coredump captured: 732 MB compressed (Signal 6 / ABRT); kaval socket closes | coredumpctl info 329839 |
| 14:26:09.04 | kolu server sees [@kolu/surface/links/stdio] outbound write error: read ECONNRESET, then a storm of pty-host tap subscription failed / terminal.spawn failed | kolu.service journal |
| 14:26:09.07 | kolu FATAL … uncaught exception → Error: write ECANCELED | kolu.service journal |
| 14:26:09 | systemd: Main process exited, status=1/FAILURE → Failed with result 'exit-code' → Scheduled restart job, restart counter is at 1 | systemd[1274] |
| 14:26:09 | New server (serverId 1010bc7d) up with a fresh kaval on the same socket, healthy | systemctl --user show (NRestarts=1) |
This was not a kernel OOM-kill (signal is 6/ABRT, a userspace self-abort — not 9/SIGKILL; journalctl -k had zero OOM lines; the unit’s MemoryMax=infinity), not a deploy (status=1/FAILURE + scheduled restart is a crash, not a clean stop/start), and not disk-full (/ at 78 %, 190 GB free). The kolu exit is causally tied to the kaval death — the ECONNRESET/ECANCELED both originate on the pty-host link and fire in the same ~30 ms the kaval socket closes.
Root cause: live terminals accumulate, each pinning a 50 K-line mirror
Reproduced and confirmed by driving createPtyHost in isolation on a clean box (naiveintent). kaval keeps, per live PTY, an @xterm/headless screen mirror sized at DEFAULT_SCROLLBACK = 50_000 lines (config.ts, passed at ptyHost/index.ts). The heap is linear in live-terminal count and flat under everything else:
| Driver | Heap behaviour |
|---|---|
| 1 terminal, unbounded yes output, 90 s | oscillates 14–64 MB; bounded (scrollback caps it) |
| attach + abort a subscription, tight loop | flat — bounded (abort cleanup works) |
| spawn + write + kill, 8 000× | flat — bounded (teardown is complete) |
| spawn terminals, never kill | linear — ~18 MB V8 heap/terminal at the production 50 K (+ ~44 MB external cell buffers) |
Under a 1 GB old-space cap at the production 50 K scrollback, the host dies with the exact production signature — FATAL ERROR: Reached heap limit Allocation failed - JavaScript heap out of memory — at ~54 fully-scrolled terminals, heap climbing linearly 358 → 970 MB. The abort is an old-space heap event, so the driver is ~18 MB of V8 heap per terminal (the BufferLine / Object / typed-array wrappers); the cell payloads add ~44 MB of external ArrayBuffer memory each — real RSS pressure, but not counted by the heap limit that aborts. A one-terminal snapshot at 10 K scrollback shows ~10 000 each of ArrayBuffer / Uint32Array / xterm BufferLine — the scrollback grid, nothing else — scaling ~5× at the production 50 K.
So the operative answer: heap is proportional to live terminal count (× each terminal’s scrollback fill). A single busy terminal, or terminal/subscription churn, is bounded — activity alone doesn’t grow it. The default ~4 GB old-space ceiling is reached at a couple hundred fully-scrolled terminals (≈ 4 GB ÷ ~18 MB ≈ 220; fewer in practice, summed across partially-filled ones).
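The proportionality claim can be written down as a tiny model. This is a sketch, not kaval code: the function names are hypothetical, it assumes heap scales linearly with scrollback fill (as stated above), and it ignores the small fixed per-terminal overhead.

```typescript
// Hypothetical model: a terminal retains at most `scrollback` lines,
// no matter how much output it has produced.
function retainedLines(linesWritten: number, scrollback: number): number {
  return Math.min(linesWritten, scrollback);
}

// Estimated V8 heap (MB) for a population of terminals.
// Assumption: heap is linear in scrollback fill, and a fully scrolled
// terminal costs `mbPerFullTerminal` (about 18 MB at 50 K in the repro).
function estimateHeapMB(
  linesWrittenPerTerminal: number[],
  scrollback: number,
  mbPerFullTerminal: number,
): number {
  let total = 0;
  for (const written of linesWrittenPerTerminal) {
    const fill = retainedLines(written, scrollback) / scrollback;
    total += fill * mbPerFullTerminal;
  }
  return total;
}

// One very busy terminal stays bounded: output beyond the cap is discarded.
const oneBusyTerminalMB = estimateHeapMB([10_000_000], 50_000, 18);

// 220 fully scrolled terminals approach the ~4 GB ceiling.
const fullPopulation = Array.from({ length: 220 }, () => 50_000);
const populationMB = estimateHeapMB(fullPopulation, 50_000, 18);

console.log({ oneBusyTerminalMB, populationMB });
```

Under these assumptions, activity changes nothing once a terminal is full; only the number of terminals (and how full each one is) moves the total.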
Why the count grows without bound: reconcile never reaps — a surviving kaval’s live PTYs are all adopted across every server restart (the “terminals survive a kolu update” guarantee), and a terminal is freed only when its child process exits or the user explicitly kills it (reconcile.ts). Long-lived shells and agents — across many worktrees, over days, ratcheted up by each crash-restart — accumulate, each pinning a 50 K-line mirror. Not a teardown bug (teardown is clean): an unbounded, never-reaped population × a large per-terminal retainer.
Recurrence is real — same signature on 2026-05-27, 06-13 (825 MB), 06-15 (632 MB), 06-16 (809 MB), 06-19 (732 MB), ~every 2–6 days (coredumpctl list). Red herrings now ruled out: a ~60-RPC attach burst 15 s pre-crash (a single WebSocket disconnect aborting in-flight RPCs); the bounded structures (exit tombstones ≤ 1024, per-subscriber queues ≤ 10 K, exit-waiters) — none leak.
Does this hurt kaval’s performance? — yes, before it ever crashes
This is the part that bites day-to-day, not just at the moment of death. kaval is the hot path for all terminal I/O — every byte in and out of every PTY crosses its event loop. As the heap creeps toward 4 GB, V8’s GC runs more often and for longer, ending in the “ineffective mark-compacts” the crash banner names: long, stop-the-world pauses that are the leak’s tail. While GC holds the loop, kaval can’t relay output, echo keystrokes, or deliver exit signals — so terminals feel progressively laggier the longer the server has been up, worst in the hours before the OOM, then snap back to crisp after the restart-induced fresh heap. So the leak has two costs: a hard crash every few days, and a soft “kolu gets sluggish over time” that a restart silently papers over. (The leak floor that @kolu/heap-diag is tuned for is ~10 MB/min.)
Why we can’t see it yet — the diagnostics gap
kolu already has heap diagnostics: set KOLU_DIAG_DIR and the server writes a baseline snapshot, logs subsystem sizes every 5 min, and arms --heapsnapshot-near-heap-limit=3 so V8 dumps a snapshot just before an OOM (the shared @kolu/heap-diag receptacle that #1427 extracted). The home-manager module already exposes it as services.kolu.diagnostics.dir (module.nix).
But it only instruments the server (Node 22) — the process that doesn’t crash. kaval (Node 24, the one that does) is deliberately excluded:
The leak was named without it (the in-process repro on a clean box did the job), but the gap mattered: with kaval instrumented, prod would have shown the terms-count curve climbing for days, and a near-limit snapshot would confirm the same scrollback grid in the real workload. Now closed in #1427 — localDriver forwards KOLU_DIAG_DIR and kaval’s wrapper arms its own near-limit snapshot under a kaval-private subdir, so the next approach to the (now bounded) ceiling dumps a snapshot instead of a silent abort.
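As an illustration of the kind of heap/terms curve described here, the sketch below samples V8 heap usage against its limit using Node's built-in v8 module. It is not the @kolu/heap-diag API: sampleHeap, formatSample, and the HeapSample shape are hypothetical names chosen for this example, and the live-terminal count is passed in by the caller.

```typescript
import * as v8 from "node:v8";

// Hypothetical sample shape; not the @kolu/heap-diag format.
interface HeapSample {
  usedMB: number;
  limitMB: number;
  headroomPct: number;
  terms: number;
}

// Take one sample of V8 heap usage alongside the live-terminal count.
function sampleHeap(liveTerminals: number): HeapSample {
  const stats = v8.getHeapStatistics();
  const usedMB = stats.used_heap_size / (1024 * 1024);
  const limitMB = stats.heap_size_limit / (1024 * 1024);
  return {
    usedMB,
    limitMB,
    headroomPct: 100 * (1 - usedMB / limitMB),
    terms: liveTerminals,
  };
}

// Render a sample as a single log line suitable for a journal.
function formatSample(s: HeapSample): string {
  return (
    `heap=${s.usedMB.toFixed(1)}MB limit=${s.limitMB.toFixed(0)}MB ` +
    `headroom=${s.headroomPct.toFixed(1)}% terms=${s.terms}`
  );
}

console.log(formatSample(sampleHeap(0)));
```

Logged on an interval, a line like this would make the days-long climb visible well before the near-limit snapshot fires.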
How other terminals & multiplexers bound this
A sweep of the field (ghostty, tmux, zellij, kitty, wezterm, GNU screen, mosh, VTE, zmosh). It’s near-unanimous, and kaval is the outlier — a 50 K-line server-side mirror is 5–50× everyone else:
| System | Default history | Bounding | Reattach restores |
|---|---|---|---|
| kaval (today) | 50 000 lines / PTY | none — no cap, no reap | full mirror, eagerly serialized |
| GNU screen | 100 lines | line cap | viewport (RAM) |
| xterm.js / headless | 1 000 rows (library default — kolu overrides to 50 K) | ring buffer | ANSI snapshot (SerializeAddon) |
| tmux | 2 000 lines / pane | line cap + batch trim | viewport; scrollback lazily in copy-mode |
| kitty | 2 000 in-RAM | overflow spills to a temp file → pager | n/a (emulator) |
| wezterm-mux | 3 500 lines | line cap (uncompressed) | viewport; lazily by range (GetLines RPC) |
| zellij | 10 000 lines | ring + serialize panes to disk | cold: from disk, not RAM |
| ghostty | 10 MB (bytes) | byte cap + page-trim, ~12.5 B/cell | n/a (emulator) |
| zmosh | ghostty-vt byte cap | ring evict | serialize VT (scrollback + viewport) |
| mosh | 0 — viewport only | keeps no history at all | the live screen, ever |
| VTE (GNOME) | “infinite” | LZ4 + AES disk ring, near-0 resident (hot pages only) | n/a (emulator) |
Three lessons that reshape the fix:
- Nobody holds deep history as live cell-objects: it’s a small ring (tmux 2 K) or pushed off the hot path to disk (zellij / kitty / VTE, the last LZ4-compressing its disk ring). kaval’s 50 K of live BufferLine objects is the anomaly. (In V8 the killer is object-header + GC pressure, not raw bytes, so deep history wants to be a compressed Buffer or a file, never live xterm lines.)
- No multiplexer eagerly serializes the whole mirror on reattach: tmux / wezterm repaint the viewport and stream older lines lazily, by range, on scroll. kaval’s attach() → full-buffer serialize() is exactly the avoidable cost.
- Cap by bytes, not lines: a wide blank line still costs, and agent streaming is the real-world OOM driver elsewhere too (tmux #4859 ≈ 48 GB, ghostty 37 GB), not deep interactive history.
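The "cap by bytes, not lines" lesson can be sketched as a ring that evicts its oldest chunks once a byte budget is exceeded. This is an illustrative data structure only, with a hypothetical class name; it is not how ghostty or kaval implement their buffers, and it evicts whole chunks rather than trimming pages.

```typescript
// Hypothetical byte-capped history buffer: memory is bounded by bytes,
// regardless of how many lines or how wide each line is.
class ByteCappedRing {
  private chunks: Uint8Array[] = [];
  private bytes = 0;
  private readonly maxBytes: number;

  constructor(maxBytes: number) {
    this.maxBytes = maxBytes;
  }

  push(chunk: Uint8Array): void {
    // A chunk at or over the budget replaces everything; keep only its tail.
    if (chunk.byteLength >= this.maxBytes) {
      this.chunks = [chunk.slice(chunk.byteLength - this.maxBytes)];
      this.bytes = this.maxBytes;
      return;
    }
    this.chunks.push(chunk);
    this.bytes += chunk.byteLength;
    // Evict oldest chunks until the total is back under the budget.
    while (this.bytes > this.maxBytes) {
      const oldest = this.chunks.shift()!;
      this.bytes -= oldest.byteLength;
    }
  }

  get size(): number {
    return this.bytes;
  }

  get chunkCount(): number {
    return this.chunks.length;
  }
}
```

Because the budget is expressed in bytes, a stream of very wide lines and a stream of short ones hit the same ceiling, which a line-count cap cannot guarantee.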
The fix — a small hot mirror over an on-disk transcript log
The 50 K was never for scrolling. #416 bumped it from 10 K for PDF export (#413) — so a naive shrink would regress export. And the real fix is already designed: #417 · server-side transcript log, the on-disk source of truth that #416 explicitly called itself a bandaid for. This RCA promotes #417 from a features ticket to the memory fix — and corrects its non-goal #1 (“the ~4 KB attach snapshot is already optimal”): attach() serializes the whole buffer with no scrollback limit (ptyHost.ts:571), so that path is the cost, not a constant.
The plan of record — which is also just what the field does (small hot buffer + deep history off the hot path + lazy backfill):
1. Keep the in-RAM mirror small & constant
Size the headless mirror to what the live jobs actually need — viewport + a small window for the metadata OSC handlers, device-query replies, screen-scrape tail, and cold-attach repaint — capped by bytes, not 50 K lines. Shells are never reaped — the survivability guarantee is untouched; a small mirror makes an idle terminal nearly free, which also dissolves the “should we reap idle terminals?” tension.
Measured, not extrapolated — re-running the accumulation repro under a fixed 512 MB old-space cap at three mirror sizes:
| Mirror scrollback | V8 heap / terminal | Terms → OOM (512 MB cap) | → at the 4 GB ceiling |
|---|---|---|---|
| 50 K (today) | ~16 MB | 32 | ~256 terminals |
| 10 K | ~3.9 MB | 130 (4.1×) | ~1,050 |
| 2 K | ~1.0 MB | ≥396 — no OOM (≥12×) | ~4,000 |
A ~2 K mirror turns a ~256-terminal ceiling into ~4,000 (and cuts external ArrayBuffer RSS ~15× too). The gain is sub-linear — a 25× line cut buys ~16×, not 25× — because a fixed ~0.5–1 MB/terminal floor (node-pty handle, Entry, channels, the Terminal instance) doesn’t shrink. Two honest bounds: the cap only helps terminals that exceed it (a 500-line terminal is unaffected — but the deep-scrollback agent terminals that do drive the OOM benefit fully), and it raises the ceiling rather than removing the linear-in-count growth (that’s what #2/#3 + reaping are for).
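The ceiling projections in the table are plain division, and the sub-linear gain falls out of the measured per-terminal figures. The sketch below reproduces that arithmetic; the function names are hypothetical, the per-terminal values are the measured ones from the table, and "4 GB" is assumed to mean 4096 MB of old space.

```typescript
// Measured V8 heap per terminal (MB), taken from the table above.
interface MirrorMeasurement {
  scrollbackLines: number;
  heapPerTerminalMB: number;
}

const measurements: MirrorMeasurement[] = [
  { scrollbackLines: 50_000, heapPerTerminalMB: 16 },
  { scrollbackLines: 10_000, heapPerTerminalMB: 3.9 },
  { scrollbackLines: 2_000, heapPerTerminalMB: 1.0 },
];

// Assumption: the default ceiling is treated as 4096 MB.
const CEILING_MB = 4096;

// How many fully scrolled terminals fit under the ceiling.
function terminalsAtCeiling(
  heapPerTerminalMB: number,
  ceilingMB: number = CEILING_MB,
): number {
  return Math.floor(ceilingMB / heapPerTerminalMB);
}

// How much further the ceiling moves relative to a baseline mirror size.
function gainOverBaseline(baselineMB: number, candidateMB: number): number {
  return baselineMB / candidateMB;
}

for (const m of measurements) {
  const terminals = terminalsAtCeiling(m.heapPerTerminalMB);
  const gain = gainOverBaseline(16, m.heapPerTerminalMB);
  console.log(
    `${m.scrollbackLines} lines: ~${terminals} terminals (${gain.toFixed(1)}x)`,
  );
}
```

The 2 K row shows the floor at work: lines drop 25×, but heap per terminal only drops 16×, because the fixed per-terminal overhead does not shrink with scrollback.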
2. Deep history → the on-disk transcript log (#417)
Append every PTY byte to a per-terminal log on disk — raw bytes, the honest replayable source (rendered/serialized state is lossy). PDF export (the reason 50 K exists), scrollback search, true session restore, and crash forensics all read from it; depth is bounded by a disk retention policy, not RAM. This is #417 as already specified — reuse it, don’t build a parallel store.
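A minimal sketch of the append-only idea, assuming one log file per terminal and a size-based retention check. None of this is the #417 design: the helper names, the file naming scheme, and the retention rule are all hypothetical, and only Node's built-in fs, os, and path modules are used.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Hypothetical naming scheme: one raw-byte log per terminal id.
function transcriptPathFor(dir: string, terminalId: string): string {
  return path.join(dir, `${terminalId}.log`);
}

// Append one raw PTY chunk, unmodified, so the log stays replayable.
function appendTranscript(logPath: string, chunk: Uint8Array): void {
  fs.appendFileSync(logPath, chunk);
}

// Hypothetical retention check: true once the log exceeds its disk budget.
function exceedsRetention(logPath: string, maxBytes: number): boolean {
  return fs.statSync(logPath).size > maxBytes;
}

// Demo: write two chunks to a throwaway directory.
const demoDir = fs.mkdtempSync(path.join(os.tmpdir(), "transcript-demo-"));
const demoLog = transcriptPathFor(demoDir, "term-1");
const encoder = new TextEncoder();
appendTranscript(demoLog, encoder.encode("$ ls\r\n"));
appendTranscript(demoLog, encoder.encode("README.md\r\n"));
console.log(fs.statSync(demoLog).size, "bytes logged");
```

The synchronous append keeps the sketch short; since kaval is on the hot path for all terminal I/O, a real implementation would more likely use a buffered write stream so disk latency never blocks the event loop.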
3. Lazy backfill on deep scroll
Stop eager-serializing the mirror on attach. Repaint the viewport from the small mirror, then when a client scrolls past the hot window, fetch older ranges from the log and render them (the tmux / wezterm pattern). Cold reconnect becomes cheap and lossless.
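Fetching older history by range might look like the sketch below: index where each line starts in the log, then decode only the requested window. The function names are hypothetical, and the sketch makes a loud simplification: it treats the log as newline-delimited text, whereas a real PTY transcript contains escape sequences and would need to be replayed through a terminal parser.

```typescript
// Build an index of byte offsets where each line begins.
// Assumption: lines are delimited by "\n" (0x0a) in the raw log.
function indexLineStarts(log: Uint8Array): number[] {
  const starts = [0];
  for (let i = 0; i < log.length; i++) {
    if (log[i] === 0x0a && i + 1 < log.length) {
      starts.push(i + 1);
    }
  }
  return starts;
}

// Decode only lines [firstLine, firstLine + count), clamped to the log.
function readLineRange(
  log: Uint8Array,
  starts: number[],
  firstLine: number,
  count: number,
): string[] {
  const decoder = new TextDecoder();
  const lines: string[] = [];
  const end = Math.min(firstLine + count, starts.length);
  for (let n = Math.max(0, firstLine); n < end; n++) {
    const from = starts[n];
    const to = n + 1 < starts.length ? starts[n + 1] : log.length;
    // Strip the trailing line terminator before returning the line.
    lines.push(decoder.decode(log.subarray(from, to)).replace(/\r?\n$/, ""));
  }
  return lines;
}
```

A client that scrolls past the hot window would request only the range it needs, so attach cost no longer depends on how deep the history is.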
Now: Interim, until #417 lands
#417 is a multi-PR effort; ship the observability net first so a future leak is diagnosable, not a mystery:
- kaval-side diagnostics: un-scrub the snapshot flags (or pass explicit execArgv + diag dir via localKavalDriver) so prod shows the terms/heap curve and dumps a near-limit snapshot.
- A soak-test regression guard: assert per-terminal heap stays proportional and bounded, so a scrollback-size or accumulation regression trips the test, not prod.
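One way to express the soak-test guard is to fit a line through (terminal count, heap) samples and fail when the slope exceeds a per-terminal budget. This is a sketch under stated assumptions: the function names and the budget are hypothetical, and collecting the samples (spawning terminals, forcing GC, reading heap) is left to the test harness.

```typescript
interface SoakSample {
  terms: number;
  heapMB: number;
}

// Least-squares slope of heap against terminal count, in MB per terminal.
function heapSlopeMBPerTerminal(samples: SoakSample[]): number {
  const n = samples.length;
  if (n < 2) throw new Error("need at least two samples");
  const meanX = samples.reduce((sum, s) => sum + s.terms, 0) / n;
  const meanY = samples.reduce((sum, s) => sum + s.heapMB, 0) / n;
  let num = 0;
  let den = 0;
  for (const s of samples) {
    num += (s.terms - meanX) * (s.heapMB - meanY);
    den += (s.terms - meanX) ** 2;
  }
  if (den === 0) throw new Error("samples must span different terminal counts");
  return num / den;
}

// Throw if per-terminal heap growth exceeds the budget.
function assertPerTerminalBudget(samples: SoakSample[], budgetMB: number): void {
  const slope = heapSlopeMBPerTerminal(samples);
  if (slope > budgetMB) {
    throw new Error(
      `per-terminal heap ${slope.toFixed(2)} MB exceeds budget ${budgetMB} MB`,
    );
  }
}
```

A guard like this would trip if someone raised the mirror scrollback again, because the slope tracks per-terminal cost directly rather than total heap at one point in time.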
Current state & open questions
Fixed in #1427. The recurrence risk that drove this RCA is now bounded, not just identified. The change that matters: the server-side mirror shrank from 50 K to a 10 K DEFAULT_MIRROR_SCROLLBACK (decoupled from the client’s 50 K, which PDF export + interactive scrollback still need), raising the measured OOM ceiling ~4×. Alongside it ships an observability net: kaval-side @kolu/heap-diag logs the heap/terms curve and arms a near-limit snapshot, so the next approach to the ceiling dumps a snapshot instead of a silent abort. (No explicit heap cap: with the mirror fixed, a cap would only give back the headroom the fix bought; see the note above.)
This raises the ceiling ~4×; the linear-in-count growth remains by design — removing it is the tracked follow-up, not a regression. What remains open:
- The on-disk transcript log (#417) + lazy backfill — the real fix that removes the linear-in-count growth and lets the hot mirror shrink further (toward a byte-capped viewport window). #417 also carries its own retention policy (per-terminal disk-log size cap + a privacy off-switch — a legitimate switch, not a degradation knob).
- The exact JS path of the kolu-side write ECANCELED: inferred (a floating promise on the pty-host stdio link), not pinned to a verified line. Low priority: the kolu exit is correct behaviour regardless.
Reproduced in-process on naiveintent (heap linear in live-terminal count; the production crash signature at the 50 K mirror). The plan landed in #1421; the fix — small mirror (10 K) + kaval diagnostics — shipped in #1427. The deeper follow-up that removes the linear-in-count growth is #417 (with #416 / #413 as the why-50K backstory).