
Kaval PTY-host heap OOM

Analysis · budding · implemented

A recurring production crash: kaval’s per-PTY 50 K-line scrollback mirror, multiplied by an unbounded, never-reaped live-terminal population, grows the V8 heap to its ~4 GB ceiling; kaval then self-aborts (SIGABRT) and takes the kolu server down via fail-fast. Reproduced in-process. This page covers the RCA, the prior art (tmux/zellij/ghostty/kitty/mosh), and the plan of record: a small hot mirror over an on-disk transcript log (#417).

Production postmortem — pureintent, the always-on kolu.service. Latest crash 2026-06-19 14:26; the fifth with an identical signature since 2026-05-27. Surfaced by a “why did kolu restart?” investigation.

[Figure: KOLU_DIAG_DIR diagnostics reach. home-manager sets services.kolu.diagnostics.dir → KOLU_DIAG_DIR (server only), passed via ExecStart + env to the kolu server (Node 22.22.1), where diagnostics.ts writes heap snapshots of the server heap. On spawn, localDriver scrubNodeOptions() strips the heap-snapshot flag, so the diagnostics flags are severed there; @kolu/surface-daemon-supervisor then spawns kaval as a transient unit via systemd-run --user. kaval (PTY host, Node 24.13.0) is the OOM site and runs blind: no heap limit, no diagnostics. The two processes are linked by pty-host.sock.]
Where the leak lives, and why kaval ran blind: the home-manager diagnostics option reaches the server only; localDriver strips the heap-snapshot flag from NODE_OPTIONS before kaval — the one process that actually OOMs.

What crashed

Two processes, one cascade. The kaval PTY host — the daemon that holds the live PTY file descriptors for every terminal (kaval/src/bin.ts) — exhausted its V8 JavaScript heap and self-aborted. Losing its only PTY host, the kolu server then fail-fast-exited, and systemd restarted it.

| Time (EDT) | Event | Evidence |
| --- | --- | --- |
| 14:25:49 | kaval GC pinned at the ceiling: Mark-Compact (reduce) 4083.2 → 4058.0 MB, then FATAL ERROR: Ineffective mark-compacts near heap limit - JavaScript heap out of memory | kaval unit journal |
| 14:25:49 | kaval aborts: node::OOMErrorHandler → abort → raise | coredump stack, thread 329839 |
| 14:26:08 | Coredump captured: 732 MB compressed (Signal 6 / ABRT); kaval socket closes | coredumpctl info 329839 |
| 14:26:09.04 | kolu server sees [@kolu/surface/links/stdio] outbound write error: read ECONNRESET, then a storm of pty-host tap subscription failed / terminal.spawn failed | kolu.service journal |
| 14:26:09.07 | kolu FATAL … uncaught exception, Error: write ECANCELED | kolu.service journal |
| 14:26:09 | systemd: Main process exited, status=1/FAILURE; Failed with result 'exit-code'; Scheduled restart job, restart counter is at 1 | systemd[1274] |
| 14:26:09 | New server (serverId 1010bc7d) up with a fresh kaval on the same socket, healthy | systemctl --user show (NRestarts=1) |

This was not a kernel OOM-kill (signal is 6/ABRT, a userspace self-abort — not 9/SIGKILL; journalctl -k had zero OOM lines; the unit’s MemoryMax=infinity), not a deploy (status=1/FAILURE + scheduled restart is a crash, not a clean stop/start), and not disk-full (/ at 78 %, 190 GB free). The kolu exit is causally tied to the kaval death — the ECONNRESET/ECANCELED both originate on the pty-host link and fire in the same ~30 ms the kaval socket closes.
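The triage in the paragraph above can be written down as a small decision function. This is a hedged sketch, not kolu code: the `CrashEvidence` shape and the `classifyCrash` name are hypothetical, and the inputs are just the facts the postmortem checked (terminating signal, kernel OOM lines from journalctl -k, and whether the V8 heap banner appeared in the unit journal).

```typescript
// Hypothetical evidence record; field names are illustrative, not from kolu.
interface CrashEvidence {
  signal: number | null;   // terminating signal from the coredump (6 = SIGABRT, 9 = SIGKILL)
  kernelOomLines: number;  // count of OOM-killer lines in the kernel journal
  v8HeapBanner: boolean;   // "JavaScript heap out of memory" seen in the unit journal
}

type CrashKind = "v8-heap-abort" | "kernel-oom-kill" | "other-abort" | "unknown";

function classifyCrash(e: CrashEvidence): CrashKind {
  // The kernel OOM killer sends SIGKILL and leaves a trace in the kernel log.
  if (e.signal === 9 && e.kernelOomLines > 0) return "kernel-oom-kill";
  // A V8 heap exhaustion is a userspace self-abort: SIGABRT plus the fatal banner.
  if (e.signal === 6 && e.v8HeapBanner) return "v8-heap-abort";
  if (e.signal === 6) return "other-abort";
  return "unknown";
}

// The 2026-06-19 evidence: Signal 6, zero kernel OOM lines, banner present.
console.log(classifyCrash({ signal: 6, kernelOomLines: 0, v8HeapBanner: true })); // v8-heap-abort
```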

Root cause: live terminals accumulate, each pinning a 50 K-line mirror

Reproduced and confirmed by driving createPtyHost in isolation on a clean box (naiveintent). kaval keeps, per live PTY, an @xterm/headless screen mirror sized at DEFAULT_SCROLLBACK = 50_000 lines (config.ts, passed at ptyHost/index.ts). The heap is linear in live-terminal count and flat under everything else:

| Driver | Heap behaviour |
| --- | --- |
| 1 terminal, unbounded yes output, 90 s | oscillates 14–64 MB; bounded (scrollback caps it) |
| attach + abort a subscription, tight loop | flat; bounded (abort cleanup works) |
| spawn + write + kill, 8 000× | flat; bounded (teardown is complete) |
| spawn terminals, never kill | linear: ~18 MB V8 heap/terminal at the production 50 K (+ ~44 MB external cell buffers) |

Under a 1 GB old-space cap at the production 50 K scrollback, the host dies with the exact production signature, FATAL ERROR: Reached heap limit Allocation failed - JavaScript heap out of memory, at ~54 fully-scrolled terminals, with the heap climbing linearly 358 → 970 MB. The abort is an old-space heap event, so the driver is ~18 MB of V8 heap per terminal (the BufferLine / Object / typed-array wrappers); the cell payloads add ~44 MB of external ArrayBuffer memory each, which is real RSS pressure but is not counted by the heap limit that aborts. A one-terminal snapshot at 10 K scrollback shows ~10 000 each of ArrayBuffer / Uint32Array / xterm BufferLine (the scrollback grid, nothing else), scaling ~5× at the production 50 K.

So the operative answer: heap is proportional to live terminal count (× each terminal’s scrollback fill). A single busy terminal, or terminal/subscription churn, is bounded — activity alone doesn’t grow it. The default ~4 GB old-space ceiling is reached at a couple hundred fully-scrolled terminals (≈ 4 GB ÷ ~18 MB ≈ 220; fewer in practice, summed across partially-filled ones).
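As a rough model of that proportionality, the sketch below sums an estimated heap cost over live terminals. It assumes, for illustration only, that per-terminal heap scales linearly with scrollback fill and ignores any fixed per-terminal overhead; the ~18 MB figure is the measured one above, while `estimateHeapMB` and the constant names are hypothetical.

```typescript
// Assumptions for this sketch (not taken from kolu's code): heap cost is linear
// in scrollback fill, and a full 50 K mirror costs ~18 MB of V8 heap.
const SCROLLBACK_LINES = 50_000;
const HEAP_MB_PER_FULL_TERMINAL = 18;

/** Estimated V8 heap (MB) for a set of live terminals, given lines written by each. */
function estimateHeapMB(linesPerTerminal: number[]): number {
  return linesPerTerminal.reduce((total, lines) => {
    // The ring buffer caps retained lines, so fill never exceeds 100 %.
    const fill = Math.min(lines, SCROLLBACK_LINES) / SCROLLBACK_LINES;
    return total + fill * HEAP_MB_PER_FULL_TERMINAL;
  }, 0);
}

// One terminal printing forever stays at one terminal's worth of heap.
console.log(estimateHeapMB([10_000_000])); // 18

// 220 fully scrolled terminals sit just under a 4096 MB ceiling.
console.log(estimateHeapMB(new Array<number>(220).fill(50_000))); // 3960
```

The model makes the two regimes visible: output volume on a single terminal saturates, while each additional live terminal adds its own term to the sum.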

Why the count grows without bound: reconcile never reaps — a surviving kaval’s live PTYs are all adopted across every server restart (the “terminals survive a kolu update” guarantee), and a terminal is freed only when its child process exits or the user explicitly kills it (reconcile.ts). Long-lived shells and agents — across many worktrees, over days, ratcheted up by each crash-restart — accumulate, each pinning a 50 K-line mirror. Not a teardown bug (teardown is clean): an unbounded, never-reaped population × a large per-terminal retainer.

Recurrence is real — same signature on 2026-05-27, 06-13 (825 MB), 06-15 (632 MB), 06-16 (809 MB), 06-19 (732 MB), ~every 2–6 days (coredumpctl list). Red herrings now ruled out: a ~60-RPC attach burst 15 s pre-crash (a single WebSocket disconnect aborting in-flight RPCs); the bounded structures (exit tombstones ≤ 1024, per-subscriber queues ≤ 10 K, exit-waiters) — none leak.

Does this hurt kaval’s performance? — yes, before it ever crashes

This is the part that bites day-to-day, not just at the moment of death. kaval is the hot path for all terminal I/O — every byte in and out of every PTY crosses its event loop. As the heap creeps toward 4 GB, V8’s GC runs more often and for longer, ending in the “ineffective mark-compacts” the crash banner names: long, stop-the-world pauses that are the leak’s tail. While GC holds the loop, kaval can’t relay output, echo keystrokes, or deliver exit signals — so terminals feel progressively laggier the longer the server has been up, worst in the hours before the OOM, then snap back to crisp after the restart-induced fresh heap. So the leak has two costs: a hard crash every few days, and a soft “kolu gets sluggish over time” that a restart silently papers over. (The leak floor that @kolu/heap-diag is tuned for is ~10 MB/min.)
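For a sense of scale, the average growth rate implied by the crash cadence can be estimated with simple arithmetic. The sketch assumes the heap restarts near empty and climbs roughly linearly to the ~4 GB ceiling over the observed 2 to 6 days; both are simplifications, and the function name is hypothetical.

```typescript
// Back-of-envelope only: linear growth from ~0 to the ceiling is an assumption.
const HEAP_CEILING_MB = 4096;
const MINUTES_PER_DAY = 24 * 60;

function averageGrowthMBPerMin(daysToOom: number): number {
  return HEAP_CEILING_MB / (daysToOom * MINUTES_PER_DAY);
}

const fastest = averageGrowthMBPerMin(2); // crash after 2 days
const slowest = averageGrowthMBPerMin(6); // crash after 6 days
console.log(fastest.toFixed(2), slowest.toFixed(2)); // 1.42 0.47
```

If those assumptions hold, the average slope is on the order of 0.5 to 1.5 MB/min, which sits well under the ~10 MB/min figure quoted above.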

Why we can’t see it yet — the diagnostics gap

kolu already has heap diagnostics: set KOLU_DIAG_DIR and the server writes a baseline snapshot, logs subsystem sizes every 5 min, and arms --heapsnapshot-near-heap-limit=3 so V8 dumps a snapshot just before an OOM (the shared @kolu/heap-diag receptacle #1427 extracted, heap-diag). The home-manager module already exposes it as services.kolu.diagnostics.dir (module.nix).

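But it only instruments the server (Node 22), the process that doesn’t crash. kaval (Node 24, the one that does) is deliberately excluded: the home-manager option reaches the server only, and localDriver’s scrubNodeOptions() strips the heap-snapshot flag from NODE_OPTIONS before kaval is spawned.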

The leak was named without it (the in-process repro on a clean box did the job), but the gap mattered: with kaval instrumented, prod would have shown the terms-count curve climbing for days, and a near-limit snapshot would have confirmed the same scrollback grid in the real workload. Now closed in #1427: localDriver forwards KOLU_DIAG_DIR and kaval’s wrapper arms its own near-limit snapshot under a kaval-private subdir, so the next approach to the (now bounded) ceiling dumps a snapshot instead of a silent abort.
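The shape of that gap and its closure might look like the following. This is a sketch under stated assumptions, not the code from #1427: the function bodies, the `kaval` subdirectory name, the snapshot count of 3, and the example path are all hypothetical; only the `--heapsnapshot-near-heap-limit` Node flag and the `KOLU_DIAG_DIR` variable come from the text.

```typescript
type Env = Record<string, string | undefined>;

const SNAPSHOT_FLAG = "--heapsnapshot-near-heap-limit";

// Stand-in for the scrubbing step: drop the snapshot flag from NODE_OPTIONS
// so the child process does not inherit the server's setting.
function scrubNodeOptions(nodeOptions: string): string {
  return nodeOptions
    .split(/\s+/)
    .filter((opt) => opt !== "" && !opt.startsWith(SNAPSHOT_FLAG))
    .join(" ");
}

// Driver side after the fix: still scrub NODE_OPTIONS, but let KOLU_DIAG_DIR through.
function buildKavalEnv(parent: Env): Env {
  return { ...parent, NODE_OPTIONS: scrubNodeOptions(parent.NODE_OPTIONS ?? "") };
}

// Wrapper side: if diagnostics are enabled, pick a kaval-private directory and
// arm kaval's own near-limit snapshot. Subdir name and count are assumptions;
// how the wrapper points V8 at that directory is not specified in the source.
function kavalDiagnostics(env: Env): { dir: string; nodeArgs: string[] } | null {
  const base = env.KOLU_DIAG_DIR;
  if (!base) return null;
  const dir = `${base.replace(/\/+$/, "")}/kaval`;
  return { dir, nodeArgs: [`${SNAPSHOT_FLAG}=3`] };
}

const kavalEnv = buildKavalEnv({
  KOLU_DIAG_DIR: "/var/lib/kolu/diag", // example path, not the real one
  NODE_OPTIONS: "--enable-source-maps --heapsnapshot-near-heap-limit=3",
});
console.log(kavalEnv.NODE_OPTIONS);           // --enable-source-maps
console.log(kavalDiagnostics(kavalEnv)?.dir); // /var/lib/kolu/diag/kaval
```

Keeping the scrub and re-arming in the wrapper means kaval's snapshots never collide with the server's, since each process writes under its own directory.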

How other terminals & multiplexers bound this

A sweep of the field (ghostty, tmux, zellij, kitty, wezterm, GNU screen, mosh, VTE, zmosh). It’s near-unanimous, and kaval is the outlier — a 50 K-line server-side mirror is 5–50× everyone else:

| System | Default history | Bounding | Reattach restores |
| --- | --- | --- | --- |
| kaval (today) | 50 000 lines / PTY | none: no cap, no reap | full mirror, eagerly serialized |
| GNU screen | 100 lines | line cap | viewport (RAM) |
| xterm.js / headless | 1 000 rows (library default; kolu overrides to 50 K) | ring buffer | ANSI snapshot (SerializeAddon) |
| tmux | 2 000 lines / pane | line cap + batch trim | viewport; scrollback lazily in copy-mode |
| kitty | 2 000 in-RAM | overflow spills to a temp file → pager | n/a (emulator) |
| wezterm-mux | 3 500 lines | line cap (uncompressed) | viewport; lazily by range (GetLines RPC) |
| zellij | 10 000 lines | ring + serialize panes to disk | cold: from disk, not RAM |
| ghostty | 10 MB (bytes) | byte cap + page-trim, ~12.5 B/cell | n/a (emulator) |
| zmosh | ghostty-vt byte cap | ring evict | serialize VT (scrollback + viewport) |
| mosh | 0 (viewport only) | keeps no history at all | the live screen, ever |
| VTE (GNOME) | “infinite” | LZ4 + AES disk ring, near-0 resident (hot pages only) | n/a (emulator) |

Three lessons that reshape the fix:

  1. Nobody holds deep history as live cell-objects — it’s a small ring (tmux 2 K) or pushed off the hot path to disk (zellij / kitty / VTE, the last LZ4-compressing its disk ring). kaval’s 50 K of live BufferLine objects is the anomaly. (In V8 the killer is object-header + GC pressure, not raw bytes — so deep history wants to be a compressed Buffer or a file, never live xterm lines.)
  2. No multiplexer eagerly serializes the whole mirror on reattach — tmux / wezterm repaint the viewport and stream older lines lazily, by range, on scroll. kaval’s attach() → full-buffer serialize() is exactly the avoidable cost.
  3. Cap by bytes, not lines — a wide blank line still costs, and agent streaming is the real-world OOM driver elsewhere too (tmux #4859 ≈ 48 GB, ghostty 37 GB), not deep interactive history.
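The third lesson is easy to see in code. Below is a minimal sketch of a byte-capped scrollback, assuming each line is held as a `Uint8Array`; the class and its names are illustrative and are not how any of the tools above implement it.

```typescript
// A minimal byte-capped scrollback: evicts the oldest lines until under the cap.
// Array.shift() is O(n); a production ring would use a deque or circular buffer.
class ByteCappedScrollback {
  private lines: Uint8Array[] = [];
  private bytes = 0;
  private readonly capBytes: number;

  constructor(capBytes: number) {
    this.capBytes = capBytes;
  }

  push(line: Uint8Array): void {
    this.lines.push(line);
    this.bytes += line.byteLength;
    // Evict from the front, but always keep at least the newest line.
    while (this.bytes > this.capBytes && this.lines.length > 1) {
      const evicted = this.lines.shift()!;
      this.bytes -= evicted.byteLength;
    }
  }

  get lineCount(): number {
    return this.lines.length;
  }

  get byteSize(): number {
    return this.bytes;
  }
}

const ring = new ByteCappedScrollback(1024);
for (let i = 0; i < 100; i++) ring.push(new Uint8Array(200)); // 100 wide lines, 200 B each
console.log(ring.lineCount, ring.byteSize); // 5 1000
```

A line cap of 100 would have kept all 20 000 bytes here; the byte cap keeps memory constant regardless of how wide the lines are.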

The fix — a small hot mirror over an on-disk transcript log

The 50 K was never for scrolling. #416 bumped it from 10 K for PDF export (#413) — so a naive shrink would regress export. And the real fix is already designed: #417 · server-side transcript log, the on-disk source of truth that #416 explicitly called itself a bandaid for. This RCA promotes #417 from a features ticket to the memory fix — and corrects its non-goal #1 (“the ~4 KB attach snapshot is already optimal”): attach() serializes the whole buffer with no scrollback limit (ptyHost.ts:571), so that path is the cost, not a constant.

The plan of record — which is also just what the field does (small hot buffer + deep history off the hot path + lazy backfill):

[Figure: target shape. The PTY child (node-pty) emits raw bytes. Hot path, in RAM: a small, byte-capped mirror (not 50 K) parses the live stream and serves viewport, metadata, scrape, and repaint; clients attach to it for viewport + deltas. Cold store, on disk: the transcript log (#417) appends every raw PTY byte (append-only, retention-capped deep store) and serves lazy backfill for scroll, PDF, and search. The browser xterm keeps its own visible scrollback.]
Target shape. Every PTY byte feeds a small live mirror (viewport + just enough for metadata, scrape, cold repaint — capped by bytes) and appends to an on-disk transcript log (#417). Clients attach against the small mirror; deep scroll-back, PDF export, and search lazily backfill from the log — no live 50 K cell-grid retained per terminal, anywhere.

1 Keep the in-RAM mirror small & constant

Size the headless mirror to what the live jobs actually need — viewport + a small window for the metadata OSC handlers, device-query replies, screen-scrape tail, and cold-attach repaint — capped by bytes, not 50 K lines. Shells are never reaped — the survivability guarantee is untouched; a small mirror makes an idle terminal nearly free, which also dissolves the “should we reap idle terminals?” tension.

Measured, not extrapolated — re-running the accumulation repro under a fixed 512 MB old-space cap at three mirror sizes:

| Mirror scrollback | V8 heap / terminal | Terms → OOM (512 MB cap) | → at the 4 GB ceiling |
| --- | --- | --- | --- |
| 50 K (today) | ~16 MB | 32 | ~256 terminals |
| 10 K | ~3.9 MB | 130 (4.1×) | ~1,050 |
| 2 K | ~1.0 MB | ≥396, no OOM (≥12×) | ~4,000 |

A ~2 K mirror turns a ~256-terminal ceiling into ~4,000 (and cuts external ArrayBuffer RSS ~15× too). The gain is sub-linear — a 25× line cut buys ~16×, not 25× — because a fixed ~0.5–1 MB/terminal floor (node-pty handle, Entry, channels, the Terminal instance) doesn’t shrink. Two honest bounds: the cap only helps terminals that exceed it (a 500-line terminal is unaffected — but the deep-scrollback agent terminals that do drive the OOM benefit fully), and it raises the ceiling rather than removing the linear-in-count growth (that’s what #2/#3 + reaping are for).
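The extrapolated column and the sub-linear gain follow directly from the measured per-terminal figures. The sketch below redoes that arithmetic; the per-terminal values are the measured ones from the table, while the 4096 MB ceiling and the helper names are assumptions of the sketch.

```typescript
interface Measurement {
  scrollbackLines: number;
  heapMBPerTerminal: number;
}

// Measured per-terminal V8 heap at three mirror sizes (from the table).
const measured: Measurement[] = [
  { scrollbackLines: 50_000, heapMBPerTerminal: 16 },
  { scrollbackLines: 10_000, heapMBPerTerminal: 3.9 },
  { scrollbackLines: 2_000, heapMBPerTerminal: 1.0 },
];

const OLD_SPACE_CEILING_MB = 4096; // taking "~4 GB" as 4096 MB

function terminalsAtCeiling(heapMBPerTerminal: number): number {
  return Math.floor(OLD_SPACE_CEILING_MB / heapMBPerTerminal);
}

function gainOver(baseline: Measurement, m: Measurement) {
  return {
    lineCut: baseline.scrollbackLines / m.scrollbackLines,
    ceilingGain: baseline.heapMBPerTerminal / m.heapMBPerTerminal,
  };
}

const today = measured[0];
for (const m of measured) {
  const { lineCut, ceilingGain } = gainOver(today, m);
  console.log(
    `${m.scrollbackLines} lines: ~${terminalsAtCeiling(m.heapMBPerTerminal)} terminals, ` +
      `line cut ${lineCut}x, ceiling gain ${ceilingGain.toFixed(1)}x`,
  );
}
// 50000 lines: ~256 terminals, line cut 1x, ceiling gain 1.0x
// 10000 lines: ~1050 terminals, line cut 5x, ceiling gain 4.1x
// 2000 lines: ~4096 terminals, line cut 25x, ceiling gain 16.0x
```

The gap between the line cut (25×) and the ceiling gain (16×) at 2 K is the fixed per-terminal floor showing through.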

2 Deep history → the on-disk transcript log (#417)

Append every PTY byte to a per-terminal log on disk — raw bytes, the honest replayable source (rendered/serialized state is lossy). PDF export (the reason 50 K exists), scrollback search, true session restore, and crash forensics all read from it; depth is bounded by a disk retention policy, not RAM. This is #417 as already specified — reuse it, don’t build a parallel store.

3 Lazy backfill on deep scroll

Stop eager-serializing the mirror on attach. Repaint the viewport from the small mirror, then when a client scrolls past the hot window, fetch older ranges from the log and render them (the tmux / wezterm pattern). Cold reconnect becomes cheap and lossless.
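The attach-then-backfill flow can be sketched with an in-memory stand-in. Everything here is hypothetical: the `TerminalHistory` class, its method names, and the use of whole lines in place of raw byte ranges (the real log stores raw PTY bytes, which would need to be re-parsed before rendering). An array plays the role of the on-disk log.

```typescript
// In-memory stand-in for the design: `log` plays the on-disk transcript and
// `hotLines` is the size of the small mirror. Not kolu's API.
class TerminalHistory {
  private log: string[] = []; // cold store: every line ever written
  private readonly hotLines: number;

  constructor(hotLines: number) {
    this.hotLines = hotLines;
  }

  append(line: string): void {
    this.log.push(line);
  }

  /** Attach: return only the hot window, plus where it starts in the full history. */
  attach(): { firstIndex: number; lines: string[] } {
    const firstIndex = Math.max(0, this.log.length - this.hotLines);
    return { firstIndex, lines: this.log.slice(firstIndex) };
  }

  /** Backfill: fetch an older half-open range [from, to) on demand. */
  fetchRange(from: number, to: number): string[] {
    const start = Math.max(0, from);
    const end = Math.min(this.log.length, to);
    return start < end ? this.log.slice(start, end) : [];
  }
}

const history = new TerminalHistory(100);
for (let i = 0; i < 10_000; i++) history.append(`line ${i}`);

const snapshot = history.attach();
console.log(snapshot.firstIndex, snapshot.lines.length); // 9900 100

// The client scrolled above the hot window: fetch the 50 lines just before it.
const older = history.fetchRange(snapshot.firstIndex - 50, snapshot.firstIndex);
console.log(older[0], older.length); // line 9850 50
```

The attach payload stays proportional to the hot window no matter how long the terminal has been running; older content costs something only when a client actually asks for it.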

Now: interim, until #417 lands

#417 is a multi-PR effort; ship the observability net first so a future leak is diagnosable, not a mystery:

Current state & open questions

Fixed in #1427. The recurrence risk that drove this RCA is now bounded, not just identified. The change that matters: the server-side mirror shrank from 50 K to a 10 K DEFAULT_MIRROR_SCROLLBACK (decoupled from the client’s 50 K, which PDF export + interactive scrollback still need), measured ~4× the OOM ceiling. Alongside it, an observability net — kaval-side @kolu/heap-diag that logs the heap/terms curve and arms a near-limit snapshot, so the next approach to the ceiling dumps a snapshot instead of a silent abort. (No explicit heap cap: with the mirror fixed, a cap would only give back the headroom the fix bought — see the note above.)

This raises the ceiling ~4×; the linear-in-count growth remains by design — removing it is the tracked follow-up, not a regression. What remains open:

Reproduced in-process on naiveintent (heap linear in live-terminal count; the production crash signature at the 50 K mirror). The plan landed in #1421; the fix — small mirror (10 K) + kaval diagnostics — shipped in #1427. The deeper follow-up that removes the linear-in-count growth is #417 (with #416 / #413 as the why-50K backstory).