Flaky Test Tracker
A backlog of flaky tests — e2e (Cucumber + Playwright) and unit (vitest). Drop a row when you hit one; an agent clears the queue from time to time.
A fix-queue for tests that go red on one CI run and green on the next with no code change. See a flake → add a row. An agent works the backlog over time.
Flake vs. break
A flake fails nondeterministically (timing, ordering, or environment) and would pass on a same-SHA rerun. A break fails the same way on every run and is a real defect, so file a bug instead. A one-off rerun via the odu MCP is fine for triaging which of the two you have, but a rerun is never how a flake gets fixed (see the routine below).
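The triage rule above can be sketched as a small decision function. This is only an illustration: `RunResult`, `Verdict`, and `triage` are hypothetical names, not part of the odu MCP or of this repo, and the "signature" is assumed to be whatever assertion or timeout text the lane reported.

```typescript
// Hypothetical shapes for illustration; the tracker does not define these types.
type RunResult =
  | { status: "pass" }
  | { status: "fail"; signature: string }; // e.g. the assertion or timeout message

type Verdict = "flake" | "break" | "inconclusive";

// Compare the original red run with ONE same-SHA rerun.
function triage(first: RunResult, rerun: RunResult): Verdict {
  if (first.status !== "fail") return "inconclusive"; // nothing went red, nothing to triage
  if (rerun.status === "pass") return "flake"; // same SHA, different outcome
  // Red both times: an identical failure points at a real defect.
  return rerun.signature === first.signature ? "break" : "inconclusive";
}
```

The function only classifies. A "flake" verdict means the test gets a row in the backlog, not that the green rerun settles anything.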
Backlog
Status: open → fixing → fixed (then strike the row).
The first eight rows below were fixed in #1440 (struck = done; the reusable patterns live in Common flake classes). The three rows after them are still open.
| Test | Lane | Symptom | Repro’d in | Status | Fix |
|---|---|---|---|---|---|
| code-tab.feature:714 | e2e@aarch64-darwin | back button never enabled: waitFor 20s timeout | #1440 | fixed | #1440 |
| file-ref-link.feature:69 | e2e@aarch64-darwin | app/core never expands: a fresh-open (cold panel) reveal races fsListAll’s first snapshot via a one-shot resolve with no re-yield, so a commit-marker barrier can’t fix it. Mount the tree first; the fresh-open resolve is covered by lineRef.test.ts | #1440 | fixed | #1440 |
| code-tab.feature:148 | e2e@aarch64-darwin | tree never hydrated: branch-mode gitStatus stuck on BASE_BRANCH_NOT_FOUND | #1440 | fixed | #1440 |
| render_recovery.feature:16 | e2e@aarch64-darwin | AssertionError: screen not repainted on focus regain | #1440 | fixed | #1440 |
| sub-terminal.feature:107 | e2e@x86_64-linux | sub-terminal should have keyboard focus: waitForFunction timeout (focus not restored after close) | #1440 | fixed | #1440 |
| canvas.feature:161 | e2e@aarch64-darwin | tile-centering pan raced the recorded baseline → canvas transform changed unexpectedly | #1440 | fixed | #1440 |
| claude-code.feature:77 | e2e@aarch64-darwin | appended-transcript fs event dropped → task progress 3/5 never showed | #1440 | fixed | #1440 |
| daemon.test.ts | unit@x86_64-linux | waitFor didn’t catch a transient oRPC stream error from the live awareness collection → test threw (vitest, no retry) | #1440 | fixed | #1440 |
| createPtyHost: routes write() to the child and lists live PTYs (kaval/src/ptyHost.test.ts:373) | unit@aarch64-darwin | Test timed out in 5000ms: a real-PTY spawn-then-write test stalled past the 5s default on the darwin box; the linux lane passed the same SHA, and the PR touches no packages/kaval file (single-node rerun green) | #1497 | open | - |
| Sub-terminal keeps keyboard focus after close (sub-terminal.feature, sub_terminal_steps.ts:132) | e2e@aarch64-darwin | the sub-terminal should have keyboard focus: waitForFunction timeout (focus not restored after close); the same flake fixed for the linux lane in #1440 recurring on the loaded darwin box (472/473 scenarios passed), unrelated to the PR’s surface/pulam changes | #1497 | open | - |
| Clicking a folder ref while already browsing expands it in the live tree (file-ref-link.feature:112) | e2e@aarch64-darwin | lib/ui never reaches aria-expanded=true: locator.waitFor 60s timeout across all 3 retries on both runs. A live change into an already-mounted Pierre tree updates the model but never repaints: the unfixed #1534 swallow-emit class (sibling of the fresh-open :69 case fixed in #1440, whose “mount first” fix doesn’t cover the mounted-tree live update). The linux lane passed the same SHA (482/483), and the PR touches no Code-tab/Pierre/file-ref code (all changes are pulam-web + surface), so unrelated to it. Carried by R-pulamweb-4’s vendored @pierre/trees patch. | #1568 | open | - |
Logging a flake
When a lane goes red and a single-node rerun comes back green, add a row: test
name, recipe@platform lane, the assertion/timeout, the PR it reproduced in
(<PrLink pr={…} />), open. No investigation needed to log it.
Keep the tracker in lock-step with your PR. Log a flake your CI surfaced in the
same PR that hit it — don’t defer to a later cleanup; and a PR that fixes a
flake flips its row to fixed (and strikes it) in that
same PR, regenerating docs/atlas/dist/. The queue only stays trustworthy if every
PR that touches a flake updates this note alongside its own diff.
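To make the row shape concrete, here is a minimal sketch of a formatter that turns one logged flake into a backlog table row. `FlakeRow` and `formatRow` are hypothetical helpers, not something this repo ships, and the sketch writes plain `#NNNN` references where the real note uses `<PrLink pr={…} />`.

```typescript
type Status = "open" | "fixing" | "fixed";

// Hypothetical row shape mirroring the backlog columns.
interface FlakeRow {
  test: string;
  lane: string; // recipe@platform, e.g. "e2e@aarch64-darwin"
  symptom: string; // the assertion or timeout text
  reproPr: number; // PR whose CI surfaced the flake
  status: Status;
  fixPr?: number; // set once a fix lands
}

function formatRow(row: FlakeRow): string {
  // Escape pipes and collapse whitespace so a cell cannot break the table.
  const clean = (cell: string): string =>
    cell.replace(/\|/g, "\\|").replace(/\s+/g, " ").trim();
  const cells = [
    clean(row.test),
    clean(row.lane),
    clean(row.symptom),
    `#${row.reproPr}`,
    row.status,
    row.fixPr === undefined ? "-" : `#${row.fixPr}`,
  ];
  return `| ${cells.join(" | ")} |`;
}
```

A newly logged flake needs only the first five fields; the fix column stays empty until the fixing PR flips the row.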
Fixing routine
An agent clears the backlog by driving CI to N consecutive green runs (N = 5
by default, or as given) through the odu MCP (run the test lanes →
wait_for_settle, repeated). The green streak verifies that a fix is real —
it is never a way to wash a flake out.
Non-negotiable rules:
- A single failure is a fix, not a re-run. One red lane → stop and fix the root cause. Re-running to hope for green is forbidden, and any failure resets the streak to 0.
- Fix the test, not the app. A flake is a defect in the test; change only packages/tests, never application code, unless a fix is provably impossible without it.
- Kill the timing dependence, don’t pad it. A “timing issue” is fixed by waiting on a deterministic signal (a real DOM/aria state, an app event, an awaited promise, correct setup ordering), never by bumping a timeout or adding a sleep.
- Reuse past de-flake PRs. Read how this suite was de-flaked before — start from Common flake classes below — and apply the established pattern instead of inventing one.
- Code changes pass `/codex-debate` before the streak counts. Batch the fixes, then drive them through `/codex-debate` to consensus: a CI pass is only trusted on codex-debate-passed commits (docs are exempt, and the debate is expensive, so debate the batch, not each edit). The debate is read-only code review, so its consensus still needs CI verification: defend an empirically grounded fix rather than concede a plausible-but-unverified simplification. A consensus that contradicts an observed CI failure is wrong until CI proves otherwise.
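The "kill the timing dependence" rule can be illustrated with a dependency-free sketch. `FakeTree` and `clickAndAssert` are hypothetical stand-ins for an app component and a test step, not code from packages/tests. The point is only that the step awaits the state it cares about instead of sleeping for a guessed duration.

```typescript
// Hypothetical stand-in for a UI tree whose expansion finishes asynchronously.
class FakeTree {
  expanded = false;
  private waiters: Array<() => void> = [];

  // "App" side: flip the state, then wake everyone waiting on it.
  expand(): void {
    this.expanded = true;
    for (const wake of this.waiters.splice(0)) wake();
  }

  // "Test" side: resolves only once the state has really been reached.
  whenExpanded(): Promise<void> {
    if (this.expanded) return Promise.resolve();
    return new Promise<void>((resolve) => {
      this.waiters.push(() => resolve());
    });
  }
}

// A step that waits on the deterministic signal; there is no timeout to tune.
async function clickAndAssert(tree: FakeTree, click: () => void): Promise<boolean> {
  click(); // kicks off async work of unknown duration
  await tree.whenExpanded(); // returns exactly when the state is reached
  return tree.expanded;
}
```

In a Playwright step the analogous signal would typically be a web-first assertion on real state, such as `expect(locator).toHaveAttribute("aria-expanded", "true")`, rather than a fixed wait.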
The loop: fail → streak resets to 0, root-cause and fix the test
(fixing → fixed, link the PR
with <PrLink pr={…} />); while there, drop any open
row that no longer reproduces. Pass → streak +1. Done = N green runs
back-to-back with the backlog cleared.
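The streak bookkeeping in the loop can be sketched as follows. The names (`LaneOutcome`, `currentStreak`, `isDone`) are hypothetical, and the sketch assumes each CI run collapses to a single green or red outcome.

```typescript
type LaneOutcome = "green" | "red";

// Any red run resets the streak to 0; a green run extends it by one.
function currentStreak(outcomes: LaneOutcome[]): number {
  let streak = 0;
  for (const outcome of outcomes) {
    streak = outcome === "green" ? streak + 1 : 0;
  }
  return streak;
}

// Done = N consecutive greens AND no open rows left in the backlog.
function isDone(streak: number, openRows: number, target = 5): boolean {
  return streak >= target && openRows === 0;
}
```

For example, four greens followed by one red and one green leaves the streak at 1, not 5: the red run discards everything before it.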
Common flake classes
The shapes this suite keeps throwing, and the fix that held — reach for these before inventing one (full detail in each linked PR / the fix’s code comment).
- darwin drops a second fs event. An append or a 2nd create (a transcript append, a `<pid>.json` after its dir) is the event FSEvents/inotify drops under parallel-worker load, so the watcher never re-fires. Fix: write data-then-trigger (the reliable first event carries the payload), or `nudgeFiles` the path every poll tick (tests/support/nudge.ts) to re-fire the watch. (claude-code.feature:77)
- A one-shot resolve against a stream’s first snapshot can’t recover. If a consumer resolves once on the first `!pending()` frame and the stream won’t re-yield (state already settled), a stale first snapshot is permanent, and no marker barrier saves it. Fix: warm / mount the source first so it has enumerated before the action. (file-ref-link.feature:69)
- A baseline recorded mid-settle drifts. Recording a transform/value while an animation or a sensor re-resolve is in flight captures a moving target. Fix: settle-before-baseline, that is, wait on the steady-state signal (tile centered, git settled to the repo, tree row enumerated) before recording or asserting. (canvas.feature:161, code-tab.feature:714)
- A passive reader observes a half-built git state. A gitStatus first read can tear the stream down for good on `BASE_BRANCH_NOT_FOUND` if it lands between repo-init and base-ref creation. Fix: make the base ref exist atomically (seed a bare origin, then `git clone`) and do setup in a subshell so the terminal’s cwd never enters the in-between repo. (code-tab.feature:148)
- Faking occlusion by swallowing render output leaks. Suppressing `refreshRows` still lets an incidental sync paint through. Fix: model the real freeze by parking `requestAnimationFrame` so only a forced sync repaint counts. (render_recovery.feature:16)
- An edge-triggered effect can’t re-assert after a non-change (app fix). Focus stolen by a transient element when the reactive `focused` state didn’t change leaves the edge effect unfired. Fix: bump a level nonce the handler increments so the effect re-asserts. (sub-terminal.feature:107)
- Polling a live, reconciling collection throws transients (esp. vitest, which has no retry). A poll hits transient stream errors as the collection settles or a key reconciles out mid-read. Fix: catch-and-retry in the poll, and skip a key that vanishes between `keys()` and `get()`; the snapshot helper owns the narrow suppression, the poller stays a pure condition-checker. (daemon.test.ts)
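The last class, polling a reconciling collection, can be sketched as below. This is a rough reading of the pattern, not the code from the linked fix: `LiveCollection`, `snapshot`, and `pollUntil` are hypothetical names, and the sketch assumes a vanished key shows up as `get()` returning `undefined`.

```typescript
// Hypothetical minimal view of a live collection that reconciles while being read.
interface LiveCollection<V> {
  keys(): string[];
  get(key: string): V | undefined; // assumed: undefined once a key has reconciled out
}

// The snapshot helper owns the narrow suppression:
// a key that vanishes between keys() and get() is skipped, not treated as an error.
function snapshot<V>(collection: LiveCollection<V>): Map<string, V> {
  const out = new Map<string, V>();
  for (const key of collection.keys()) {
    const value = collection.get(key);
    if (value !== undefined) out.set(key, value);
  }
  return out;
}

// The poller only checks a condition; a throw from the check is retried, not fatal.
async function pollUntil(check: () => boolean, attempts = 50): Promise<void> {
  for (let i = 0; i < attempts; i++) {
    try {
      if (check()) return;
    } catch {
      // Transient error while the collection settles: fall through and retry.
    }
    await Promise.resolve(); // yield; a real poll would wait one tick interval here
  }
  throw new Error(`condition not met after ${attempts} attempts`);
}
```

Keeping the vanished-key handling inside `snapshot` means the poller needs no knowledge of the collection, and a condition that never becomes true still fails loudly once the attempts run out.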