Flaky TestCacheLaggingWatcher
Which Github Action / Prow Jobs are flaking?
pull-etcd-coverage-report
Which tests are flaking?
TestCacheLaggingWatcher
Github Action / Prow Job link
https://testgrid.k8s.io/sig-etcd-presubmits#pull-etcd-coverage-report&exclude-non-failed-tests=&include-filter-by-regex=TestCacheLaggingWatcher&width=20&show-stale-tests=
Reason for failure (if possible)
It says:
=== Failed
=== FAIL: integration TestCacheLaggingWatcher/pipeline_overflow (0.11s)
getWatch failed, will retry after 50ms: cache: stale event batch (rev 32 < latest 42)
cache_test.go:1123: gotEvents=10, wantEvents<1
getWatch failed, will retry after 50ms: context canceled
--- FAIL: TestCacheLaggingWatcher/pipeline_overflow (0.11s)
=== FAIL: integration TestCacheLaggingWatcher (0.77s)
https://prow.k8s.io/view/gs/kubernetes-ci-logs/pr-logs/directory/pull-etcd-coverage-report/1981495416251748352#
Anything else we need to know?
This has been failing in recent days.
Hi Iván! @ivanvc Would you like me to work on this issue?
Yes, @rockswe! Please feel free to work on this. It's showing up in almost every CI run now, so I'd say it's high priority. If you want to take it, please post a comment with /assign.
Thanks :)
/assign
/cc @apullo777
Thanks for flagging this flake, and thanks @rockswe for helping out. @serathius mentioned it to me a few weeks ago, but I’ve been swamped lately and haven’t had a chance to look into it.
For more context, the test tries to prove: “if a watcher’s tiny buffer overflows, it gets marked lagging and its channel closes.” It also expects the watcher to see at most 1 event before closing, because one send may already be mid-flight when overflow is detected.
My thinking is that the flake comes from a race between when the watcher overflows (gets marked “lagging”) and when the periodic resync runs. After the watcher overflows, the resync may either cancel it (if it fell too far behind) or quickly catch it up from recent history. With a 10-event history and 12 writes, overflow sometimes happens late, so the resync replays ~10 events, which is why we occasionally see gotEvents = 10.
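For readers less familiar with the mechanism, here is a tiny standalone Go sketch of the overflow pattern described above. It is not the actual cache code; the names watcher, respCh, and trySend are made up purely for illustration.

package main

import "fmt"

// watchResponse stands in for one batched delivery to a watcher.
type watchResponse struct {
	events []string
}

// watcher has a deliberately tiny buffer, mirroring the test setup.
type watcher struct {
	respCh  chan watchResponse
	lagging bool
}

// trySend delivers without blocking; on overflow the watcher is marked
// lagging and its channel is closed, so the reader sees at most whatever
// was already buffered or mid-flight.
func (w *watcher) trySend(r watchResponse) {
	select {
	case w.respCh <- r:
	default:
		w.lagging = true
		close(w.respCh)
	}
}

func main() {
	w := &watcher{respCh: make(chan watchResponse, 1)}
	w.trySend(watchResponse{events: []string{"put /a"}}) // fits in the buffer
	w.trySend(watchResponse{events: []string{"put /b"}}) // overflow: marked lagging, channel closed
	fmt.Println("lagging:", w.lagging)                   // lagging: true
}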
Maybe we should try to make the test more deterministic? I can reproduce the flake locally, and adding a short sleep before the read seems to make it more stable.
Thank you @apullo777!
I agree on making the test deterministic.
The flake is a combination of (a) batching, where one buffered response can contain many events, and (b) the resync loop racing in and replaying history after the overflow. So the invariant that actually holds with a tiny buffer is “≤1 WatchResponse before the watcher is marked lagging,” not “≤1 event.”
I would:
- Disable resync for this test by using a huge resyncInterval so it won’t fire during the test run.
- Assert on response count, not event count (rough sketch below).
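To make the second point concrete, here is a small standalone Go sketch. It is not the real test code; the watchResponse type and the channel setup are invented for illustration. It just shows why counting events overstates what the watcher received when one response batches many events.

package main

import "fmt"

// watchResponse batches several events into a single delivery.
type watchResponse struct {
	events []string
}

func main() {
	// One buffered response that happens to carry 10 events, similar to what
	// a resync replay of the 10-event history could produce.
	ch := make(chan watchResponse, 1)
	ch <- watchResponse{events: []string{"e1", "e2", "e3", "e4", "e5", "e6", "e7", "e8", "e9", "e10"}}
	close(ch) // the cache closes the channel once the watcher is marked lagging

	gotResponses, gotEvents := 0, 0
	for resp := range ch {
		gotResponses++
		gotEvents += len(resp.events)
	}
	// Asserting gotEvents <= 1 flakes (here gotEvents is 10), while asserting
	// gotResponses <= 1 holds: only one delivery made it out before the close.
	fmt.Println("responses:", gotResponses, "events:", gotEvents)
}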
any progress on the issue?
Hey @serathius, I tried a small, targeted change that adds a short sleep right after generateEvents() for the pipeline overflow case; it’s not a guaranteed fix, but empirically it noticeably deflakes the test.
Proposed change:
generateEvents(t, client, prefix, tt.eventCount)
if tt.name == "pipeline overflow" {
	// Give the tiny buffer time to overflow and the watcher to be marked
	// lagging before we start draining the channel.
	time.Sleep(100 * time.Millisecond)
}
gotEvents, ok := collectAndAssertAtomicEvents(t, ch)
closed := !ok
Stress command I used locally to reproduce the flake:
GOMAXPROCS=1 $HOME/go/bin/stress -p 8 -timeout 20m \
./integration.test -test.v -test.run 'TestCacheLaggingWatcher/pipeline_overflow$' \
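(Side note, in case it helps others reproduce: the stress tool here is golang.org/x/tools/cmd/stress, installable with go install golang.org/x/tools/cmd/stress@latest, and the integration.test binary can be built with go test -c -o integration.test pointed at the package containing cache_test.go.)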
before the change:
…
6m5s: 1221 runs so far, 5 failures (0.41%), 8 active
6m10s: 1245 runs so far, 5 failures (0.40%), 8 active
6m15s: 1261 runs so far, 5 failures (0.40%), 8 active
after the change:
…
6m35s: 1337 runs so far, 0 failures, 8 active
6m40s: 1353 runs so far, 0 failures, 8 active
6m45s: 1369 runs so far, 0 failures, 8 active
Should I open a PR with the 100ms sleep so we can confirm whether CI also stops flaking?
Sure