
Flaky TestCacheLaggingWatcher

Open ivanvc opened this issue 2 months ago • 9 comments

Which Github Action / Prow Jobs are flaking?

pull-etcd-coverage-report

Which tests are flaking?

TestCacheLaggingWatcher

Github Action / Prow Job link

https://testgrid.k8s.io/sig-etcd-presubmits#pull-etcd-coverage-report&exclude-non-failed-tests=&include-filter-by-regex=TestCacheLaggingWatcher&width=20&show-stale-tests=

Reason for failure (if possible)

It says:

 === Failed
=== FAIL: integration TestCacheLaggingWatcher/pipeline_overflow (0.11s)
getWatch failed, will retry after 50ms: cache: stale event batch (rev 32 < latest 42)
    cache_test.go:1123: gotEvents=10, wantEvents<1
getWatch failed, will retry after 50ms: context canceled
    --- FAIL: TestCacheLaggingWatcher/pipeline_overflow (0.11s)
=== FAIL: integration TestCacheLaggingWatcher (0.77s) 

https://prow.k8s.io/view/gs/kubernetes-ci-logs/pr-logs/directory/pull-etcd-coverage-report/1981495416251748352#

Anything else we need to know?

This has been failing in recent days.

ivanvc avatar Oct 24 '25 00:10 ivanvc

Hi Iván! @ivanvc Would you like me to work on this issue?

rockswe avatar Oct 24 '25 00:10 rockswe

Yes, @rockswe, please feel free to work on this. It's failing in almost every CI run now, so I'd say it's high priority. To get assigned, please post a comment with /assign.

Thanks :)

ivanvc avatar Oct 24 '25 02:10 ivanvc

/assign

rockswe avatar Oct 24 '25 02:10 rockswe

/cc @apullo777

serathius avatar Oct 24 '25 08:10 serathius

Thanks for flagging this flake, and thanks @rockswe for offering to help. @serathius mentioned it to me a few weeks ago, but I’ve been swamped lately and haven’t had a chance to look into it.

For more context, the test tries to prove: “if a watcher’s tiny buffer overflows, it gets marked lagging and its channel closes.” It also expects the watcher to see at most 1 event before closing, because one send may already be mid-flight when overflow is detected.

I think the flake comes from a race between when the watcher overflows (gets marked “lagging”) and when the periodic resync runs. After the watcher overflows, the resync may either cancel it (if it fell too far behind) or quickly catch it up from recent history. With a 10-event history and 12 writes, overflow sometimes happens late enough that the resync replays ~10 events, which is why we occasionally saw a gotEvents = 10 result.
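
To make the mechanism concrete, here is a rough toy model of that race in plain Go. It is only illustrative: the buffer, history, and resync below are stand-ins, not the actual cache implementation.

package main

import "fmt"

func main() {
	const (
		bufferSize  = 1  // tiny per-watcher buffer
		historySize = 10 // recent events kept for resync
		writes      = 12 // events generated by the test
	)

	var (
		delivered []int // events the watcher ends up observing
		history   []int // recent events retained by the cache
		lagging   bool
	)

	// Direct pipeline: offer each write to the watcher's buffer.
	for rev := 1; rev <= writes; rev++ {
		history = append(history, rev)
		if len(history) > historySize {
			history = history[1:]
		}
		if !lagging {
			if len(delivered) < bufferSize {
				delivered = append(delivered, rev)
			} else {
				lagging = true // overflow: the watcher is marked lagging
			}
		}
	}

	// The race: if resync runs before the lagging watcher is cancelled,
	// it replays history past the watcher's last delivered revision
	// instead of closing the channel right away.
	resyncWinsRace := true
	if lagging && resyncWinsRace {
		last := 0
		if n := len(delivered); n > 0 {
			last = delivered[n-1]
		}
		for _, rev := range history {
			if rev > last {
				delivered = append(delivered, rev)
			}
		}
	}

	// The test asserts at most 1 event, but the watcher sees far more.
	fmt.Printf("watcher observed %d events (test expects at most 1)\n", len(delivered))
}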

Maybe we should try to make the test more deterministic? I can reproduce the flake locally, and adding a short sleep before the read seems to make it more stable.

apullo777 avatar Oct 24 '25 16:10 apullo777

Thank you @apullo777!

I agree on making the test deterministic.

The flake is a combination of (a) batching (one buffered WatchResponse can contain many events) and (b) the resync loop racing in and replaying history after the overflow. So the invariant that actually holds with a tiny buffer is “≤1 WatchResponse before the watcher is marked lagging,” not “≤1 event.”
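
As a tiny illustration of point (a), with a stand-in WatchResponse type (the real watch response also carries a slice of events per message):

package main

import "fmt"

// Stand-in for a watch response that batches several events.
type WatchResponse struct{ Events []string }

func main() {
	// A single buffered response that happens to batch three events.
	responses := []WatchResponse{{Events: []string{"put /a", "put /b", "put /c"}}}

	events := 0
	for _, r := range responses {
		events += len(r.Events)
	}
	fmt.Printf("responses=%d events=%d\n", len(responses), events)
	// Asserting "events <= 1" flakes whenever batching happens;
	// asserting "responses <= 1" does not.
}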

I would:

  1. Disable resync for this test by using a huge resyncInterval so it won’t fire during the test run.

  2. Assert on response count, not event count.

rockswe avatar Oct 24 '25 18:10 rockswe

any progress on the issue?

serathius avatar Nov 12 '25 00:11 serathius

> any progress on the issue?

Hey @serathius, I tried a small, targeted change that adds a short sleep right after generateEvents() for the pipeline overflow case; it’s not a guaranteed fix, but empirically it noticeably deflakes the test.

Proposed change:

generateEvents(t, client, prefix, tt.eventCount)

if tt.name == "pipeline overflow" {
    time.Sleep(100 * time.Millisecond)
}

gotEvents, ok := collectAndAssertAtomicEvents(t, ch)
closed := !ok

Stress command I used locally to reproduce the flake:

GOMAXPROCS=1 $HOME/go/bin/stress -p 8 -timeout 20m \
  ./integration.test -test.v -test.run 'TestCacheLaggingWatcher/pipeline_overflow$'
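
(For anyone else reproducing: stress here is the golang.org/x/tools/cmd/stress tool, installable with go install golang.org/x/tools/cmd/stress@latest, and integration.test is the test binary compiled with go test -c in the package that contains cache_test.go.)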

before the change:

…
6m5s: 1221 runs so far, 5 failures (0.41%), 8 active
6m10s: 1245 runs so far, 5 failures (0.40%), 8 active
6m15s: 1261 runs so far, 5 failures (0.40%), 8 active

after the change:

…
6m35s: 1337 runs so far, 0 failures, 8 active
6m40s: 1353 runs so far, 0 failures, 8 active
6m45s: 1369 runs so far, 0 failures, 8 active

Should I open a PR with the 100ms sleep so we can confirm whether CI also stops flaking?

apullo777 avatar Nov 13 '25 16:11 apullo777

Sure

serathius avatar Nov 15 '25 08:11 serathius