
deflake TestCacheLaggingWatcher by waiting for resync

Open · apullo777 opened this pull request 2 months ago • 5 comments

Please read https://github.com/etcd-io/etcd/blob/main/CONTRIBUTING.md#contribution-flow.

To address #20852

This PR tries to deflake TestCacheLaggingWatcher/pipeline_overflow by adding a short wait before asserting on the collected events.
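
For illustration, here is a minimal sketch of the waiting pattern this aims at (not the exact diff; `collectedEvents` and `wantCount` are hypothetical stand-ins for whatever the test tracks), polling with a bounded deadline instead of asserting immediately:

```go
package cache_test

import (
	"testing"
	"time"

	"github.com/stretchr/testify/require"
)

// Illustrative sketch only; not the actual test change. collectedEvents and
// wantCount are hypothetical stand-ins for the test's event counter and the
// expected number of events.
func waitForEvents(t *testing.T, collectedEvents func() int, wantCount int) {
	t.Helper()
	// Poll with a bounded deadline instead of sleeping for a fixed amount:
	// assert only once the watcher has had a chance to deliver everything.
	require.Eventually(t, func() bool {
		return collectedEvents() >= wantCount
	}, time.Second, 10*time.Millisecond,
		"expected %d events before the deadline", wantCount)
}
```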

@serathius

apullo777 avatar Nov 15 '25 14:11 apullo777

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: apullo777. Once this PR has been reviewed and has the lgtm label, please assign ivanvc for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment. Approvers can cancel approval by writing /approve cancel in a comment.

k8s-ci-robot avatar Nov 15 '25 14:11 k8s-ci-robot

Hi @apullo777. Thanks for your PR.

I'm waiting for a github.com member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot avatar Nov 15 '25 14:11 k8s-ci-robot

This barely lowers the flakiness on my machine from 9% to 7%. Adding random sleeps is not a solution.

serathius avatar Nov 17 '25 17:11 serathius

> This barely lowers the flakiness on my machine from 9% to 7%. Adding random sleeps is not a solution.

I just realized I cannot reproduce the specific CI flakes locally.

I noticed that my local stress failures (which also affect events_fill_window and event_fill_pipeline) do not include the `getWatch failed, will retry after 50ms: cache: stale event batch (rev 32 < latest 42)` log line. I suspect my local failures are primarily race conditions between the test reader and the resync loop.

In contrast, the CI failure seems to point to something deeper, perhaps related to our watch retry logic reusing a stale start revision (broadcasting an event at rev 32 while demux.maxRev=42)?

Are you able to reproduce that specific CI flake locally? And would you suggest I focus on finding a way to stabilize the test races first, or prioritize finding a reproduction path for the CI flake?
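
For reference, here is a tiny hypothetical sketch (made-up names, not the actual cache/demux code) of the kind of revision guard that would emit a stale-batch error like the one in that log line:

```go
package main

import "fmt"

// Hypothetical demux: not etcd's actual implementation, just an illustration
// of a high-water-mark check that rejects event batches older than what has
// already been broadcast.
type demux struct {
	maxRev int64 // highest revision already broadcast to watchers
}

func (d *demux) broadcast(batchRev int64) error {
	if batchRev < d.maxRev {
		// A retry that reuses a stale start revision would land here, e.g.
		// replaying rev 32 after the store has already progressed to rev 42.
		return fmt.Errorf("cache: stale event batch (rev %d < latest %d)", batchRev, d.maxRev)
	}
	d.maxRev = batchRev
	return nil
}

func main() {
	d := &demux{maxRev: 42}
	fmt.Println(d.broadcast(32)) // cache: stale event batch (rev 32 < latest 42)
}
```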

apullo777 avatar Nov 22 '25 00:11 apullo777

I've dug deeper and it looks like the root cause is in the cache internals: the internal store watcher was using the same buffer size as client watchers, causing it to lag during burst traffic. When it lagged, the resync mechanism tried to replay stale events (e.g., rev 32) to a store that had already progressed (rev 42), causing the validation error.

I implemented a fix that:

  1. Gives the internal store watcher a dedicated buffer, auto-sized to roughly 50x the client watcher buffer (rough sketch below)
  2. Prevents it from ever being resynced, restarting it through a proper restart path instead
  3. Lets the cache absorb production-scale bursts without cascading failures
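
A rough sketch of the buffer-sizing idea, with made-up names and sizes (not the actual patch):

```go
package cache

// Made-up names and numbers; the real patch may size things differently.
const (
	// clientWatcherBufferSize is the (hypothetical) per-client watcher buffer.
	clientWatcherBufferSize = 128
	// storeWatcherMultiplier makes the internal store watcher's dedicated
	// buffer roughly 50x larger, so bursts that overflow client watchers do
	// not also lag the cache's own store feed.
	storeWatcherMultiplier = 50

	storeWatcherBufferSize = clientWatcherBufferSize * storeWatcherMultiplier
)

type event struct {
	Rev int64
}

// newStoreWatcherCh returns the buffered channel that feeds the cache's
// internal store; client watchers keep the smaller clientWatcherBufferSize.
func newStoreWatcherCh() chan event {
	return make(chan event, storeWatcherBufferSize)
}
```

Under this split, the resync/replay path is reserved for client watchers; if the store feed still falls behind, it is restarted rather than resynced, so the stale-replay error above never applies to it.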

The fix passes 20+ consecutive stress-test runs with zero failures. I'll open a new PR with this root-cause fix and link it here. Thanks for investigating this! Your work helped identify the pattern.

rockswe avatar Nov 22 '25 06:11 rockswe