etcd Add growing member test scenario

For #20137

Jun 28 '25 03:06 nwnt

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: nwnt Once this PR has been reviewed and has the lgtm label, please assign ahrtr for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment Approvers can cancel approval by writing /approve cancel in a comment

Jun 28 '25 03:06 k8s-ci-robot

@serathius Not sure if this is the same as what you have in mind. I probably missed a lot of things so definitely would love your feedback about it.

Jun 28 '25 03:06 nwnt

Codecov Report

:white_check_mark: All modified and coverable lines are covered by tests. :warning: Please upload report for BASE (main@26dc1bc). Learn more about missing BASE report. :warning: Report is 449 commits behind head on main.

Additional details and impacted files

@@           Coverage Diff           @@
##             main   #20239   +/-   ##
=======================================
  Coverage        ?   69.44%           
=======================================
  Files           ?      418           
  Lines           ?    34696           
  Branches        ?        0           
=======================================
  Hits            ?    24096           
  Misses          ?     9219           
  Partials        ?     1381

Continue to review full report in Codecov by Sentry.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update 26dc1bc...17b929e. Read the comment docs.

:rocket: New features to boost your workflow:

:snowflake: Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Jun 28 '25 04:06 codecov[bot]

Good work on the implementation, implementing a new failpoint is pretty tricky.

Got an error main_test.go:114: failed to read WAL, err: wal: slice bounds out of range, snapshot[Index: 0, Term: 0], current entry[Index: 4093, Term: 3], len(ents): 0

Problem we are hitting with PersistedRequestsCluster not being able to read WAL files from added members. Current implementation of PersistedRequestsCluster assumes that quorum members have all entries in the WAL. In this case only the initial member has the full WAL, so it fails. I think we could rewrite it to consider each index individually and accepting it as long as there is no conflict.

Unfortunately to merge this feature we will need to rewrite PersistedRequestsCluster first.

Jun 30 '25 10:06 serathius

With https://github.com/etcd-io/etcd/pull/20435 we might be able test it again. Can you rebase and see what errors we get? I assume we might need to change how errors are handled when reading WAL.

Aug 11 '25 10:08 serathius

From pull-etcd-robustness-arm64 might need to debug watch.

    traffic.go:126: Traffic finished before failure was injected: context canceled
    main_test.go:175: Client didn't collect all events, max revision not set
    main_test.go:180: context canceled
    main_test.go:107: stat .: no such file or directory

Aug 19 '25 06:08 serathius

From pull-etcd-integration-1-cpu-amd64 looks like a flake

Aug 19 '25 06:08 serathius

Interesting no flakes on postsubmit https://testgrid.k8s.io/sig-etcd-postsubmits#post-etcd-integration-4-cpu-arm64&include-filter-by-regex=integration.TestCacheWatchOldRevisionCompacted

Aug 19 '25 06:08 serathius

/retest

Aug 19 '25 08:08 serathius

/retest

Aug 19 '25 09:08 serathius

/retest pull-etcd-robustness-arm64 /retest pull-etcd-robustness-amd64

Aug 19 '25 10:08 serathius

@serathius: The /retest command does not accept any targets. The following commands are available to trigger required jobs:

/test pull-etcd-build

/test pull-etcd-contrib-mixin

/test pull-etcd-coverage-report

/test pull-etcd-e2e-386

/test pull-etcd-e2e-amd64

/test pull-etcd-e2e-arm64

/test pull-etcd-fuzzing-v3rpc

/test pull-etcd-govulncheck-main

/test pull-etcd-grpcproxy-e2e-amd64

/test pull-etcd-grpcproxy-e2e-arm64

/test pull-etcd-grpcproxy-integration-amd64

/test pull-etcd-grpcproxy-integration-arm64

/test pull-etcd-integration-1-cpu-amd64

/test pull-etcd-integration-1-cpu-arm64

/test pull-etcd-integration-2-cpu-amd64

/test pull-etcd-integration-2-cpu-arm64

/test pull-etcd-integration-4-cpu-amd64

/test pull-etcd-integration-4-cpu-arm64

/test pull-etcd-markdown-lint

/test pull-etcd-release-tests

/test pull-etcd-robustness-amd64

/test pull-etcd-robustness-arm64

/test pull-etcd-unit-test-386

/test pull-etcd-unit-test-amd64

/test pull-etcd-unit-test-arm64

/test pull-etcd-verify

Use /test all to run the following jobs that were automatically triggered:

pull-etcd-build

pull-etcd-contrib-mixin

pull-etcd-coverage-report

pull-etcd-e2e-386

pull-etcd-e2e-amd64

pull-etcd-e2e-arm64

pull-etcd-fuzzing-v3rpc

pull-etcd-govulncheck-main

pull-etcd-grpcproxy-e2e-amd64

pull-etcd-grpcproxy-e2e-arm64

pull-etcd-grpcproxy-integration-amd64

pull-etcd-grpcproxy-integration-arm64

pull-etcd-integration-1-cpu-amd64

pull-etcd-integration-1-cpu-arm64

pull-etcd-integration-2-cpu-amd64

pull-etcd-integration-2-cpu-arm64

pull-etcd-integration-4-cpu-amd64

pull-etcd-integration-4-cpu-arm64

pull-etcd-release-tests

pull-etcd-robustness-amd64

pull-etcd-robustness-arm64

pull-etcd-unit-test-386

pull-etcd-unit-test-amd64

pull-etcd-unit-test-arm64

pull-etcd-verify

In response to this:

/retest pull-etcd-robustness-arm64 /retest pull-etcd-robustness-amd64

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Aug 19 '25 10:08 k8s-ci-robot

/test pull-etcd-robustness-arm64 /test pull-etcd-robustness-amd64

Aug 19 '25 10:08 serathius

@nwnt: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name	Commit	Details	Required	Rerun command
pull-etcd-coverage-report	17b929e6f721aaf241dc5cceb50877ba35101b31	link	true	`/test pull-etcd-coverage-report`
pull-etcd-robustness-arm64	17b929e6f721aaf241dc5cceb50877ba35101b31	link	true	`/test pull-etcd-robustness-arm64`
pull-etcd-robustness-amd64	17b929e6f721aaf241dc5cceb50877ba35101b31	link	true	`/test pull-etcd-robustness-amd64`

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Aug 19 '25 10:08 k8s-ci-robot

From https://prow.k8s.io/view/gs/kubernetes-ci-logs/pr-logs/pull/etcd-io_etcd/20239/pull-etcd-robustness-arm64/1957745366933704704

{"level":"warn","ts":"2025-08-19T10:08:20.850392Z","caller":"etcdserver/server.go:854","msg":"server error","error":"the member has been permanently removed from the cluster"}
{"level":"warn","ts":"2025-08-19T10:08:20.850415Z","caller":"etcdserver/server.go:855","msg":"data-dir used by this member must be removed"}

Should I remove each process's data directory after scaling down during Inject?

Aug 20 '25 03:08 nwnt

From https://prow.k8s.io/view/gs/kubernetes-ci-logs/pr-logs/pull/etcd-io_etcd/20239/pull-etcd-robustness-amd64/1957745366849818624

/home/prow/go/src/github.com/etcd-io/etcd/bin/etcd (TestRobustnessExploratoryEtcdTrafficDeleteLeasesClusterOfSize3-test-2) (24040): {"level":"warn","ts":"2025-08-19T10:20:00.008061Z","caller":"etcdserver/server.go:1639","msg":"rejecting member remove request; local member has not been connected to all peers, reconfigure breaks active quorum","local-member-id":"bf19ae4419db00dc","requested-member-remove":"eabdbb777cf498cb","active-peers":1,"error":"etcdserver: unhealthy cluster"}

This looks like the condition I should add to Available. What do you think @serathius?

Aug 20 '25 03:08 nwnt

Should I remove each process's data directory after scaling down during Inject?

No, please don't read random logs.

This looks like the condition I should add to Available. What do you think @serathius?

No, this log is related to fact that membership are rejected if there is suspicion that cluster member are unstable. It requires at least 5 seconds of being connected to all members. This could be mitigated by adding 5 seconds sleep before membership changes.

Aug 20 '25 09:08 serathius

Please read MemberReplace failpoint, seems there is a some overlap with your case.

Aug 20 '25 09:08 serathius

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

Oct 27 '25 00:10 github-actions[bot]