Add growing member test scenario
For #20137
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: nwnt. Once this PR has been reviewed and has the lgtm label, please assign ahrtr for approval. For more information see the Code Review Process.
The full list of commands accepted by this bot can be found here.
Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
@serathius Not sure if this is the same as what you have in mind. I probably missed a lot of things, so I'd definitely love your feedback on it.
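Roughly, the scenario first registers the new member with the running cluster and only then starts its process so it joins as an existing-cluster member. A minimal sketch of that first step (the `growCluster` helper is just for illustration here; only the `clientv3.MemberAdd` call is the real client API, and the process itself would be started separately by the e2e framework):

```go
package failpoint

import (
	"context"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// growCluster registers a new member with the running cluster and returns its
// ID. The new member's process is started afterwards by the e2e framework with
// --initial-cluster-state=existing.
func growCluster(ctx context.Context, cli *clientv3.Client, peerURL string) (uint64, error) {
	ctx, cancel := context.WithTimeout(ctx, 10*time.Second)
	defer cancel()

	// The member must be announced to the cluster before its process starts.
	resp, err := cli.MemberAdd(ctx, []string{peerURL})
	if err != nil {
		return 0, err
	}
	return resp.Member.ID, nil
}
```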
Codecov Report
:white_check_mark: All modified and coverable lines are covered by tests.
:warning: Please upload report for BASE (main@26dc1bc). Learn more about missing BASE report.
:warning: Report is 449 commits behind head on main.
Additional details and impacted files
Coverage Diff

| | main | #20239 | +/- |
|---|---|---|---|
| Coverage | ? | 69.44% | |
| Files | ? | 418 | |
| Lines | ? | 34696 | |
| Branches | ? | 0 | |
| Hits | ? | 24096 | |
| Misses | ? | 9219 | |
| Partials | ? | 1381 | |
Continue to review full report in Codecov by Sentry.
Good work on the implementation; implementing a new failpoint is pretty tricky.
Got an error:
main_test.go:114: failed to read WAL, err: wal: slice bounds out of range, snapshot[Index: 0, Term: 0], current entry[Index: 4093, Term: 3], len(ents): 0
The problem we are hitting is that PersistedRequestsCluster is not able to read the WAL files from added members. The current implementation of PersistedRequestsCluster assumes that a quorum of members has all entries in the WAL. In this case only the initial member has the full WAL, so it fails. I think we could rewrite it to consider each index individually and accept an entry as long as there is no conflict.
Unfortunately, to merge this feature we will need to rewrite PersistedRequestsCluster first.
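To illustrate the direction (a sketch of the per-index idea only, not the current PersistedRequestsCluster code; `mergeWALEntries` is a made-up name): accept an entry persisted by any member, let a higher term overwrite a lower one at the same index, and only fail on a genuine conflict.

```go
package report

import (
	"bytes"
	"fmt"

	"go.etcd.io/raft/v3/raftpb"
)

// mergeWALEntries merges per-member WAL entries index by index instead of
// requiring a quorum of members to hold every entry. An entry is accepted if
// any member persisted it; a higher term overwrites a lower one at the same
// index (raft overwrites uncommitted entries), and two entries with the same
// index and term but different data are reported as a conflict.
func mergeWALEntries(memberEntries [][]raftpb.Entry) ([]raftpb.Entry, error) {
	byIndex := map[uint64]raftpb.Entry{}
	for _, entries := range memberEntries {
		for _, e := range entries {
			prev, ok := byIndex[e.Index]
			if !ok || e.Term > prev.Term {
				byIndex[e.Index] = e
				continue
			}
			if e.Term == prev.Term && !bytes.Equal(prev.Data, e.Data) {
				return nil, fmt.Errorf("conflicting entries at index %d, term %d", e.Index, e.Term)
			}
		}
	}
	// Walk indexes in order starting from the lowest one seen; members added
	// later start from a snapshot, so gaps at the front are expected.
	var merged []raftpb.Entry
	for i := lowestIndex(byIndex); len(merged) < len(byIndex); i++ {
		e, ok := byIndex[i]
		if !ok {
			return nil, fmt.Errorf("gap in merged WAL at index %d", i)
		}
		merged = append(merged, e)
	}
	return merged, nil
}

func lowestIndex(m map[uint64]raftpb.Entry) uint64 {
	low, first := uint64(0), true
	for i := range m {
		if first || i < low {
			low, first = i, false
		}
	}
	return low
}
```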
With https://github.com/etcd-io/etcd/pull/20435 we might be able to test it again. Can you rebase and see what errors we get? I assume we might need to change how errors are handled when reading the WAL.
From pull-etcd-robustness-arm64, it looks like we might need to debug the watch:
traffic.go:126: Traffic finished before failure was injected: context canceled
main_test.go:175: Client didn't collect all events, max revision not set
main_test.go:180: context canceled
main_test.go:107: stat .: no such file or directory
From pull-etcd-integration-1-cpu-amd64, it looks like a flake.
Interesting, no flakes on the postsubmit: https://testgrid.k8s.io/sig-etcd-postsubmits#post-etcd-integration-4-cpu-arm64&include-filter-by-regex=integration.TestCacheWatchOldRevisionCompacted
/retest
/retest
/retest pull-etcd-robustness-arm64 /retest pull-etcd-robustness-amd64
@serathius: The /retest command does not accept any targets.
The following commands are available to trigger required jobs:
/test pull-etcd-build
/test pull-etcd-contrib-mixin
/test pull-etcd-coverage-report
/test pull-etcd-e2e-386
/test pull-etcd-e2e-amd64
/test pull-etcd-e2e-arm64
/test pull-etcd-fuzzing-v3rpc
/test pull-etcd-govulncheck-main
/test pull-etcd-grpcproxy-e2e-amd64
/test pull-etcd-grpcproxy-e2e-arm64
/test pull-etcd-grpcproxy-integration-amd64
/test pull-etcd-grpcproxy-integration-arm64
/test pull-etcd-integration-1-cpu-amd64
/test pull-etcd-integration-1-cpu-arm64
/test pull-etcd-integration-2-cpu-amd64
/test pull-etcd-integration-2-cpu-arm64
/test pull-etcd-integration-4-cpu-amd64
/test pull-etcd-integration-4-cpu-arm64
/test pull-etcd-markdown-lint
/test pull-etcd-release-tests
/test pull-etcd-robustness-amd64
/test pull-etcd-robustness-arm64
/test pull-etcd-unit-test-386
/test pull-etcd-unit-test-amd64
/test pull-etcd-unit-test-arm64
/test pull-etcd-verify
Use /test all to run the following jobs that were automatically triggered:
pull-etcd-build
pull-etcd-contrib-mixin
pull-etcd-coverage-report
pull-etcd-e2e-386
pull-etcd-e2e-amd64
pull-etcd-e2e-arm64
pull-etcd-fuzzing-v3rpc
pull-etcd-govulncheck-main
pull-etcd-grpcproxy-e2e-amd64
pull-etcd-grpcproxy-e2e-arm64
pull-etcd-grpcproxy-integration-amd64
pull-etcd-grpcproxy-integration-arm64
pull-etcd-integration-1-cpu-amd64
pull-etcd-integration-1-cpu-arm64
pull-etcd-integration-2-cpu-amd64
pull-etcd-integration-2-cpu-arm64
pull-etcd-integration-4-cpu-amd64
pull-etcd-integration-4-cpu-arm64
pull-etcd-release-tests
pull-etcd-robustness-amd64
pull-etcd-robustness-arm64
pull-etcd-unit-test-386
pull-etcd-unit-test-amd64
pull-etcd-unit-test-arm64
pull-etcd-verify
In response to this:
/retest pull-etcd-robustness-arm64 /retest pull-etcd-robustness-amd64
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
/test pull-etcd-robustness-arm64 /test pull-etcd-robustness-amd64
@nwnt: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:
| Test name | Commit | Details | Required | Rerun command |
|---|---|---|---|---|
| pull-etcd-coverage-report | 17b929e6f721aaf241dc5cceb50877ba35101b31 | link | true | /test pull-etcd-coverage-report |
| pull-etcd-robustness-arm64 | 17b929e6f721aaf241dc5cceb50877ba35101b31 | link | true | /test pull-etcd-robustness-arm64 |
| pull-etcd-robustness-amd64 | 17b929e6f721aaf241dc5cceb50877ba35101b31 | link | true | /test pull-etcd-robustness-amd64 |
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
From https://prow.k8s.io/view/gs/kubernetes-ci-logs/pr-logs/pull/etcd-io_etcd/20239/pull-etcd-robustness-arm64/1957745366933704704
{"level":"warn","ts":"2025-08-19T10:08:20.850392Z","caller":"etcdserver/server.go:854","msg":"server error","error":"the member has been permanently removed from the cluster"}
{"level":"warn","ts":"2025-08-19T10:08:20.850415Z","caller":"etcdserver/server.go:855","msg":"data-dir used by this member must be removed"}
Should I remove each process's data directory after scaling down during Inject?
From https://prow.k8s.io/view/gs/kubernetes-ci-logs/pr-logs/pull/etcd-io_etcd/20239/pull-etcd-robustness-amd64/1957745366849818624
/home/prow/go/src/github.com/etcd-io/etcd/bin/etcd (TestRobustnessExploratoryEtcdTrafficDeleteLeasesClusterOfSize3-test-2) (24040): {"level":"warn","ts":"2025-08-19T10:20:00.008061Z","caller":"etcdserver/server.go:1639","msg":"rejecting member remove request; local member has not been connected to all peers, reconfigure breaks active quorum","local-member-id":"bf19ae4419db00dc","requested-member-remove":"eabdbb777cf498cb","active-peers":1,"error":"etcdserver: unhealthy cluster"}
This looks like the condition I should add to Available. What do you think @serathius?
> Should I remove each process's data directory after scaling down during Inject?
No, please don't read random logs.
> This looks like the condition I should add to Available. What do you think @serathius?
No, this log is related to the fact that membership changes are rejected if there is a suspicion that cluster members are unstable. The server requires at least 5 seconds of being connected to all members. This could be mitigated by adding a 5-second sleep before membership changes.
Please read the MemberReplace failpoint; it seems there is some overlap with your case.
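For example, in the Inject path, something along these lines could work (a sketch only; `removeMemberAfterWarmup` is a made-up helper and only `clientv3.MemberRemove` is the real client call; the 6 seconds just gives the server slightly more than its 5-second connectivity requirement):

```go
package failpoint

import (
	"context"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// removeMemberAfterWarmup waits out etcd's "local member has not been
// connected to all peers" guard before requesting a member removal, since the
// server rejects reconfiguration until peers have been connected long enough.
func removeMemberAfterWarmup(ctx context.Context, cli *clientv3.Client, memberID uint64) error {
	select {
	case <-time.After(6 * time.Second):
	case <-ctx.Done():
		return ctx.Err()
	}
	_, err := cli.MemberRemove(ctx, memberID)
	return err
}
```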
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.