etcd icon indicating copy to clipboard operation
etcd copied to clipboard

Add growing member test scenario

Open nwnt opened this issue 6 months ago • 19 comments

For #20137

nwnt avatar Jun 28 '25 03:06 nwnt

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: nwnt Once this PR has been reviewed and has the lgtm label, please assign ahrtr for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot avatar Jun 28 '25 03:06 k8s-ci-robot

@serathius Not sure if this is the same as what you have in mind. I probably missed a lot of things so definitely would love your feedback about it.

nwnt avatar Jun 28 '25 03:06 nwnt

Codecov Report

:white_check_mark: All modified and coverable lines are covered by tests. :warning: Please upload report for BASE (main@26dc1bc). Learn more about missing BASE report. :warning: Report is 449 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main   #20239   +/-   ##
=======================================
  Coverage        ?   69.44%           
=======================================
  Files           ?      418           
  Lines           ?    34696           
  Branches        ?        0           
=======================================
  Hits            ?    24096           
  Misses          ?     9219           
  Partials        ?     1381           

Continue to review full report in Codecov by Sentry.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update 26dc1bc...17b929e. Read the comment docs.

:rocket: New features to boost your workflow:
  • :snowflake: Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

codecov[bot] avatar Jun 28 '25 04:06 codecov[bot]

Good work on the implementation, implementing a new failpoint is pretty tricky.

Got an error main_test.go:114: failed to read WAL, err: wal: slice bounds out of range, snapshot[Index: 0, Term: 0], current entry[Index: 4093, Term: 3], len(ents): 0

Problem we are hitting with PersistedRequestsCluster not being able to read WAL files from added members. Current implementation of PersistedRequestsCluster assumes that quorum members have all entries in the WAL. In this case only the initial member has the full WAL, so it fails. I think we could rewrite it to consider each index individually and accepting it as long as there is no conflict.

Unfortunately to merge this feature we will need to rewrite PersistedRequestsCluster first.

serathius avatar Jun 30 '25 10:06 serathius

With https://github.com/etcd-io/etcd/pull/20435 we might be able test it again. Can you rebase and see what errors we get? I assume we might need to change how errors are handled when reading WAL.

serathius avatar Aug 11 '25 10:08 serathius

From pull-etcd-robustness-arm64 might need to debug watch.

    traffic.go:126: Traffic finished before failure was injected: context canceled
    main_test.go:175: Client didn't collect all events, max revision not set
    main_test.go:180: context canceled
    main_test.go:107: stat .: no such file or directory

serathius avatar Aug 19 '25 06:08 serathius

From pull-etcd-integration-1-cpu-amd64 looks like a flake

serathius avatar Aug 19 '25 06:08 serathius

Interesting no flakes on postsubmit https://testgrid.k8s.io/sig-etcd-postsubmits#post-etcd-integration-4-cpu-arm64&include-filter-by-regex=integration.TestCacheWatchOldRevisionCompacted

serathius avatar Aug 19 '25 06:08 serathius

/retest

serathius avatar Aug 19 '25 08:08 serathius

/retest

serathius avatar Aug 19 '25 09:08 serathius

/retest pull-etcd-robustness-arm64 /retest pull-etcd-robustness-amd64

serathius avatar Aug 19 '25 10:08 serathius

@serathius: The /retest command does not accept any targets. The following commands are available to trigger required jobs:

/test pull-etcd-build
/test pull-etcd-contrib-mixin
/test pull-etcd-coverage-report
/test pull-etcd-e2e-386
/test pull-etcd-e2e-amd64
/test pull-etcd-e2e-arm64
/test pull-etcd-fuzzing-v3rpc
/test pull-etcd-govulncheck-main
/test pull-etcd-grpcproxy-e2e-amd64
/test pull-etcd-grpcproxy-e2e-arm64
/test pull-etcd-grpcproxy-integration-amd64
/test pull-etcd-grpcproxy-integration-arm64
/test pull-etcd-integration-1-cpu-amd64
/test pull-etcd-integration-1-cpu-arm64
/test pull-etcd-integration-2-cpu-amd64
/test pull-etcd-integration-2-cpu-arm64
/test pull-etcd-integration-4-cpu-amd64
/test pull-etcd-integration-4-cpu-arm64
/test pull-etcd-markdown-lint
/test pull-etcd-release-tests
/test pull-etcd-robustness-amd64
/test pull-etcd-robustness-arm64
/test pull-etcd-unit-test-386
/test pull-etcd-unit-test-amd64
/test pull-etcd-unit-test-arm64
/test pull-etcd-verify

Use /test all to run the following jobs that were automatically triggered:

pull-etcd-build
pull-etcd-contrib-mixin
pull-etcd-coverage-report
pull-etcd-e2e-386
pull-etcd-e2e-amd64
pull-etcd-e2e-arm64
pull-etcd-fuzzing-v3rpc
pull-etcd-govulncheck-main
pull-etcd-grpcproxy-e2e-amd64
pull-etcd-grpcproxy-e2e-arm64
pull-etcd-grpcproxy-integration-amd64
pull-etcd-grpcproxy-integration-arm64
pull-etcd-integration-1-cpu-amd64
pull-etcd-integration-1-cpu-arm64
pull-etcd-integration-2-cpu-amd64
pull-etcd-integration-2-cpu-arm64
pull-etcd-integration-4-cpu-amd64
pull-etcd-integration-4-cpu-arm64
pull-etcd-release-tests
pull-etcd-robustness-amd64
pull-etcd-robustness-arm64
pull-etcd-unit-test-386
pull-etcd-unit-test-amd64
pull-etcd-unit-test-arm64
pull-etcd-verify

In response to this:

/retest pull-etcd-robustness-arm64 /retest pull-etcd-robustness-amd64

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot avatar Aug 19 '25 10:08 k8s-ci-robot

/test pull-etcd-robustness-arm64 /test pull-etcd-robustness-amd64

serathius avatar Aug 19 '25 10:08 serathius

@nwnt: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-etcd-coverage-report 17b929e6f721aaf241dc5cceb50877ba35101b31 link true /test pull-etcd-coverage-report
pull-etcd-robustness-arm64 17b929e6f721aaf241dc5cceb50877ba35101b31 link true /test pull-etcd-robustness-arm64
pull-etcd-robustness-amd64 17b929e6f721aaf241dc5cceb50877ba35101b31 link true /test pull-etcd-robustness-amd64

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

k8s-ci-robot avatar Aug 19 '25 10:08 k8s-ci-robot

From https://prow.k8s.io/view/gs/kubernetes-ci-logs/pr-logs/pull/etcd-io_etcd/20239/pull-etcd-robustness-arm64/1957745366933704704

{"level":"warn","ts":"2025-08-19T10:08:20.850392Z","caller":"etcdserver/server.go:854","msg":"server error","error":"the member has been permanently removed from the cluster"}
{"level":"warn","ts":"2025-08-19T10:08:20.850415Z","caller":"etcdserver/server.go:855","msg":"data-dir used by this member must be removed"}

Should I remove each process's data directory after scaling down during Inject?

nwnt avatar Aug 20 '25 03:08 nwnt

From https://prow.k8s.io/view/gs/kubernetes-ci-logs/pr-logs/pull/etcd-io_etcd/20239/pull-etcd-robustness-amd64/1957745366849818624

/home/prow/go/src/github.com/etcd-io/etcd/bin/etcd (TestRobustnessExploratoryEtcdTrafficDeleteLeasesClusterOfSize3-test-2) (24040): {"level":"warn","ts":"2025-08-19T10:20:00.008061Z","caller":"etcdserver/server.go:1639","msg":"rejecting member remove request; local member has not been connected to all peers, reconfigure breaks active quorum","local-member-id":"bf19ae4419db00dc","requested-member-remove":"eabdbb777cf498cb","active-peers":1,"error":"etcdserver: unhealthy cluster"}

This looks like the condition I should add to Available. What do you think @serathius?

nwnt avatar Aug 20 '25 03:08 nwnt

Should I remove each process's data directory after scaling down during Inject?

No, please don't read random logs.

This looks like the condition I should add to Available. What do you think @serathius?

No, this log is related to fact that membership are rejected if there is suspicion that cluster member are unstable. It requires at least 5 seconds of being connected to all members. This could be mitigated by adding 5 seconds sleep before membership changes.

serathius avatar Aug 20 '25 09:08 serathius

Please read MemberReplace failpoint, seems there is a some overlap with your case.

serathius avatar Aug 20 '25 09:08 serathius

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

github-actions[bot] avatar Oct 27 '25 00:10 github-actions[bot]