[WIP] Bootstrap etcd from consistent_index
Link to https://github.com/etcd-io/etcd/issues/20187
Still need to do some performance comparison (bootstrap time)
Note that this PR can also work around https://github.com/etcd-io/etcd/issues/20967. Once users run into the issue, they just need to create a single-member cluster first (with --force-new-cluster) and then add the other members later. cc @thechristschn
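For readers skimming the thread, here is a toy sketch of the general idea behind bootstrapping from consistent_index: WAL entries at or below the index are treated as already applied to the backend, and only later entries are replayed. This is not the code in this PR; the `entry` type and `entriesToReplay` helper are hypothetical names made up for illustration, and the sketch assumes the entries are sorted by ascending index.

```go
package main

import "fmt"

// entry is a hypothetical stand-in for a raft log entry stored in the WAL.
type entry struct {
	Index uint64
	Data  string
}

// entriesToReplay returns the entries whose index is greater than
// consistentIndex, i.e. the ones assumed not to be in the backend yet.
// It assumes wal is sorted by ascending Index.
func entriesToReplay(wal []entry, consistentIndex uint64) []entry {
	for i, e := range wal {
		if e.Index > consistentIndex {
			return wal[i:]
		}
	}
	return nil
}

func main() {
	wal := []entry{{5, "put a"}, {6, "put b"}, {7, "member add"}, {8, "put c"}}
	// With consistent_index = 6, only entries 7 and 8 are replayed.
	for _, e := range entriesToReplay(wal, 6) {
		fmt.Println(e.Index, e.Data)
	}
}
```

In this sketch, anything at or below the index is never revisited during bootstrap, so only entries after consistent_index affect the resulting state.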
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: ahrtr
Codecov Report
:x: Patch coverage is 77.22772% with 23 lines in your changes missing coverage. Please review.
:white_check_mark: Project coverage is 69.12%. Comparing base (7bbb4bd) to head (e644406).
Additional details and impacted files
| Files with missing lines | Coverage Δ | |
|---|---|---|
| server/etcdserver/server.go | 82.25% <100.00%> (+0.67%) | :arrow_up: |
| server/storage/wal/testing/waltesting.go | 60.86% <100.00%> (ø) | |
| server/verify/verify.go | 41.66% <100.00%> (+21.66%) | :arrow_up: |
| server/storage/storage.go | 62.79% <66.66%> (+9.84%) | :arrow_up: |
| server/storage/util.go | 83.56% <83.33%> (+1.98%) | :arrow_up: |
| etcdutl/etcdutl/migrate_command.go | 0.00% <0.00%> (ø) | |
| server/storage/wal/wal.go | 58.07% <80.00%> (+0.18%) | :arrow_up: |
| etcdutl/etcdutl/common.go | 0.00% <0.00%> (-46.16%) | :arrow_down: |
| server/etcdserver/api/membership/storev2.go | 58.82% <16.66%> (+5.88%) | :arrow_up: |
| server/etcdserver/bootstrap.go | 66.82% <90.74%> (+2.07%) | :arrow_up: |
... and 33 files with indirect coverage changes
Coverage Diff

| | main | #20981 | +/- |
|---|---|---|---|
| Coverage | 68.56% | 69.12% | +0.56% |
| Files | 422 | 422 | |
| Lines | 34841 | 34783 | -58 |
| Hits | 23889 | 24045 | +156 |
| Misses | 9537 | 9343 | -194 |
| Partials | 1415 | 1395 | -20 |
https://github.com/etcd-io/etcd/issues/20187#issuecomment-3601435298
Need to revisit the test case below in the next step:
TestBreakConsistentIndexNewerThanSnapshot
@ahrtr: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:
| Test name | Commit | Details | Required | Rerun command |
|---|---|---|---|---|
| pull-etcd-grpcproxy-integration-amd64 | e64440600cbccc10786e75e5f764e0b584051161 | link | true | /test pull-etcd-grpcproxy-integration-amd64 |
| pull-etcd-grpcproxy-integration-arm64 | e64440600cbccc10786e75e5f764e0b584051161 | link | true | /test pull-etcd-grpcproxy-integration-arm64 |
| pull-etcd-robustness-amd64 | e64440600cbccc10786e75e5f764e0b584051161 | link | true | /test pull-etcd-robustness-amd64 |
| pull-etcd-robustness-arm64 | e64440600cbccc10786e75e5f764e0b584051161 | link | true | /test pull-etcd-robustness-arm64 |
| pull-etcd-coverage-report | e64440600cbccc10786e75e5f764e0b584051161 | link | true | /test pull-etcd-coverage-report |
I don't think we need this change to fix the zombie member issue. The workaround should be to fetch a snapshot from the leader and restore it; the restore will clean that up.
Please read https://github.com/etcd-io/etcd/issues/20967#issuecomment-3589944612
EDIT (2025-12-01): I realized that fixing this separately in release-3.6 is quite difficult because it still bootstraps from v2snapshot. Addressing the issue in release-3.6 would be almost equivalent to backporting https://github.com/etcd-io/etcd/pull/20981 to 3.6, which is what I have been trying to avoid. We should be good as long as etcd supports auto-syncing membership data between v2store and v3store as mentioned above.
I'm not sure I fully understand it. For the zombie membership issue alone, from my perspective, even if we load the WAL from consistent_index, the change has already been committed to v3store, and we can't generate a reverting change to remove the zombie. So it could not be a workaround for that issue, even if we could backport this to v3.6.