
[WIP] Bootstrap etcd from consistent_index

Open ahrtr opened this issue 3 weeks ago • 7 comments

Link to https://github.com/etcd-io/etcd/issues/20187

Still need to do some performance comparison (bootstrap time)

Note that this PR can also work around https://github.com/etcd-io/etcd/issues/20967. Users who run into that issue just need to create a single-member cluster first (with --force-new-cluster) and add the other members later. cc @thechristschn

ahrtr avatar Nov 28 '25 11:11 ahrtr

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ahrtr


k8s-ci-robot avatar Nov 28 '25 11:11 k8s-ci-robot

Codecov Report

:x: Patch coverage is 77.22772% with 23 lines in your changes missing coverage. Please review. :white_check_mark: Project coverage is 69.12%. Comparing base (7bbb4bd) to head (e644406).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| etcdutl/etcdutl/common.go | 0.00% | 5 Missing :warning: |
| server/etcdserver/api/membership/storev2.go | 16.66% | 5 Missing :warning: |
| server/etcdserver/bootstrap.go | 90.74% | 2 Missing and 3 partials :warning: |
| server/storage/wal/wal.go | 80.00% | 4 Missing :warning: |
| etcdutl/etcdutl/migrate_command.go | 0.00% | 2 Missing :warning: |
| server/storage/storage.go | 66.66% | 1 Missing :warning: |
| server/storage/util.go | 83.33% | 1 Missing :warning: |

Additional details and impacted files

| Files with missing lines | Coverage Δ | |
|---|---|---|
| server/etcdserver/server.go | 82.25% <100.00%> (+0.67%) | :arrow_up: |
| server/storage/wal/testing/waltesting.go | 60.86% <100.00%> (ø) | |
| server/verify/verify.go | 41.66% <100.00%> (+21.66%) | :arrow_up: |
| server/storage/storage.go | 62.79% <66.66%> (+9.84%) | :arrow_up: |
| server/storage/util.go | 83.56% <83.33%> (+1.98%) | :arrow_up: |
| etcdutl/etcdutl/migrate_command.go | 0.00% <0.00%> (ø) | |
| server/storage/wal/wal.go | 58.07% <80.00%> (+0.18%) | :arrow_up: |
| etcdutl/etcdutl/common.go | 0.00% <0.00%> (-46.16%) | :arrow_down: |
| server/etcdserver/api/membership/storev2.go | 58.82% <16.66%> (+5.88%) | :arrow_up: |
| server/etcdserver/bootstrap.go | 66.82% <90.74%> (+2.07%) | :arrow_up: |

... and 33 files with indirect coverage changes

@@            Coverage Diff             @@
##             main   #20981      +/-   ##
==========================================
+ Coverage   68.56%   69.12%   +0.56%     
==========================================
  Files         422      422              
  Lines       34841    34783      -58     
==========================================
+ Hits        23889    24045     +156     
+ Misses       9537     9343     -194     
+ Partials     1415     1395      -20     


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update 7bbb4bd...e644406.


codecov[bot] avatar Nov 28 '25 12:11 codecov[bot]

https://github.com/etcd-io/etcd/issues/20187#issuecomment-3601435298

Need to revisit the test cases below in the next step:

  • TestBreakConsistentIndexNewerThanSnapshot

ahrtr avatar Dec 02 '25 11:12 ahrtr

@ahrtr: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
|---|---|---|---|---|
| pull-etcd-grpcproxy-integration-amd64 | e64440600cbccc10786e75e5f764e0b584051161 | link | true | /test pull-etcd-grpcproxy-integration-amd64 |
| pull-etcd-grpcproxy-integration-arm64 | e64440600cbccc10786e75e5f764e0b584051161 | link | true | /test pull-etcd-grpcproxy-integration-arm64 |
| pull-etcd-robustness-amd64 | e64440600cbccc10786e75e5f764e0b584051161 | link | true | /test pull-etcd-robustness-amd64 |
| pull-etcd-robustness-arm64 | e64440600cbccc10786e75e5f764e0b584051161 | link | true | /test pull-etcd-robustness-arm64 |
| pull-etcd-coverage-report | e64440600cbccc10786e75e5f764e0b584051161 | link | true | /test pull-etcd-coverage-report |


k8s-ci-robot avatar Dec 02 '25 11:12 k8s-ci-robot

I don't think we need this change to fix the zombie member issue. The workaround should be to fetch a snapshot from the leader and restore it; the restore will clean that up.

fuweid avatar Dec 17 '25 17:12 fuweid

> I don't think we need this change to fix the zombie member issue. The workaround should be to fetch a snapshot from the leader and restore it; the restore will clean that up.

Please read https://github.com/etcd-io/etcd/issues/20967#issuecomment-3589944612

EDIT (2025-12-01): I realized that fixing this separately in release-3.6 is quite difficult because it still bootstraps from v2snapshot. Addressing the issue in release-3.6 would be almost equivalent to backporting https://github.com/etcd-io/etcd/pull/20981 to 3.6, which is what I have been trying to avoid. We should be good as long as etcd supports auto-syncing membership data between v2store and v3store as mentioned above.

ahrtr avatar Dec 17 '25 19:12 ahrtr

Not sure I fully understand it. For the zombie membership issue specifically, from my perspective, even if we load the WAL from consistent_index, the change has already been committed into v3store, and we can't generate a reverting change to remove the zombie. So it could not be a workaround for that issue, even if we backport this to v3.6.

fuweid avatar Dec 18 '25 00:12 fuweid
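
To make the concern in the last comment concrete, here is a toy model of replaying a log from a consistent index. It is only a sketch under stated assumptions: `Entry`, `Replay`, and the map-based member set are hypothetical names invented for illustration and are not etcd's WAL, raft, or membership APIs. It assumes that state persisted up to the consistent index is trusted as-is and that only newer entries are applied.

```go
package main

import "fmt"

// Entry is a simplified, hypothetical stand-in for a WAL record.
// It is not etcd's raftpb.Entry.
type Entry struct {
	Index uint64
	Op    string // "add" or "remove"
	ID    string
}

// Replay applies only entries newer than consistentIndex to members and
// returns how many entries were applied. Entries at or below
// consistentIndex are assumed to be reflected in members already.
func Replay(members map[string]bool, consistentIndex uint64, wal []Entry) int {
	applied := 0
	for _, e := range wal {
		if e.Index <= consistentIndex {
			continue // skipped, not undone
		}
		switch e.Op {
		case "add":
			members[e.ID] = true
		case "remove":
			delete(members, e.ID)
		}
		applied++
	}
	return applied
}

func main() {
	// Persisted state at consistent index 5; "zombie" was added at index 4.
	members := map[string]bool{"m1": true, "zombie": true}
	wal := []Entry{
		{Index: 4, Op: "add", ID: "zombie"},
		{Index: 5, Op: "add", ID: "m1"},
		{Index: 6, Op: "add", ID: "m2"},
	}
	n := Replay(members, 5, wal)
	fmt.Println(n, len(members), members["zombie"]) // prints: 1 3 true
}
```

In this model the member added at index 4 survives the replay, because skipping an entry does not revert its effect. Whether the real bootstrap path in the PR behaves this way for the zombie member case is exactly what the thread is debating.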