etcd icon indicating copy to clipboard operation
etcd copied to clipboard

3.6.0: server fails to start, nil pointer exception when re-using storage with `force-new-cluster`

Open reneradoi opened this issue 7 months ago • 8 comments

Bug report criteria

  • [x] This bug report is not security related, security issues should be disclosed privately via etcd maintainers.
  • [x] This is not a support request or question, support requests or questions should be raised in the etcd discussion forums.
  • [x] You have read the etcd bug reporting guidelines.
  • [x] Existing open issues along with etcd frequently asked questions have been checked and this is not a duplicate.

What happened?

Hi etcd Team!

After etcd 3.6 has been released, I am evaluating it against my team's disaster recovery mechanisms. One of the scenarios is "storage re-use":

  • scale down an existing etcd cluster (multiple members)
  • mount the storage volume that includes the former cluster leader's data directory to a new machine
  • start etcd on that new machine with existing data and --force-new-cluster

This has worked fine until 3.5.21, but startup of etcd on the new machine crashes with 3.6. This is the error message in the logfile:

{"level":"panic","ts":"2025-05-22T08:11:19.786312Z","caller":"schedule/schedule.go:202","msg":"execute job failed","job":"server_applyAll","panic":"runtime error: invalid memory address or nil pointer dereference","stacktrace":"go.etcd.io/etcd/pkg/v3/schedule.(*fifo).executeJob.func1\n\tgo.etcd.io/etcd/pkg/[email protected]/schedule/schedule.go:202\nruntime.gopanic\n\truntime/panic.go:791\nruntime.panicmem\n\truntime/panic.go:262\nruntime.sigpanic\n\truntime/signal_unix.go:917\ngo.etcd.io/etcd/server/v3/etcdserver/api/membership.(*RaftCluster).PromoteMember\n\tgo.etcd.io/etcd/server/v3/etcdserver/api/membership/cluster.go:525\ngo.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).applyConfChange\n\tgo.etcd.io/etcd/server/v3/etcdserver/server.go:2103\ngo.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).apply\n\tgo.etcd.io/etcd/server/v3/etcdserver/server.go:1902\ngo.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).applyEntries\n\tgo.etcd.io/etcd/server/v3/etcdserver/server.go:1194\ngo.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).applyAll\n\tgo.etcd.io/etcd/server/v3/etcdserver/server.go:979\ngo.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).run.func6\n\tgo.etcd.io/etcd/server/v3/etcdserver/server.go:855\ngo.etcd.io/etcd/pkg/v3/schedule.job.Do\n\tgo.etcd.io/etcd/pkg/[email protected]/schedule/schedule.go:41\ngo.etcd.io/etcd/pkg/v3/schedule.(*fifo).executeJob\n\tgo.etcd.io/etcd/pkg/[email protected]/schedule/schedule.go:206\ngo.etcd.io/etcd/pkg/v3/schedule.(*fifo).run\n\tgo.etcd.io/etcd/pkg/[email protected]/schedule/schedule.go:187"}

I am aware that forcing a new cluster from existing data is not the preferred approach for disaster recovery (we also use backup/restore). But as this worked with previous versions, it might be worth to investigate.

What did you expect to happen?

Etcd on the new machine should start and update its membership configuration, as described in the disaster recovery docs: Alternatively, when starting up etcd, you can supply --force-new-cluster to overwrite cluster membership while keeping existing application data.

How can we reproduce it (as minimally and precisely as possible)?

  • create an etcd cluster with default configuration and static discovery, as described in the clustering guide docs
  • stop etcd on all machines
  • mount the volume with the leader's data-directory to a new machine (or copy the data directory to the new machine)
  • start etcd on the new machine with default configuration + force-new-cluster
  • etcd does not start on the new machine

Anything else we need to know?

No response

Etcd version (please run commands below)

$ etcd --version
3.6.0

$ etcdctl version
etcdctl version: 3.6.0
API version: 3.6

</details>


### Etcd configuration (command line flags or environment variables)

<details>

# paste your configuration here

</details>


### Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)

<details>

```console
$ etcdctl member list -w table
# paste output here

$ etcdctl --endpoints=<member list> endpoint status -w table
# paste output here

Relevant log output

{"level":"warn","ts":"2025-05-22T08:11:19.701122Z","caller":"embed/config.go:1209","msg":"Running http and grpc server on single port. This is not recommended for production."}
{"level":"info","ts":"2025-05-22T08:11:19.701597Z","caller":"etcdmain/config.go:176","msg":"loaded server configuration, other configuration command line flags and environment variables will be ignored if provided","path":"/var/lib/etcd/etcd.conf.yml"}
{"level":"warn","ts":"2025-05-22T08:11:19.701644Z","caller":"etcdmain/config.go:270","msg":"--snapshot-count is deprecated in 3.6 and will be decommissioned in 3.7."}
{"level":"warn","ts":"2025-05-22T08:11:19.701665Z","caller":"etcdmain/config.go:270","msg":"--max-snapshots is deprecated in 3.6 and will be decommissioned in 3.7."}
{"level":"info","ts":"2025-05-22T08:11:19.701698Z","caller":"etcdmain/etcd.go:64","msg":"Running: ","args":["/bin/etcd"]}
{"level":"info","ts":"2025-05-22T08:11:19.701799Z","caller":"etcdmain/etcd.go:107","msg":"server has already been initialized","data-dir":"/var/lib/etcd/etcd","dir-type":"member"}
{"level":"warn","ts":"2025-05-22T08:11:19.701848Z","caller":"embed/config.go:1209","msg":"Running http and grpc server on single port. This is not recommended for production."}
{"level":"info","ts":"2025-05-22T08:11:19.701861Z","caller":"embed/etcd.go:138","msg":"configuring peer listeners","listen-peer-urls":["http://10.136.16.10:2380"]}
{"level":"info","ts":"2025-05-22T08:11:19.706525Z","caller":"embed/etcd.go:146","msg":"configuring client listeners","listen-client-urls":["http://10.136.16.10:2379"]}
{"level":"info","ts":"2025-05-22T08:11:19.706584Z","caller":"embed/etcd.go:658","msg":"pprof is enabled","path":"/debug/pprof"}
{"level":"info","ts":"2025-05-22T08:11:19.706792Z","caller":"embed/etcd.go:323","msg":"starting an etcd server","etcd-version":"3.6.0","git-sha":"f5d605a93","go-version":"go1.23.9","go-os":"linux","go-arch":"amd64","max-cpu-set":8,"max-cpu-available":8,"member-initialized":true,"name":"etcd3","data-dir":"/var/lib/etcd/etcd","wal-dir":"","wal-dir-dedicated":"","member-dir":"/var/lib/etcd/etcd/member","force-new-cluster":true,"heartbeat-interval":"100ms","election-timeout":"1s","initial-election-tick-advance":true,"snapshot-count":10000,"max-wals":5,"max-snapshots":5,"snapshot-catchup-entries":5000,"initial-advertise-peer-urls":["http://10.136.16.10:2380"],"listen-peer-urls":["http://10.136.16.10:2380"],"advertise-client-urls":["http://10.136.16.10:2379"],"listen-client-urls":["http://10.136.16.10:2379"],"listen-metrics-urls":["http://10.136.16.10:9100"],"experimental-local-address":"","cors":["*"],"host-whitelist":["*"],"initial-cluster":"","initial-cluster-state":"new","initial-cluster-token":"","quota-backend-bytes":2147483648,"max-request-bytes":1572864,"max-concurrent-streams":4294967295,"pre-vote":true,"feature-gates":"","initial-corrupt-check":false,"corrupt-check-time-interval":"0s","compact-check-time-interval":"1m0s","auto-compaction-mode":"periodic","auto-compaction-retention":"1h0m0s","auto-compaction-interval":"1h0m0s","discovery-url":"","discovery-proxy":"","discovery-token":"","discovery-endpoints":"","discovery-dial-timeout":"2s","discovery-request-timeout":"5s","discovery-keepalive-time":"2s","discovery-keepalive-timeout":"6s","discovery-insecure-transport":true,"discovery-insecure-skip-tls-verify":false,"discovery-cert":"","discovery-key":"","discovery-cacert":"","discovery-user":"","downgrade-check-interval":"5s","max-learners":1,"v2-deprecation":"write-only"}
{"level":"warn","ts":"2025-05-22T08:11:19.706875Z","caller":"fileutil/fileutil.go:56","msg":"check file permission","error":"directory \"/var/lib/etcd/etcd\" exist, but the permission is \"drwxr-x---\". The recommended permission is \"-rwx------\" to prevent possible unprivileged access to the data"}
{"level":"warn","ts":"2025-05-22T08:11:19.707001Z","caller":"fileutil/fileutil.go:56","msg":"check file permission","error":"directory \"/var/lib/etcd/etcd/member\" exist, but the permission is \"drwxr-x---\". The recommended permission is \"-rwx------\" to prevent possible unprivileged access to the data"}
{"level":"warn","ts":"2025-05-22T08:11:19.707153Z","caller":"fileutil/fileutil.go:56","msg":"check file permission","error":"directory \"/var/lib/etcd/etcd/member/snap\" exist, but the permission is \"drwxr-x---\". The recommended permission is \"-rwx------\" to prevent possible unprivileged access to the data"}
{"level":"info","ts":"2025-05-22T08:11:19.733653Z","logger":"bbolt","caller":"backend/backend.go:203","msg":"Opening db file (/var/lib/etcd/etcd/member/snap/db) with mode -rw------- and with options: {Timeout: 0s, NoGrowSync: false, NoFreelistSync: true, PreLoadFreelist: false, FreelistType: hashmap, ReadOnly: false, MmapFlags: 8000, InitialMmapSize: 10737418240, PageSize: 0, NoSync: false, OpenFile: 0x0, Mlock: false, Logger: 0xc00018c558}"}
{"level":"info","ts":"2025-05-22T08:11:19.735628Z","logger":"bbolt","caller":"[email protected]/db.go:321","msg":"Opening bbolt db (/var/lib/etcd/etcd/member/snap/db) successfully"}
{"level":"info","ts":"2025-05-22T08:11:19.735687Z","caller":"storage/backend.go:80","msg":"opened backend db","path":"/var/lib/etcd/etcd/member/snap/db","took":"24.843354ms"}
{"level":"info","ts":"2025-05-22T08:11:19.735743Z","caller":"etcdserver/bootstrap.go:219","msg":"restore consistentIndex","index":781}
{"level":"info","ts":"2025-05-22T08:11:19.736951Z","caller":"etcdserver/bootstrap.go:440","msg":"No snapshot found. Recovering WAL from scratch!"}
{"level":"info","ts":"2025-05-22T08:11:19.736993Z","caller":"etcdserver/bootstrap.go:231","msg":"recovered v3 backend","backend-size-bytes":49152,"backend-size":"49 kB","backend-size-in-use-bytes":36864,"backend-size-in-use":"37 kB"}
{"level":"info","ts":"2025-05-22T08:11:19.741079Z","caller":"etcdserver/bootstrap.go:584","msg":"restarting local member","cluster-id":"1fff5ae36329cc03","local-member-id":"7bfbc310c445a737","commit-index":781}
{"level":"info","ts":"2025-05-22T08:11:19.741113Z","caller":"etcdserver/bootstrap.go:93","msg":"bootstrapping cluster"}
{"level":"info","ts":"2025-05-22T08:11:19.741202Z","caller":"etcdserver/bootstrap.go:100","msg":"bootstrapping storage"}
{"level":"info","ts":"2025-05-22T08:11:19.741327Z","caller":"membership/cluster.go:587","msg":"Detected member only in v3store but missing in v2store","member":"{ID:7bfbc310c445a737 RaftAttributes:{PeerURLs:[http://10.136.16.157:2380] IsLearner:false} Attributes:{Name:etcd4 ClientURLs:[http://10.136.16.157:2379]}}"}
{"level":"info","ts":"2025-05-22T08:11:19.751143Z","caller":"api/capability.go:76","msg":"enabled capabilities for version","cluster-version":"3.6"}
{"level":"info","ts":"2025-05-22T08:11:19.751160Z","caller":"membership/cluster.go:297","msg":"recovered/added member from store","cluster-id":"1fff5ae36329cc03","local-member-id":"7bfbc310c445a737","recovered-remote-peer-id":"7bfbc310c445a737","recovered-remote-peer-urls":["http://10.136.16.157:2380"],"recovered-remote-peer-is-learner":false}
{"level":"info","ts":"2025-05-22T08:11:19.751189Z","caller":"membership/cluster.go:307","msg":"set cluster version from store","cluster-version":"3.6"}
{"level":"info","ts":"2025-05-22T08:11:19.751199Z","caller":"etcdserver/bootstrap.go:108","msg":"bootstrapping raft"}
{"level":"info","ts":"2025-05-22T08:11:19.751251Z","caller":"etcdserver/server.go:312","msg":"bootstrap successfully"}
{"level":"info","ts":"2025-05-22T08:11:19.751350Z","logger":"raft","caller":"[email protected]/raft.go:1981","msg":"7bfbc310c445a737 switched to configuration voters=()"}
{"level":"info","ts":"2025-05-22T08:11:19.757086Z","logger":"raft","caller":"[email protected]/raft.go:897","msg":"7bfbc310c445a737 became follower at term 4"}
{"level":"info","ts":"2025-05-22T08:11:19.757101Z","logger":"raft","caller":"[email protected]/raft.go:493","msg":"newRaft 7bfbc310c445a737 [peers: [], term: 4, commit: 781, applied: 0, lastindex: 781, lastterm: 4]"}
{"level":"warn","ts":"2025-05-22T08:11:19.758860Z","caller":"auth/store.go:1135","msg":"simple token is not cryptographically signed"}
{"level":"info","ts":"2025-05-22T08:11:19.763123Z","caller":"mvcc/kvstore.go:408","msg":"kvstore restored","current-rev":307}
{"level":"info","ts":"2025-05-22T08:11:19.765866Z","caller":"storage/quota.go:93","msg":"enabled backend quota with default value","quota-name":"v3-applier","quota-size-bytes":2147483648,"quota-size":"2.1 GB"}
{"level":"info","ts":"2025-05-22T08:11:19.765938Z","caller":"etcdserver/server.go:589","msg":"starting etcd server","local-member-id":"7bfbc310c445a737","local-server-version":"3.6.0","cluster-id":"1fff5ae36329cc03","cluster-version":"3.6"}
{"level":"info","ts":"2025-05-22T08:11:19.766138Z","caller":"etcdserver/server.go:483","msg":"started as single-node; fast-forwarding election ticks","local-member-id":"7bfbc310c445a737","forward-ticks":9,"forward-duration":"900ms","election-ticks":10,"election-timeout":"1s"}
{"level":"info","ts":"2025-05-22T08:11:19.766506Z","caller":"embed/etcd.go:292","msg":"now serving peer/client/metrics","local-member-id":"7bfbc310c445a737","initial-advertise-peer-urls":["http://10.136.16.10:2380"],"listen-peer-urls":["http://10.136.16.10:2380"],"advertise-client-urls":["http://10.136.16.10:2379"],"listen-client-urls":["http://10.136.16.10:2379"],"listen-metrics-urls":["http://10.136.16.10:9100"]}
{"level":"info","ts":"2025-05-22T08:11:19.766609Z","caller":"embed/etcd.go:890","msg":"serving metrics","address":"http://10.136.16.10:9100"}
{"level":"info","ts":"2025-05-22T08:11:19.769879Z","caller":"fileutil/purge.go:49","msg":"started to purge file","dir":"/var/lib/etcd/etcd/member/snap","suffix":"snap.db","max":5,"interval":"30s"}
{"level":"info","ts":"2025-05-22T08:11:19.770022Z","caller":"fileutil/purge.go:49","msg":"started to purge file","dir":"/var/lib/etcd/etcd/member/snap","suffix":"snap","max":5,"interval":"30s"}
{"level":"info","ts":"2025-05-22T08:11:19.770040Z","caller":"fileutil/purge.go:49","msg":"started to purge file","dir":"/var/lib/etcd/etcd/member/wal","suffix":"wal","max":5,"interval":"30s"}
{"level":"info","ts":"2025-05-22T08:11:19.774132Z","caller":"embed/etcd.go:640","msg":"serving peer traffic","address":"10.136.16.10:2380"}
{"level":"info","ts":"2025-05-22T08:11:19.774185Z","caller":"embed/etcd.go:611","msg":"cmux::serve","address":"10.136.16.10:2380"}
{"level":"info","ts":"2025-05-22T08:11:19.777709Z","logger":"raft","caller":"[email protected]/raft.go:1981","msg":"7bfbc310c445a737 switched to configuration voters=(16208681750449579538)"}
{"level":"info","ts":"2025-05-22T08:11:19.777872Z","caller":"membership/cluster.go:433","msg":"ignore already added member","cluster-id":"1fff5ae36329cc03","local-member-id":"7bfbc310c445a737","added-peer-id":"e0f0ce2cbce76612","added-peer-peer-urls":["http://10.136.16.101:2380"],"added-peer-is-learner":false}
{"level":"info","ts":"2025-05-22T08:11:19.777901Z","caller":"rafthttp/peer.go:133","msg":"starting remote peer","remote-peer-id":"e0f0ce2cbce76612"}
{"level":"info","ts":"2025-05-22T08:11:19.777942Z","caller":"rafthttp/pipeline.go:72","msg":"started HTTP pipelining with remote peer","local-member-id":"7bfbc310c445a737","remote-peer-id":"e0f0ce2cbce76612"}
{"level":"info","ts":"2025-05-22T08:11:19.778255Z","caller":"rafthttp/stream.go:168","msg":"started stream writer with remote peer","local-member-id":"7bfbc310c445a737","remote-peer-id":"e0f0ce2cbce76612"}
{"level":"info","ts":"2025-05-22T08:11:19.779488Z","caller":"rafthttp/stream.go:168","msg":"started stream writer with remote peer","local-member-id":"7bfbc310c445a737","remote-peer-id":"e0f0ce2cbce76612"}
{"level":"info","ts":"2025-05-22T08:11:19.779700Z","caller":"rafthttp/stream.go:394","msg":"started stream reader with remote peer","stream-reader-type":"stream MsgApp v2","local-member-id":"7bfbc310c445a737","remote-peer-id":"e0f0ce2cbce76612"}
{"level":"info","ts":"2025-05-22T08:11:19.779690Z","caller":"rafthttp/peer.go:137","msg":"started remote peer","remote-peer-id":"e0f0ce2cbce76612"}
{"level":"info","ts":"2025-05-22T08:11:19.779780Z","caller":"rafthttp/stream.go:394","msg":"started stream reader with remote peer","stream-reader-type":"stream Message","local-member-id":"7bfbc310c445a737","remote-peer-id":"e0f0ce2cbce76612"}
{"level":"info","ts":"2025-05-22T08:11:19.779818Z","caller":"rafthttp/transport.go:317","msg":"added remote peer","local-member-id":"7bfbc310c445a737","remote-peer-id":"e0f0ce2cbce76612","remote-peer-urls":["http://10.136.16.101:2380"]}
{"level":"warn","ts":"2025-05-22T08:11:19.779877Z","caller":"membership/cluster.go:511","msg":"skipped attributes update of removed member","cluster-id":"1fff5ae36329cc03","local-member-id":"7bfbc310c445a737","updated-peer-id":"e0f0ce2cbce76612"}
{"level":"info","ts":"2025-05-22T08:11:19.779969Z","caller":"membership/cluster.go:646","msg":"updated cluster version","cluster-id":"1fff5ae36329cc03","local-member-id":"7bfbc310c445a737","from":"3.6","to":"3.6"}
{"level":"info","ts":"2025-05-22T08:11:19.780273Z","logger":"raft","caller":"[email protected]/raft.go:1981","msg":"7bfbc310c445a737 switched to configuration voters=(16208681750449579538) learners=(8933948762599302967)"}
{"level":"info","ts":"2025-05-22T08:11:19.780344Z","caller":"membership/cluster.go:433","msg":"ignore already added member","cluster-id":"1fff5ae36329cc03","local-member-id":"7bfbc310c445a737","added-peer-id":"7bfbc310c445a737","added-peer-peer-urls":["http://10.136.16.184:2380"],"added-peer-is-learner":true}
{"level":"info","ts":"2025-05-22T08:11:19.780506Z","logger":"raft","caller":"[email protected]/raft.go:1981","msg":"7bfbc310c445a737 switched to configuration voters=(8933948762599302967 16208681750449579538)"}
{"level":"info","ts":"2025-05-22T08:11:19.780568Z","caller":"membership/cluster.go:558","msg":"ignore already promoted member","cluster-id":"1fff5ae36329cc03","local-member-id":"7bfbc310c445a737"}
{"level":"info","ts":"2025-05-22T08:11:19.780651Z","logger":"raft","caller":"[email protected]/raft.go:1981","msg":"7bfbc310c445a737 switched to configuration voters=(8933948762599302967 16208681750449579538) learners=(4346234090465087378)"}
{"level":"info","ts":"2025-05-22T08:11:19.780701Z","caller":"membership/cluster.go:433","msg":"ignore already added member","cluster-id":"1fff5ae36329cc03","local-member-id":"7bfbc310c445a737","added-peer-id":"3c50ece0c0dd4f92","added-peer-peer-urls":["http://10.136.16.81:2380"],"added-peer-is-learner":true}
{"level":"info","ts":"2025-05-22T08:11:19.780719Z","caller":"rafthttp/peer.go:133","msg":"starting remote peer","remote-peer-id":"3c50ece0c0dd4f92"}
{"level":"info","ts":"2025-05-22T08:11:19.780749Z","caller":"rafthttp/pipeline.go:72","msg":"started HTTP pipelining with remote peer","local-member-id":"7bfbc310c445a737","remote-peer-id":"3c50ece0c0dd4f92"}
{"level":"info","ts":"2025-05-22T08:11:19.784958Z","caller":"rafthttp/stream.go:168","msg":"started stream writer with remote peer","local-member-id":"7bfbc310c445a737","remote-peer-id":"3c50ece0c0dd4f92"}
{"level":"info","ts":"2025-05-22T08:11:19.784973Z","caller":"rafthttp/stream.go:168","msg":"started stream writer with remote peer","local-member-id":"7bfbc310c445a737","remote-peer-id":"3c50ece0c0dd4f92"}
{"level":"info","ts":"2025-05-22T08:11:19.786085Z","caller":"rafthttp/peer.go:137","msg":"started remote peer","remote-peer-id":"3c50ece0c0dd4f92"}
{"level":"info","ts":"2025-05-22T08:11:19.786129Z","caller":"rafthttp/transport.go:317","msg":"added remote peer","local-member-id":"7bfbc310c445a737","remote-peer-id":"3c50ece0c0dd4f92","remote-peer-urls":["http://10.136.16.81:2380"]}
{"level":"warn","ts":"2025-05-22T08:11:19.786154Z","caller":"membership/cluster.go:511","msg":"skipped attributes update of removed member","cluster-id":"1fff5ae36329cc03","local-member-id":"7bfbc310c445a737","updated-peer-id":"3c50ece0c0dd4f92"}
{"level":"info","ts":"2025-05-22T08:11:19.786160Z","caller":"rafthttp/stream.go:394","msg":"started stream reader with remote peer","stream-reader-type":"stream MsgApp v2","local-member-id":"7bfbc310c445a737","remote-peer-id":"3c50ece0c0dd4f92"}
{"level":"info","ts":"2025-05-22T08:11:19.786204Z","caller":"rafthttp/stream.go:394","msg":"started stream reader with remote peer","stream-reader-type":"stream Message","local-member-id":"7bfbc310c445a737","remote-peer-id":"3c50ece0c0dd4f92"}
{"level":"info","ts":"2025-05-22T08:11:19.786237Z","logger":"raft","caller":"[email protected]/raft.go:1981","msg":"7bfbc310c445a737 switched to configuration voters=(4346234090465087378 8933948762599302967 16208681750449579538)"}
{"level":"panic","ts":"2025-05-22T08:11:19.786312Z","caller":"schedule/schedule.go:202","msg":"execute job failed","job":"server_applyAll","panic":"runtime error: invalid memory address or nil pointer dereference","stacktrace":"go.etcd.io/etcd/pkg/v3/schedule.(*fifo).executeJob.func1\n\tgo.etcd.io/etcd/pkg/[email protected]/schedule/schedule.go:202\nruntime.gopanic\n\truntime/panic.go:791\nruntime.panicmem\n\truntime/panic.go:262\nruntime.sigpanic\n\truntime/signal_unix.go:917\ngo.etcd.io/etcd/server/v3/etcdserver/api/membership.(*RaftCluster).PromoteMember\n\tgo.etcd.io/etcd/server/v3/etcdserver/api/membership/cluster.go:525\ngo.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).applyConfChange\n\tgo.etcd.io/etcd/server/v3/etcdserver/server.go:2103\ngo.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).apply\n\tgo.etcd.io/etcd/server/v3/etcdserver/server.go:1902\ngo.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).applyEntries\n\tgo.etcd.io/etcd/server/v3/etcdserver/server.go:1194\ngo.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).applyAll\n\tgo.etcd.io/etcd/server/v3/etcdserver/server.go:979\ngo.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).run.func6\n\tgo.etcd.io/etcd/server/v3/etcdserver/server.go:855\ngo.etcd.io/etcd/pkg/v3/schedule.job.Do\n\tgo.etcd.io/etcd/pkg/[email protected]/schedule/schedule.go:41\ngo.etcd.io/etcd/pkg/v3/schedule.(*fifo).executeJob\n\tgo.etcd.io/etcd/pkg/[email protected]/schedule/schedule.go:206\ngo.etcd.io/etcd/pkg/v3/schedule.(*fifo).run\n\tgo.etcd.io/etcd/pkg/[email protected]/schedule/schedule.go:187"}

reneradoi avatar May 22 '25 11:05 reneradoi

Assigning high priority to triage it as we just released v3.6

/cc @ahrtr

serathius avatar May 22 '25 11:05 serathius

Thanks @reneradoi .

Based on my quick triage, it might be a corner case specific to --force-new-cluster. etcd will automatically remove all other members on startup if --force-new-cluster is set, but if the members (to be removed) were not synced to v2store yet. Then etcd will crash when it tries to remove non-exist members. This isn't verified yet. @fuweid do you get time to reproduce & fix this? thx. If not, I will take a look next week.

ahrtr avatar May 22 '25 12:05 ahrtr

There are two issues.

HardState isn't always correctly synced to WAL file

All releases (including 3.4, 3.5, 3.6 and main) have this issue.

It may cause etcd to run into a situation ApplyIndex (consistent_index) > CommitIndex (read from WAL file). Eventually it may cause --force-new-cluster can't successfully remove all other members. Read https://github.com/etcd-io/raft/pull/300 to get more details.

Inconsistency between v2 and v3store.

Only release-3.6 and main have this isue.

Currently 3.6 still bootstrap from v2store (the latest snapshot), but it always operates the data of the latest state (read from v3store). It isn't a problem for v3store, because etcd will skip it when shouldApplyV3 == false. But it causes issue for v2store when replaying WAL files during bootstrap, because it may be inconsistent with v3store (note it isn't an issue during bootstrap phase). If the member being operated doesn't exist yet or already been removed, then it will run into nil pointer panicking.

ahrtr avatar May 26 '25 15:05 ahrtr

I just realized that backporting https://github.com/etcd-io/etcd/pull/20055 to release-3.5 and release-3.4 is a bit tricky, as some dependencies are missing in those branches.

Would anyone be able to help with backporting it to 3.5 and 3.4? Thanks in advance!

ahrtr avatar May 30 '25 22:05 ahrtr

@reneradoi etcd v3.6.1 is just released. It includes fix to this issue. Please feel free to try it out, and feedback if you can still reproduce this issue, thx

https://github.com/etcd-io/etcd/releases/tag/v3.6.1

ahrtr avatar Jun 06 '25 20:06 ahrtr

Thank you @ahrtr , really appreciate how quickly you all worked on this! I will pick up my work again in the next few days and give feedback here.

Thank you again!

reneradoi avatar Jun 10 '25 05:06 reneradoi

@ahrtr Thank you again for delivering this fix. I can not reproduce the issue anymore with v3.6.1, so we can consider it solved.

Great work!

reneradoi avatar Jun 13 '25 15:06 reneradoi

@ahrtr Thank you again for delivering this fix. I can not reproduce the issue anymore with v3.6.1, so we can consider it solved.

Great work!

Great! Thanks for the confirmation.

ahrtr avatar Jun 13 '25 15:06 ahrtr

Resolved in both 3.6 (3.6.1) and 3.5 ( 3.5.22 to be released).

Decided not to backport to 3.4. It needs big effort on the test refactoring. It should be fine, we should consider only backporting security and critical/major bug fixes to 3.4.

ahrtr avatar Jul 17 '25 18:07 ahrtr

@ahrtr, should we start planning release 3.5.22? Last time we released 3.5 was back in March, although there are no high vulnerabilities in the image, this could be the trigger to build the release.

ivanvc avatar Jul 18 '25 05:07 ivanvc

@ahrtr, should we start planning release 3.5.22?

I want to get the fix to https://github.com/etcd-io/etcd/issues/20351 included in 3.5.22, so that it can land into next K8s 1.33.x patch before we land 3.6.3 into K8s 1.34.0

Pls let wait one more week.

ahrtr avatar Jul 18 '25 08:07 ahrtr

Pls let wait one more week.

I'll start planning the release, as we also have merged #20290, which needs to be released soon. So, I'll open the issue to start gathering everything for the release.

ivanvc avatar Jul 18 '25 18:07 ivanvc