etcd
etcd copied to clipboard
Wrong raft messages may cause etcd panic
Bug report criteria
- [X] This bug report is not security related, security issues should be disclosed privately via [email protected].
- [X] This is not a support request or question, support requests or questions should be raised in the etcd discussion forums.
- [X] You have read the etcd bug reporting guidelines.
- [X] Existing open issues along with etcd frequently asked questions have been checked and this is not a duplicate.
What happened?
Multiple etcd servers repeatedly panic with a message "tocommit(4432450) is out of range [lastIndex(4432444)]. Was the raft log corrupted, truncated, or lost?"
The first panic occurred one month after the etcd started running. At that time, there were some network issues, leading to TCP reconnections.
What did you expect to happen?
No panic.
How can we reproduce it (as minimally and precisely as possible)?
It's hard to reproduce the panic in real world, but we have reproduced it in a unit test.
Anything else we need to know?
After investigation, we suspect that a malfunctioning switch or a similar issue may be dispatching incorrect Raft messages, leading to the panic. Currently, etcd does not check the To field in messages.
Etcd version (please run commands below)
We (tikv/pd) use embed etcd v3.4.21 https://github.com/etcd-io/etcd/commit/85b640cee793
Etcd configuration (command line flags or environment variables)
Default settings.
Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)
$ etcdctl member list -w table
# paste output here
$ etcdctl --endpoints=<member list> endpoint status -w table
# paste output here
Relevant log output
[2023/10/30 14:48:34.402 +08:00] [WARN] [stream.go:266] ["closed TCP streaming connection with remote peer"] [stream-writer-type="stream MsgApp v2"] [local-member-id=d4c1cc21d4c817ca] [remote-peer-id=18cf5532a87c6527]
[2023/10/30 14:48:34.402 +08:00] [WARN] [stream.go:277] ["established TCP streaming connection with remote peer"] [stream-writer-type="stream MsgApp v2"] [local-member-id=d4c1cc21d4c817ca] [remote-peer-id=18cf5532a87c6527]
[2023/10/30 14:48:34.409 +08:00] [INFO] [stream.go:425] ["established TCP streaming connection with remote peer"] [stream-reader-type="stream MsgApp v2"] [local-member-id=d4c1cc21d4c817ca] [remote-peer-id=18cf5532a87c6527]
[2023/10/30 14:48:34.409 +08:00] [INFO] [stream.go:425] ["established TCP streaming connection with remote peer"] [stream-reader-type="stream MsgApp v2"] [local-member-id=d4c1cc21d4c817ca] [remote-peer-id=1baed0b840546f7]
[2023/10/30 14:48:34.410 +08:00] [INFO] [stream.go:425] ["established TCP streaming connection with remote peer"] [stream-reader-type="stream Message"] [local-member-id=d4c1cc21d4c817ca] [remote-peer-id=18cf5532a87c6527]
[2023/10/30 14:48:34.410 +08:00] [INFO] [stream.go:425] ["established TCP streaming connection with remote peer"] [stream-reader-type="stream Message"] [local-member-id=d4c1cc21d4c817ca] [remote-peer-id=1baed0b840546f7]
[2023/10/30 14:48:34.419 +08:00] [PANIC] [log.go:203] ["tocommit(4432450) is out of range [lastIndex(4432444)]. Was the raft log corrupted, truncated, or lost?"]
...
[2023/10/30 14:48:34.419 +08:00] [PANIC] [log.go:203] ["tocommit(4432450) is out of range [lastIndex(4432444)]. Was the raft log corrupted, truncated, or lost?"]
...
[2023/10/30 14:48:54.770 +08:00] [PANIC] [log.go:203] ["tocommit(4432459) is out of range [lastIndex(4432444)]. Was the raft log corrupted, truncated, or lost?"]
...
[2023/10/30 14:49:13.070 +08:00] [PANIC] [log.go:203] ["tocommit(4432492) is out of range [lastIndex(4432444)]. Was the raft log corrupted, truncated, or lost?"]
...
[2023/10/30 14:49:31.681 +08:00] [PANIC] [log.go:203] ["tocommit(4432521) is out of range [lastIndex(4432444)]. Was the raft log corrupted, truncated, or lost?"]
...
[2023/10/30 14:49:49.816 +08:00] [PANIC] [log.go:203] ["tocommit(4432522) is out of range [lastIndex(4432444)]. Was the raft log corrupted, truncated, or lost?"]
panic: tocommit(4432450) is out of range [lastIndex(4432444)]. Was the raft log corrupted, truncated, or lost?
goroutine 249 [running]:
go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc00456e9c0, {0x0, 0x0, 0x0})
/go/pkg/mod/go.uber.org/[email protected]/zapcore/entry.go:232 +0x44c
go.uber.org/zap.(*SugaredLogger).log(0xc0002bad50, 0x4, {0x2b11764?, 0xdcbc85?}, {0xc007dac240?, 0x26ef300?, 0xc0098aa300?}, {0x0, 0x0, 0x0})
/go/pkg/mod/go.uber.org/[email protected]/sugar.go:227 +0xee
go.uber.org/zap.(*SugaredLogger).Panicf(...)
/go/pkg/mod/go.uber.org/[email protected]/sugar.go:159
go.etcd.io/etcd/pkg/logutil.(*zapRaftLogger).Panicf(0x439fec?, {0x2b11764?, 0x2b?}, {0xc007dac240?, 0x1672334?, 0x3cd6081?})
/go/pkg/mod/go.etcd.io/[email protected]/pkg/logutil/zap_raft.go:101 +0x49
go.etcd.io/etcd/raft.(*raftLog).commitTo(0xc000760150, 0x43a001)
/go/pkg/mod/go.etcd.io/[email protected]/raft/log.go:203 +0x103
go.etcd.io/etcd/raft.(*raft).handleHeartbeat(_, {0x8, 0x18cf5532a87c6527, 0xd4c1cc21d4c817ca, 0x18, 0x0, 0x0, {0x0, 0x0, 0x0}, ...})
/go/pkg/mod/go.etcd.io/[email protected]/raft/raft.go:1389 +0x45
go.etcd.io/etcd/raft.stepCandidate(_, {0x8, 0x18cf5532a87c6527, 0xd4c1cc21d4c817ca, 0x18, 0x0, 0x0, {0x0, 0x0, 0x0}, ...})
/go/pkg/mod/go.etcd.io/[email protected]/raft/raft.go:1289 +0x585
go.etcd.io/etcd/raft.(*raft).Step(_, {0x8, 0x18cf5532a87c6527, 0xd4c1cc21d4c817ca, 0x18, 0x0, 0x0, {0x0, 0x0, 0x0}, ...})
/go/pkg/mod/go.etcd.io/[email protected]/raft/raft.go:990 +0x1055
go.etcd.io/etcd/raft.(*node).run(0xc000118d80)
/go/pkg/mod/go.etcd.io/[email protected]/raft/node.go:354 +0xa05
created by go.etcd.io/etcd/raft.RestartNode
/go/pkg/mod/go.etcd.io/[email protected]/raft/node.go:242 +0x24a
https://github.com/etcd-io/etcd/pull/17078#issuecomment-1845324673
sry,when will backport the fix to 3.5 @ahrtr @overvenus
Please feel free to raise two PRs, one for 3.5 and the other for 3.4. thx
Hey @ahrtr I can attempt the backport! :)
/assign
@ahrtr I think we can close this issue?
Yes, with fix being backported to all branches and changelog updated we can close.