etcd icon indicating copy to clipboard operation
etcd copied to clipboard

Wrong raft messages may cause etcd panic

Open overvenus opened this issue 2 years ago • 3 comments

Bug report criteria

What happened?

Multiple etcd servers repeatedly panic with a message "tocommit(4432450) is out of range [lastIndex(4432444)]. Was the raft log corrupted, truncated, or lost?"

The first panic occurred one month after the etcd started running. At that time, there were some network issues, leading to TCP reconnections.

What did you expect to happen?

No panic.

How can we reproduce it (as minimally and precisely as possible)?

It's hard to reproduce the panic in real world, but we have reproduced it in a unit test.

Anything else we need to know?

After investigation, we suspect that a malfunctioning switch or a similar issue may be dispatching incorrect Raft messages, leading to the panic. Currently, etcd does not check the To field in messages.

Etcd version (please run commands below)

We (tikv/pd) use embed etcd v3.4.21 https://github.com/etcd-io/etcd/commit/85b640cee793

Etcd configuration (command line flags or environment variables)

Default settings.

Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)

$ etcdctl member list -w table
# paste output here

$ etcdctl --endpoints=<member list> endpoint status -w table
# paste output here

Relevant log output

[2023/10/30 14:48:34.402 +08:00] [WARN] [stream.go:266] ["closed TCP streaming connection with remote peer"] [stream-writer-type="stream MsgApp v2"] [local-member-id=d4c1cc21d4c817ca] [remote-peer-id=18cf5532a87c6527]
[2023/10/30 14:48:34.402 +08:00] [WARN] [stream.go:277] ["established TCP streaming connection with remote peer"] [stream-writer-type="stream MsgApp v2"] [local-member-id=d4c1cc21d4c817ca] [remote-peer-id=18cf5532a87c6527]
[2023/10/30 14:48:34.409 +08:00] [INFO] [stream.go:425] ["established TCP streaming connection with remote peer"] [stream-reader-type="stream MsgApp v2"] [local-member-id=d4c1cc21d4c817ca] [remote-peer-id=18cf5532a87c6527]
[2023/10/30 14:48:34.409 +08:00] [INFO] [stream.go:425] ["established TCP streaming connection with remote peer"] [stream-reader-type="stream MsgApp v2"] [local-member-id=d4c1cc21d4c817ca] [remote-peer-id=1baed0b840546f7]
[2023/10/30 14:48:34.410 +08:00] [INFO] [stream.go:425] ["established TCP streaming connection with remote peer"] [stream-reader-type="stream Message"] [local-member-id=d4c1cc21d4c817ca] [remote-peer-id=18cf5532a87c6527]
[2023/10/30 14:48:34.410 +08:00] [INFO] [stream.go:425] ["established TCP streaming connection with remote peer"] [stream-reader-type="stream Message"] [local-member-id=d4c1cc21d4c817ca] [remote-peer-id=1baed0b840546f7]
[2023/10/30 14:48:34.419 +08:00] [PANIC] [log.go:203] ["tocommit(4432450) is out of range [lastIndex(4432444)]. Was the raft log corrupted, truncated, or lost?"]
...
[2023/10/30 14:48:34.419 +08:00] [PANIC] [log.go:203] ["tocommit(4432450) is out of range [lastIndex(4432444)]. Was the raft log corrupted, truncated, or lost?"]
...
[2023/10/30 14:48:54.770 +08:00] [PANIC] [log.go:203] ["tocommit(4432459) is out of range [lastIndex(4432444)]. Was the raft log corrupted, truncated, or lost?"]
...
[2023/10/30 14:49:13.070 +08:00] [PANIC] [log.go:203] ["tocommit(4432492) is out of range [lastIndex(4432444)]. Was the raft log corrupted, truncated, or lost?"]
...
[2023/10/30 14:49:31.681 +08:00] [PANIC] [log.go:203] ["tocommit(4432521) is out of range [lastIndex(4432444)]. Was the raft log corrupted, truncated, or lost?"]
...
[2023/10/30 14:49:49.816 +08:00] [PANIC] [log.go:203] ["tocommit(4432522) is out of range [lastIndex(4432444)]. Was the raft log corrupted, truncated, or lost?"]


panic: tocommit(4432450) is out of range [lastIndex(4432444)]. Was the raft log corrupted, truncated, or lost?

goroutine 249 [running]:
go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc00456e9c0, {0x0, 0x0, 0x0})
        /go/pkg/mod/go.uber.org/[email protected]/zapcore/entry.go:232 +0x44c
go.uber.org/zap.(*SugaredLogger).log(0xc0002bad50, 0x4, {0x2b11764?, 0xdcbc85?}, {0xc007dac240?, 0x26ef300?, 0xc0098aa300?}, {0x0, 0x0, 0x0})
        /go/pkg/mod/go.uber.org/[email protected]/sugar.go:227 +0xee
go.uber.org/zap.(*SugaredLogger).Panicf(...)
        /go/pkg/mod/go.uber.org/[email protected]/sugar.go:159
go.etcd.io/etcd/pkg/logutil.(*zapRaftLogger).Panicf(0x439fec?, {0x2b11764?, 0x2b?}, {0xc007dac240?, 0x1672334?, 0x3cd6081?})
        /go/pkg/mod/go.etcd.io/[email protected]/pkg/logutil/zap_raft.go:101 +0x49
go.etcd.io/etcd/raft.(*raftLog).commitTo(0xc000760150, 0x43a001)
        /go/pkg/mod/go.etcd.io/[email protected]/raft/log.go:203 +0x103
go.etcd.io/etcd/raft.(*raft).handleHeartbeat(_, {0x8, 0x18cf5532a87c6527, 0xd4c1cc21d4c817ca, 0x18, 0x0, 0x0, {0x0, 0x0, 0x0}, ...})
        /go/pkg/mod/go.etcd.io/[email protected]/raft/raft.go:1389 +0x45
go.etcd.io/etcd/raft.stepCandidate(_, {0x8, 0x18cf5532a87c6527, 0xd4c1cc21d4c817ca, 0x18, 0x0, 0x0, {0x0, 0x0, 0x0}, ...})
        /go/pkg/mod/go.etcd.io/[email protected]/raft/raft.go:1289 +0x585
go.etcd.io/etcd/raft.(*raft).Step(_, {0x8, 0x18cf5532a87c6527, 0xd4c1cc21d4c817ca, 0x18, 0x0, 0x0, {0x0, 0x0, 0x0}, ...})
        /go/pkg/mod/go.etcd.io/[email protected]/raft/raft.go:990 +0x1055
go.etcd.io/etcd/raft.(*node).run(0xc000118d80)
        /go/pkg/mod/go.etcd.io/[email protected]/raft/node.go:354 +0xa05
created by go.etcd.io/etcd/raft.RestartNode
        /go/pkg/mod/go.etcd.io/[email protected]/raft/node.go:242 +0x24a

overvenus avatar Dec 07 '23 15:12 overvenus

https://github.com/etcd-io/etcd/pull/17078#issuecomment-1845324673

ahrtr avatar Dec 07 '23 15:12 ahrtr

sry,when will backport the fix to 3.5 @ahrtr @overvenus

zhan81776075 avatar Feb 21 '24 09:02 zhan81776075

Please feel free to raise two PRs, one for 3.5 and the other for 3.4. thx

ahrtr avatar Feb 21 '24 10:02 ahrtr

Hey @ahrtr I can attempt the backport! :)

/assign

henrybear327 avatar Apr 17 '24 10:04 henrybear327

@ahrtr I think we can close this issue?

henrybear327 avatar Apr 23 '24 19:04 henrybear327

Yes, with fix being backported to all branches and changelog updated we can close.

serathius avatar Apr 24 '24 07:04 serathius