
etcd panics with corrupted raft log errors

Open r0mant opened this issue 4 years ago • 3 comments

Describe the bug

We've seen a few times where etcd fails to start on a master node with the following panic:

May 21 16:29:42 dmitri-centos-node-2 etcd[5481]: tocommit(65072) is out of range [lastIndex(0)]. Was the raft log corrupted, truncated, or lost?
May 21 16:29:42 dmitri-centos-node-2 etcd[5481]: panic: tocommit(65072) is out of range [lastIndex(0)]. Was the raft log corrupted, truncated, or lost?
May 21 16:29:42 dmitri-centos-node-2 etcd[5481]: goroutine 135 [running]:
May 21 16:29:42 dmitri-centos-node-2 etcd[5481]: github.com/coreos/etcd/vendor/github.com/coreos/pkg/capnslog.(*PackageLogger).Panicf(0xc0001deb40, 0xfd977a, 0x5d, 0xc0001a82c0, 0x2, 0x2)
May 21 16:29:42 dmitri-centos-node-2 etcd[5481]:         /home/ANT.AMAZON.COM/leegyuho/go/src/github.com/coreos/etcd/vendor/github.com/coreos/pkg/capnslog/pkg_logger.go:83 +0x135
May 21 16:29:42 dmitri-centos-node-2 etcd[5481]: github.com/coreos/etcd/raft.(*raftLog).commitTo(0xc0001fa770, 0xfe30)
May 21 16:29:42 dmitri-centos-node-2 etcd[5481]:         /home/ANT.AMAZON.COM/leegyuho/go/src/github.com/coreos/etcd/raft/log.go:191 +0x131
May 21 16:29:42 dmitri-centos-node-2 etcd[5481]: github.com/coreos/etcd/raft.(*raft).handleHeartbeat(0xc00028a500, 0x8, 0x3c157c6302242672, 0x6c1ddcb80fda6234, 0x3, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
May 21 16:29:42 dmitri-centos-node-2 etcd[5481]:         /home/ANT.AMAZON.COM/leegyuho/go/src/github.com/coreos/etcd/raft/raft.go:1195 +0x54
May 21 16:29:42 dmitri-centos-node-2 etcd[5481]: github.com/coreos/etcd/raft.stepFollower(0xc00028a500, 0x8, 0x3c157c6302242672, 0x6c1ddcb80fda6234, 0x3, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
May 21 16:29:42 dmitri-centos-node-2 etcd[5481]:         /home/ANT.AMAZON.COM/leegyuho/go/src/github.com/coreos/etcd/raft/raft.go:1141 +0x404
May 21 16:29:42 dmitri-centos-node-2 etcd[5481]: github.com/coreos/etcd/raft.(*raft).Step(0xc00028a500, 0x8, 0x3c157c6302242672, 0x6c1ddcb80fda6234, 0x3, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
May 21 16:29:42 dmitri-centos-node-2 etcd[5481]:         /home/ANT.AMAZON.COM/leegyuho/go/src/github.com/coreos/etcd/raft/raft.go:869 +0x1376
May 21 16:29:42 dmitri-centos-node-2 etcd[5481]: github.com/coreos/etcd/raft.(*node).run(0xc000448060, 0xc00028a500)
May 21 16:29:42 dmitri-centos-node-2 etcd[5481]:         /home/ANT.AMAZON.COM/leegyuho/go/src/github.com/coreos/etcd/raft/node.go:323 +0xf40
May 21 16:29:42 dmitri-centos-node-2 etcd[5481]: created by github.com/coreos/etcd/raft.RestartNode
May 21 16:29:42 dmitri-centos-node-2 etcd[5481]:         /home/ANT.AMAZON.COM/leegyuho/go/src/github.com/coreos/etcd/raft/node.go:223 +0x31c
May 21 16:29:42 dmitri-centos-node-2 systemd[1]: etcd.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
May 21 16:29:42 dmitri-centos-node-2 systemd[1]: etcd.service: Failed with result 'exit-code'.
May 21 16:29:42 dmitri-centos-node-2 systemd[1]: Failed to start Etcd Service.
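
For context, the panic comes from raftLog.commitTo in raft/log.go (the log.go:191 frame above): a leader heartbeat tells the follower that index 65072 is committed, but the member reports lastIndex(0), i.e. its raft log is effectively empty, so etcd refuses to proceed. A rough paraphrase of that check (the exact code may differ between etcd versions):

```go
// Paraphrase of etcd's raft/log.go commitTo check (v3.3-era); see the
// upstream source for the exact implementation.
func (l *raftLog) commitTo(tocommit uint64) {
	// never decrease commit
	if l.committed < tocommit {
		if l.lastIndex() < tocommit {
			// The leader claims `tocommit` is committed, but this member's
			// log does not even contain that entry. Continuing would silently
			// lose data, so etcd panics instead.
			l.logger.Panicf("tocommit(%d) is out of range [lastIndex(%d)]. Was the raft log corrupted, truncated, or lost?",
				tocommit, l.lastIndex())
		}
		l.committed = tocommit
	}
}
```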

The only way to get out of this situation is to rebuild the faulty etcd member as described in Red Hat's KB article: https://access.redhat.com/solutions/4145881.
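
For reference, the rebuild described in the KB article boils down to removing the faulty member from the cluster, wiping its data directory, and re-adding it so it resyncs from the leader. Below is a minimal sketch of the membership part using the etcd v3 Go client; the endpoints, member name, and peer URL are placeholders, and TLS configuration plus the data-directory wipe are omitted (follow the KB article for the authoritative steps).

```go
package main

import (
	"context"
	"log"
	"time"

	"github.com/coreos/etcd/clientv3"
)

func main() {
	// Connect to a healthy member (placeholder endpoint; TLS config omitted).
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"https://healthy-master:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// Find the panicking member by name (placeholder name).
	resp, err := cli.MemberList(ctx)
	if err != nil {
		log.Fatal(err)
	}
	var faultyID uint64
	for _, m := range resp.Members {
		if m.Name == "dmitri-centos-node-2" {
			faultyID = m.ID
		}
	}

	// Remove it from the cluster, then re-add it with its peer URL.
	if _, err := cli.MemberRemove(ctx, faultyID); err != nil {
		log.Fatal(err)
	}
	if _, err := cli.MemberAdd(ctx, []string{"https://dmitri-centos-node-2:2380"}); err != nil {
		log.Fatal(err)
	}

	// After this, wipe the faulty member's data directory and start etcd with
	// --initial-cluster-state=existing so it re-syncs from the leader.
}
```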

The issue was observed on etcd v3.3.11 (Gravity 5.5.40) and v3.3.20.

There are also related issues on the etcd GitHub issue tracker, though none of them seems to provide a resolution or RCA, e.g. https://github.com/etcd-io/etcd/issues/10951 or https://github.com/etcd-io/etcd/issues/10817 (which seems to indicate the issue can be triggered by a reboot).

To Reproduce

Unclear; the issue is intermittent and seems to happen under a couple of circumstances:

  • In internal testing this was seen when adding a new node to the cluster - etcd panicked on the newly joined node.
  • In the field, possibly the node was restarted?

We should try to reproduce it on 5.5.40.

Expected behavior

Etcd doesn't panic.

Logs

Environment (please complete the following information):

  • OS [e.g. Redhat 7.4]:
  • Gravity [e.g. 5.5.4]: 5.5.40
  • Platform [e.g. Vmware, AWS]:

Additional context

r0mant avatar Jun 01 '20 23:06 r0mant

I did a cursory glance at some issues, and the ones I've found appear to suggest that this can occur when the etcd files are legitimately missing or corrupted. So I think we'd need to take a closer look at the systems we've seen this occur on, to see what the state of the filesystem is. It also used to be a big issue that certain hardware would lie about fsyncs being completed, so that might be something else to consider, since one customer had it occur multiple times.

References:

  • https://github.com/etcd-io/etcd/issues/11918
  • https://github.com/etcd-io/etcd/issues/10254

knisbet avatar Jun 02 '20 15:06 knisbet

I modified etcd/raft/raft.go to just ignore the panic, and the problem went away, but I am not sure whether this is OK. @r0mant

```go
func (r *raft) handleHeartbeat(m pb.Message) {
	// Recover from the commitTo panic instead of crashing the process.
	defer func() {
		if e := recover(); e != nil {
			return
		}
	}()
	r.raftLog.commitTo(m.Commit)
	r.send(pb.Message{To: m.From, Type: pb.MsgHeartbeatResp, Context: m.Context})
}
```

hanfengyizu avatar Mar 19 '21 02:03 hanfengyizu

Well, that panic AFAIK is telling you etcd is about to lose data and that some sort of file corruption has occurred, so recovering and ignoring the panic is at best likely to lose data and at worst to leave things in an inconsistent state that spirals further out of control. I really wouldn't try to make a core logic change like that without talking with the etcd developers.

knisbet avatar Mar 19 '21 03:03 knisbet