
raft: Don't panic on commit index regression

Open bdarnell opened this issue 7 years ago • 6 comments

We occasionally see issues in production in which writes to our raft log are not fully persisted before MsgAppResps are sent (sometimes this is deliberate, as when users disable fsync for performance; sometimes it's a bug in some layer or another). The raft package does not handle this gracefully: once a follower has sent an MsgAppResp for a given index, the leader will send it that index as committed in future MsgApps. If the raft log was not written successfully, this causes the follower to crash with `panic: tocommit(265625) is out of range [lastIndex(265624)]. Was the raft log corrupted, truncated, or lost?`.
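To make this concrete, here is a simplified, standalone paraphrase of the check that trips. It mirrors the commitTo logic in the raft package's log handling, but it is a sketch rather than a verbatim copy of any particular version:

```go
package main

import "fmt"

// commitTo mirrors the check in the raft package (paraphrased, not a verbatim
// copy): the commit index may only move forward, and never past the last
// index actually present in the local log; otherwise raft panics.
func commitTo(committed, lastIndex, tocommit uint64) uint64 {
	if committed < tocommit {
		if lastIndex < tocommit {
			panic(fmt.Sprintf("tocommit(%d) is out of range [lastIndex(%d)]. Was the raft log corrupted, truncated, or lost?",
				tocommit, lastIndex))
		}
		return tocommit
	}
	return committed
}

func main() {
	// A follower that acked index 265625 but whose log write was lost, so it
	// only has entries up to 265624: the leader's next commit index triggers
	// the panic quoted above.
	commitTo(265620, 265624, 265625)
}
```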

The message is correct because the log was truncated, and this behavior could result in a loss of committed writes if it happened on multiple nodes. However, this is a very severe failure mode. The follower panics, and will repeatedly panic until its disk is wiped and it is started from scratch. In most cases a less extreme recovery path is possible. If the leader still has the log entries in question, it can simply re-send them, and if it does not, it can send a snapshot. I'd like to at least have the option to disable this panic and recover a node in-place when it reaches this state.

I haven't thought this all the way through, but I think the solution would involve synthesizing an MsgAppResp with Reject: true and an appropriate IndexHint when the panic would be triggered.
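Roughly something like the following. This is a hypothetical sketch, not a patch, and it assumes the standalone raft module's raftpb package; note the raftpb field is called RejectHint rather than IndexHint:

```go
package main

import (
	"fmt"

	pb "go.etcd.io/raft/v3/raftpb"
)

// rejectInsteadOfPanic sketches the idea above: when the leader's MsgApp
// carries a commit index beyond what this follower actually has on disk,
// synthesize a rejection that reports our real last index instead of
// panicking, so the leader re-sends the missing entries or falls back to a
// snapshot. This is illustrative only, not the raft package's actual behavior.
func rejectInsteadOfPanic(m pb.Message, lastIndex uint64) (pb.Message, bool) {
	if m.Commit <= lastIndex {
		// Nothing unusual; normal MsgApp handling would proceed.
		return pb.Message{}, false
	}
	return pb.Message{
		To:         m.From,
		Type:       pb.MsgAppResp,
		Index:      lastIndex, // highest entry we actually persisted
		Reject:     true,
		RejectHint: lastIndex, // tells the leader where to resume probing
	}, true
}

func main() {
	in := pb.Message{Type: pb.MsgApp, From: 1, Commit: 265625}
	if resp, ok := rejectInsteadOfPanic(in, 265624); ok {
		fmt.Printf("synthesized %s with Index=%d Reject=%v\n", resp.Type, resp.Index, resp.Reject)
	}
}
```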

bdarnell avatar Oct 09 '18 14:10 bdarnell

/cc @jpbetz

wenjiaswe avatar Oct 11 '18 18:10 wenjiaswe

I'd like to at least have the option to disable this panic and recover a node in-place when it reaches this state.

How about adding a corruption hint in Ready, or something like that, so the upper layer can do something when raft thinks it is in a bad state?
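Something like the sketch below, maybe. It is purely illustrative; neither the field name nor the behavior exists in the raft package today:

```go
// Package raftsketch illustrates a possible shape for the suggestion above.
package raftsketch

// Ready here is a hypothetical variant of raft's Ready struct, not the real
// one: the idea is that raft would report a detected inconsistency (such as a
// commit index beyond the local last index) instead of panicking, and let the
// upper layer choose how to recover (crash, re-sync from the leader, or wipe
// and rejoin). The field name and type are illustrative only.
type Ready struct {
	// ... the real Ready fields (Entries, CommittedEntries, Messages, ...)
	// are omitted here ...

	// Corrupted is non-nil when raft believes it is in a bad state.
	Corrupted error
}
```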

xiang90 avatar Oct 12 '18 22:10 xiang90

Maybe (that would be an alternative to the panics we have all over the place now), but I think it might be difficult to use correctly in this case. Here we know we're missing information, but we can recover if we get the right log entries. "Corruption" in the general case is difficult (if not impossible) to handle without just crashing and giving up.

bdarnell avatar Oct 15 '18 15:10 bdarnell

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Apr 07 '20 04:04 stale[bot]

Following up from the community meeting to revive this conversation. @serathius @lavacat

Another use case where raft should not panic right away:

  • adding a new member seeded with an etcd db snapshot, to reduce the time before it is ready to accept client traffic.

To support this, a few more things may need to change: today an etcd db snapshot is only used to create a fresh new cluster from that data, as a disaster recovery mechanism.

chaochn47 avatar Jun 30 '22 19:06 chaochn47

For reference, here is the original commit that added commitTo with the panic: https://github.com/etcd-io/etcd/pull/1740

lavacat avatar Jul 01 '22 06:07 lavacat

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Dec 31 '22 23:12 stale[bot]

@tbg Is this issue still valid? Please feel free to raise an issue in the new raft repo if needed.

ahrtr avatar Dec 31 '22 23:12 ahrtr

I raised it here again: https://github.com/etcd-io/raft/issues/18. Thanks for the heads up.

tbg avatar Jan 04 '23 08:01 tbg