Allow recovering from some assertions

Open tbg opened this issue 2 years ago • 1 comments

Consider an assertion such as

https://github.com/etcd-io/raft/blob/d9907d6ac6baaebc3c9fd4e67acaa4154d2b3cd3/log.go#L324

of which there are several across the codebase.

Hitting this panic usually means that a follower has not upheld its durability guarantees.

Violating invariants like these is not great, but crashing might not be the user's first choice here. An app might prefer to keep going despite some risk that a write was lost (often, no write will actually have been lost).

The way I would structure this is by introducing event-based logging:

Instead of a line like this:

https://github.com/etcd-io/raft/blob/d9907d6ac6baaebc3c9fd4e67acaa4154d2b3cd3/log.go#L324

we'd have something like this (a total strawman, just to get the idea across):

l.logger.Event(&CommitOutOfRangeIndex{Commit: tocommit, LastIndex: l.lastIndex()})
// Code here to actually handle the problem gracefully
...

where the default logger would panic, but users could define a logger that just logs the event and keeps going. We wouldn't have to make every event that is currently a panic recoverable at first; we could allow this only for certain events, like the one discussed here.

Extracted from https://github.com/etcd-io/raft/issues/25#issuecomment-1449055381


Note that while "help is wanted" here I don't have bandwidth to shepherd a pull request from humble beginnings to the end. Unless another maintainer steps up to "sponsor" this work I'll only be able to accept contributions that are "close enough" to a solution that passes the bar: good design, testing, sensibly documented, backwards compatible. This will be difficult for casual or even first-time contributors.

tbg, Feb 28 '23 23:02

I am trying to solve the commit index regression problem first. I am now writing an interaction test to reproduce the problem.

CaojiamingAlan, May 16 '23 02:05