Configuration change validation has false positives
We currently validate configuration changes at proposal time
https://github.com/etcd-io/raft/blob/4abd9e927c6d5db930dfdb80237ac584449aeec7/raft.go#L1224-L1260
This is not very helpful because it has false positives (refusing a config change that is actually allowed) though at least it currently doesn't have false negatives (because the set of false positives is sufficiently large 😄)
It could be more effective to either compare against the actual most recent config change in the (including the unstable) log, or to move the verification to apply time (i.e. erroring out conf changes that are illegal).
See https://github.com/etcd-io/raft/pull/81.
in #81 (which adds a DisableConfChangeValidation option), @pavelkalinnikov says:
I think we should fix this lag in raft more fundamentally, in such a way that the DisableConfChangeValidation option is not needed. This mechanism should just not be best-effort, and should not have false positives/negatives. Config changes need some reconsideration.
Relying solely on above-raft config change mechanisms (such as in CRDB) also smells like a footgun. Config changes seem fundamentally belonging to raft layer, at least w.r.t. correctness of quorum rules etc.
Currently we do a collection of tricks which makes correctness reasoning hard:
- scan the log to see if we have an unapplied config
- guess if we have a pending config change
- silently nullify incorrect config changes, or silently accept them (with the
DisableConfChangeValidationoption on).
Instead, I think we should just precisely track and know the current and prospective config from the explicit state, and explicitly error out on incorrect transitions. I.e., when we commit a pending config change, it becomes part of the state. To do so, we would need to decouple config changes from logs (otherwise, again, we would have to have a "scan the log" step to find the next config, e.g. when entries are appended or the commit index is advanced).
More generally, it would be nice to have separate state machines: raft itself, and the application-level state machine that it supports.
There are more arguments for this decoupling:
- Config changes are the only entries that
raftneeds to see to operate correctly. All other log entries (which often include user data etc) don't need to go throughraftpackage, it's enough to communicate metadata such as term/index of stored entries. - The application-level commands thus don't have to be embedded into and carried with the protobufs format that
raftpackage imposes. - The application level thus doesn't have to demultiplex raft config changes and its own logs.
- Performance reason. Say, there is a leader and 2 slow followers who are close to stall. It is nice to be able to relocate the two followers to other nodes without waiting for them to catch up. Currently we have to commit a config change through this slow quorum's log, but with decoupling it should be possible to do it bypassing the log, with a "higher priority".
raftdoesn't look at application-level commands (which is most of them), but plays disproportionate role in managing resources and driving the flow of application-level entries: #64. The control should be at the application end. Having app-level commands interleaved with config changes complicates this control though, but decoupling should help.