Reduce chance of duelling leaders
Currently, if multiple replicas concurrently try to recover from a failed leader, they can collide and must retry. This is likely to occur in well-behaved networks, since replicas detect the fault at similar times, and it is exacerbated as the number of replicas increases.
This PR uses randomness to reduce the chance of collisions by spreading out the candidates over $2^{16}$ terms, and effectively electing the candidate which chooses the highest term.
Since this is equivalent to each candidate being partitioned from the cluster until it hits its chosen term, there should be no interactions with the rest of the consensus protocol.
The downside of this approach is that if a replica with a higher chosen term becomes a candidate after a lower-termed candidate has already been elected, it will prolong the outage. This case is less likely to occur if PreVote is enabled.
The graph below compares time to recovery when the leader fails, which we define as the maximum interval without any committed requests.
Nice. Interestingly, this change makes Raft behave more like vanilla Paxos.
Paxos assigns unique terms by construction: all candidates use non-intersecting terms. Correspondingly, if there is contention during phase 1 (equivalent to a Raft election), the highest term wins.
Raft can have multiple candidates requesting votes for the same term, and has to retry in case of split votes. This PR reduces the chance of split votes, and makes the candidate with the latest term win the election.
Both policies have pros and cons; we can't swap one for the other without consideration. I think election should behave like the old code by default, for backwards compatibility. But we can make it possible to opt in to the new policy.
One way to do that: make the number of random bits at the end of Term configurable. By default it's zero, so there are no random bits, and "next term" just returns +1.
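That opt-in knob could look roughly like this. The parameter name `randomTermBits` is hypothetical, chosen only to illustrate the suggestion; with the default of 0 the function degenerates to the classic term+1.

```go
package main

import (
	"fmt"
	"math/rand"
)

// nextTerm sketches the suggested configurable policy. randomTermBits is an
// assumed config value, not an existing etcd-io/raft option.
func nextTerm(currentTerm uint64, randomTermBits uint) uint64 {
	logical := currentTerm>>randomTermBits + 1
	if randomTermBits == 0 {
		return logical // backwards-compatible plain increment
	}
	// Spread candidates across 1<<randomTermBits terms per logical term.
	return logical<<randomTermBits | uint64(rand.Intn(1<<randomTermBits))
}

func main() {
	fmt.Println(nextTerm(5, 0))        // → 6: default behaves like the old code
	fmt.Println(nextTerm(0, 16) >> 16) // → 1: logical term with 16 random low bits
}
```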
Aside from the improvement, I would also be pretty interested in the methodology you used to measure the maximum recovery time. Not sure if we already have a tool for this, but it would be great if we could document the performance difference and the methodology for reproducing the measurement.
How does this interact with CheckQuorum+PreVote? Leader stability can be important for availability under certain kinds of network partitions -- it looks like this will still respect the leader recency condition when granting prevotes and votes, so I think it should be fine.
With this change, would it make sense to drop the random 1x-2x factor of the election timeout, and simply use a fixed election timeout, relying on the term randomness to break ties? This would speed up elections on average. A delayed candidate can also upset an already elected leader by sending out vote requests with higher terms. That scenario would still be possible due to network latency alone, so we could end up electing multiple leaders in rapid succession until it settles on the candidate with the highest term -- I think that should be fine, as far as I can tell, but it's worth thinking through the implications here.
@pavelkalinnikov Yep this is exactly where the idea came from (I was doing some comparisons of leader election performance between multipaxos and raft). Also that sounds like a good idea regarding how to merge this without breaking the internet :)
@serathius The measurements were taken using this tool: https://github.com/cjen1/reckon which is reasonably lightweight and very reproducible for these kind of tests (single machine with an emulated network). Here are some slides from a talk I gave on this work: https://coseners.net/wp-content/uploads/2023/08/Coseners-MP-vs-Raft-.pdf The tool is currently usable, but if you have any suggestions on UX I'd appreciate it!
@erikgrinaker My understanding was that etcd doesn't do random exponential backoff when leaders duel? If that exists currently, it can still be useful for leader stability if timeouts are set too low by a user. Otherwise I haven't looked into testing this with PreVote, or under intermittent network conditions (see the 2020 Cloudflare outage).
My understanding was that etcd doesn't do the random exponential backoff when leaders duel?
It doesn't do exponential backoff, but each node's election timeout is randomized by a factor of [1-2) to reduce the chance of ties:
https://github.com/etcd-io/raft/blob/ee0fe9da492888b55fe183cf1a42931ad551ec6b/raft.go#L413-L416
https://github.com/etcd-io/raft/blob/ee0fe9da492888b55fe183cf1a42931ad551ec6b/raft.go#L1989-L1991
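The linked code amounts to something like the following sketch (names are illustrative; see the linked raft.go for the real implementation):

```go
package main

import (
	"fmt"
	"math/rand"
)

// randomizedElectionTimeout draws a timeout uniformly from
// [electionTimeout, 2*electionTimeout), so followers rarely time out
// simultaneously and ties between candidates become unlikely.
func randomizedElectionTimeout(electionTimeout int) int {
	return electionTimeout + rand.Intn(electionTimeout)
}

func main() {
	fmt.Println(randomizedElectionTimeout(10)) // some value in [10, 20)
}
```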
It doesn't do exponential backoff, but each node's election timeout is randomized by a factor of [1-2) to reduce the chance of ties:
Ah! Then randomisation of the terms would work instead of the randomisation of the timeout.
This PR might lead to subtle availability issues. Under the existing election timeout mechanism, each member votes only once per term, so at most one leader can be elected in each term. But with this PR, multiple leaders may be elected in a short period, because each member may vote multiple times (once per randomised term). From the application's perspective, the leader changes multiple times in a short period in such a situation.
Another minor concern is that this PR also makes the term harder for humans to read: users have to extract the high 48 bits to get the logical term.
The existing election timeout mechanism is a tradeoff between understandability and efficiency. I am not fully convinced we should proceed with this approach. The bottom line is that we should add a flag to enable or disable the new behavior. But again, that complicates the user API, and we already have too many such flags.
Interestingly, it could be even more beneficial if the generation of the next term varied based on the freshness of the log (log term, log index).