nats-server
NRG (2.11): Ignore vote requests if leader heard more recently than election timeout
This is a safeguard that the Raft paper describes for preventing network-partitioned nodes from rejoining with a high term number and forcing the existing leader to step down unnecessarily. While isolated, such a node is likely to have timed out repeatedly, incrementing its term number with each new election attempt.
Section 6 "Cluster membership changes":
The third issue is that removed servers (those not in Cnew) can disrupt the cluster. These servers will not receive heartbeats, so they will time out and start new elections. They will then send RequestVote RPCs with new term numbers, and this will cause the current leader to revert to follower state. A new leader will eventually be elected, but the removed servers will time out again and the process will repeat, resulting in poor availability.
To prevent this problem, servers disregard RequestVote RPCs when they believe a current leader exists. Specifically, if a server receives a RequestVote RPC within the minimum election timeout of hearing from a current leader, it does not update its term or grant its vote. This does not affect normal elections, where each server waits at least a minimum election timeout before starting an election. However, it helps avoid disruptions from removed servers: if a leader is able to get heartbeats to its cluster, then it will not be deposed by larger term numbers.
TODO: the TestNRGSimpleElection test fails with this change; need to investigate whether it has altered step-down behaviour.
Signed-off-by: Neil Twigg [email protected]
I'd suggest the following logic (very close to what you did, but not exactly the same):
New state variable: leaderLastAETimestamp
Every time you get an AE from the node you think is leader:
    leaderLastAETimestamp = time.Now()
When you get a vote request (regardless of the state you are in):
    if time.Since(leaderLastAETimestamp) < some threshold, then ignore the request
    (the conditional above may need a special case for voluntary leader transitions)
I'm not convinced by this approach, so closing. As of #5481, AEs from the current leader will cause the once-isolated node to switch back to follower.