[NEW] Faster cluster failover
In very fast networks, we don't need the hard-coded 500 ms delay. Can we change these hard-coded numbers to be relative to the configured node timeout?
The goal is to have less downtime during an automatic failover.
~~Faster manual failover during upgrade and similar. We can skip the delay that was intended for automatic failover.~~
~~The hard-coded delays may be useful for automatic failover to make it unlikely that two replicas initiate failover at the same time, but this is not useful in manual failover. We can make manual failover faster if we just skip this delay in manual failover.~~
```c
server.cluster->failover_auth_time = now +
    500 + /* Fixed delay of 500 milliseconds, let FAIL msg propagate. */
    random() % 500; /* Random delay between 0 and 500 milliseconds. */
/* We add another delay that is proportional to the replica rank.
 * Specifically 1 second * rank. This way replicas that have a probably
 * less updated replication offset, are penalized. */
server.cluster->failover_auth_time += server.cluster->failover_auth_rank * 1000;
```
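For illustration, a minimal sketch of what "relative to the node timeout" could look like at this spot, assuming the configured timeout is available as server.cluster_node_timeout as elsewhere in the code; the divisor and the 500 ms cap are placeholder choices, not a decided design:

```c
/* Hypothetical sketch: derive the base delay from the configured node timeout
 * instead of hard-coding 500 ms. Divisor and cap are illustrative only. */
mstime_t base_delay = server.cluster_node_timeout / 30;
if (base_delay > 500) base_delay = 500; /* Keep today's 500 ms as an upper bound. */
if (base_delay < 1) base_delay = 1;     /* Guard the modulo below for tiny timeouts. */

server.cluster->failover_auth_time = now +
    base_delay +            /* Let the FAIL msg propagate. */
    random() % base_delay;  /* Random jitter to de-synchronize replicas. */
server.cluster->failover_auth_time += server.cluster->failover_auth_rank * 1000;
```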
@madolson @enjoy-binbin @hpatro Am I missing anything?
Slightly below the lines you've referenced, we clean up the failover_auth_time if it's a manual failover, so that the failover is performed immediately.
https://github.com/valkey-io/valkey/blob/8df0a6b67587cb5505d628d76e5c83532f4a4c39/src/cluster_legacy.c#L4962-L4969
I guess we can flip the logic and not compute any delay if it's for manual failover to avoid the above confusion.
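A rough sketch of that flip, assuming the mf_end flag and the CLUSTER_TODO_HANDLE_FAILOVER handling from the linked code stay as they are; this only illustrates the control flow, it is not a reviewed patch:

```c
if (server.cluster->mf_end) {
    /* Manual failover: no propagation delay, no rank penalty; vote ASAP. */
    server.cluster->failover_auth_time = now;
    server.cluster->failover_auth_rank = 0;
    clusterDoBeforeSleep(CLUSTER_TODO_HANDLE_FAILOVER);
} else {
    /* Automatic failover: keep the randomized, rank-weighted delay. */
    server.cluster->failover_auth_time = now +
        500 + random() % 500 +
        server.cluster->failover_auth_rank * 1000;
}
```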
Oh, my bad. :)
I got this from someone else and I think I got it wrong. Instead, for automatic failover, can we change these hard-coded numbers? In a very fast network, we don't need the fixed 500 ms.
Can we make these be calculated from the node timeout or something instead? Lower node timeout can result in lower failover delay?
> Can we make these be calculated from the node timeout or something instead? Lower node timeout can result in lower failover delay?
Yeah, better than hardcoded values.
Sorry I missed the thread. And yes, I think we can indeed adjust the number.
In #1018, I used the same 500 number to add the delay by default to avoid the conflict; I forget whether I ever mentioned making the 500 a configuration item. Do you have a good way to calculate it from the node timeout? Or do you think we should make it a configuration item (so that the admin can control this logic to avoid the conflict and make it faster)?
> Do you have a good way to calculate it from the node timeout?
Just an idea: node_timeout / 30. It gives us 500 for the default node timeout of 15000.
For node timeout 3000, we get 100 ms. Seems ok?
> Or do you think we should make it a configuration item
If it's risky to change users' current behavior, then maybe a new config is better. I want to avoid configs though, because they make it more difficult for users to configure everything.
This is related to network latency just like node timeout, so I hope it can be controlled by the same config.
> Just an idea: node_timeout / 30. It gives us 500 for the default node timeout of 15000. For node timeout 3000, we get 100 ms. Seems ok?
This looks like a good way to do the calculation.
> This is related to network latency just like node timeout, so I hope it can be controlled by the same config.
That is a good point. But sometimes node timeout is not only related to the network. I have seen too many cases where a bad slowlog caused the node to time out, so for some developers, node-timeout is the time a cluster will take to recover after a bad situation.
So when they have some very slow Lua script, etc., they increase the node timeout to compensate?
We can cap it at 500, i.e. min(node_timeout / 30, 500).
If there is a concern about low numbers, i.e. that 100 is too low for node timeout 3000, then maybe we can make it something like min((node_timeout + 5000) / 40, 500). That gives us 500 for 15000, but 200 for 3000. (The lower bound is 125 for node timeout 0.)
Or we can make a config with default = auto.
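As a quick sanity check of the two candidate formulas, here is a small standalone program (not part of Valkey) that prints the delay each one produces for a few node timeouts; it reproduces the values quoted above (500 for 15000, 100 vs 200 for 3000, and the 125 lower bound of the second formula):

```c
#include <stdio.h>

/* Candidate 1: min(node_timeout / 30, 500) */
static long delay_v1(long node_timeout) {
    long d = node_timeout / 30;
    return d > 500 ? 500 : d;
}

/* Candidate 2: min((node_timeout + 5000) / 40, 500) */
static long delay_v2(long node_timeout) {
    long d = (node_timeout + 5000) / 40;
    return d > 500 ? 500 : d;
}

int main(void) {
    long timeouts[] = {0, 1000, 3000, 15000, 60000};
    printf("%10s %8s %8s\n", "timeout", "v1", "v2");
    for (int i = 0; i < 5; i++) {
        long t = timeouts[i];
        printf("%10ld %8ld %8ld\n", t, delay_v1(t), delay_v2(t));
    }
    return 0;
}
```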
> So when they have some very slow Lua script, etc., they increase the node timeout to compensate?
No, they usually don't know that they will send slow commands in the future. And we don't expose this configuration item to them either.
A max seems to be a good idea. I'm not actually worried about it being too low. I believe this delay is more to avoid conflicts, and we do a lot of other things to avoid conflicts, like the rank-based ordering.
Speaking of node timeout.
Or, if a node is blocked in a slow command, is it really dead?
I have also seen some users who just need to execute slow commands and don't care about blocking (not used in online services). In these cases, frequent auto failovers can cause harm (we do a failover, add a new node, drop the old node, etc.), so we have to negotiate with users for a larger node timeout to avoid the auto failover. But on the other hand, we are afraid that when the node is really dead, a large node timeout will lead to slower recovery. I once thought of moving the cluster ping-pong detection away from the main thread.
> `500 + /* Fixed delay of 500 milliseconds, let FAIL msg propagate. */`
I see; the main purpose of this 500 is to let the FAIL propagate so that the other primaries won't reject the vote (a primary will reject the vote if it thinks the requesting replica's primary is up). The question is: if I am a replica and find out my primary is FAIL, that means another primary node has sent a FAIL to me, which means that primary has reached quorum from its view, which means a quorum of nodes in the cluster are already aware of the FAIL, so we don't actually need to wait for the FAIL to propagate. (In the normal case, if I can get a FAIL, the other nodes should have received the FAIL as well, or should even have received the FAIL before the FAILOVER_AUTH.)
It sounds like we could just remove it.
===============
It is possible, though, that some nodes still think the dead primary is in PFAIL state. So maybe if the node is PFAIL, we can also allow voting? (Though this may lead to split brain, I guess, in a bad case.)
```c
} else if (!nodeFailed(primary)) {
    serverLog(LL_WARNING, "Failover auth denied to %.40s (%s) for epoch %llu: its primary is up", node->name,
              node->human_nodename, (unsigned long long)requestCurrentEpoch);
}
```
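A hypothetical way to express that relaxation at this check, assuming the existing PFAIL-flag helper (nodeTimedOut() in the current code) is the right way to detect the locally-suspected state; whether this is actually safe against split brain is exactly the open question above:

```c
/* Hypothetical: also grant the vote when the requesting replica's primary is
 * at least PFAIL from our point of view, not only when it is fully FAIL. */
} else if (!nodeFailed(primary) && !nodeTimedOut(primary)) {
    serverLog(LL_WARNING, "Failover auth denied to %.40s (%s) for epoch %llu: its primary is up", node->name,
              node->human_nodename, (unsigned long long)requestCurrentEpoch);
}
```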
> Speaking of node timeout.
> Or, if a node is blocked in a slow command, is it really dead?
> I have also seen some users who just need to execute slow commands and don't care about blocking (not used in online services). In these cases, frequent auto failovers can cause harm (we do a failover, add a new node, drop the old node, etc.), so we have to negotiate with users for a larger node timeout to avoid the auto failover. But on the other hand, we are afraid that when the node is really dead, a large node timeout will lead to slower recovery. I once thought of moving the cluster ping-pong detection away from the main thread.
At the end of the day, I think we all see the same set of issues 😅. I had the same thought of doing the health check outside the main thread. Some of the ideas were captured here as well: https://github.com/valkey-io/valkey/issues/1893
Node timeout is 15 seconds by default. I guess it is a good tradeoff but users need to configure it for their use case.
Offline data processing with slow commands can set a very high number.
And some users maybe don't know that they are using slow commands. For them maybe the default node timeout of 15 seconds is good.
But some critical systems like ours can't accept much downtime, so we want to optimize these times, and a very low node timeout also means more cluster bus traffic. This 500 ms seems easy to skip. When node timeout is 1 second, this 500 ms is significant.
I think min(node_timeout / 30, 500) can be a good number. I hope it's safe to do it without a config.
> When node timeout is 1 second, this 500 ms is significant.
Makes sense. Just curious, what is the minimum timeout you have set in the real world?
> I think min(node_timeout / 30, 500) can be a good number. I hope it's safe to do it without a config.
Yes, I think that will be safe, after the reasoning in https://github.com/valkey-io/valkey/issues/2023#issuecomment-2965021139. I even think it might be safe to remove it entirely; let me test that sometime.
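For reference, removing the fixed component would leave only the jitter and the rank penalty in the delay computation. A minimal sketch of that variant, purely illustrative and pending the testing mentioned above:

```c
/* Hypothetical: drop the fixed FAIL-propagation delay entirely, on the
 * argument above that a replica only learns FAIL after a quorum already
 * agreed. Keep the jitter and the rank penalty to avoid vote conflicts. */
server.cluster->failover_auth_time = now +
    random() % 500; /* Random delay between 0 and 500 milliseconds. */
server.cluster->failover_auth_time += server.cluster->failover_auth_rank * 1000;
```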