[NEW] Trigger manual failover on SIGTERM to primary (cluster)
The problem/use-case that the feature addresses
When a primary disappears, its slots are not served until an automatic failover happens. That takes about 3 seconds (node timeout plus a second or so), which is too long for us to not accept writes.
If the host machine is about to shut down for any reason, the processes typically get a SIGTERM and have some time to shut down gracefully. In Kubernetes, this is 30 seconds by default.
Description of the feature
When a primary receives a SIGTERM, let it trigger a failover to one of the replicas as part of the graceful shutdown.
Alternatives you've considered
Our current solution is to have a wrapper process, a small script, that starts Valkey. This wrapper receives the SIGTERM and handles it. It's a workaround though; it would be better to have this built in.
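For reference, the wrapper looks roughly like this (a rough sketch rather than our exact script; it assumes the valkey-py client and that a replica address is known up front, both placeholders here):

```python
#!/usr/bin/env python3
# Rough sketch of the wrapper workaround (not the exact script): start
# valkey-server, and on SIGTERM ask a known replica to take over before
# forwarding the signal to the primary. Host/port below are placeholders.
import signal
import subprocess
import sys

import valkey  # assumes the valkey-py client

REPLICA_HOST, REPLICA_PORT = "replica-0.example", 6379  # placeholder address

proc = subprocess.Popen(["valkey-server", "/etc/valkey/valkey.conf"])

def handle_sigterm(signum, frame):
    try:
        # Ask the replica to start a coordinated manual failover (cluster mode).
        valkey.Valkey(host=REPLICA_HOST, port=REPLICA_PORT).execute_command(
            "CLUSTER", "FAILOVER"
        )
        # A real script would poll here until the replica reports itself as
        # primary before letting the old primary go away.
    except Exception as exc:
        print(f"failover request failed: {exc}", file=sys.stderr)
    # Forward the signal so valkey-server does its normal graceful shutdown.
    proc.send_signal(signal.SIGTERM)

signal.signal(signal.SIGTERM, handle_sigterm)
sys.exit(proc.wait())
```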
Additional information
We use cluster.
This is a good idea. In my fork, I have a similar feature (detect a disk error and do a failover). The basic idea is that the primary will pick the best replica and send a CLUSTER FAILOVER to it (and wait for it). Do you like this approach, or do you want me to try it?
the primary will pick the best replica and send a CLUSTER FAILOVER to it (and wait for it). Do you like this approach
Yes, this is straightforward. I like it.
I have another idea, similar but maybe faster(?) and more complex(?). The idea is this: the primary first pauses writes, then waits for the replica to replicate everything and then sends CLUSTER FAILOVER FORCE. This avoids step 1 below. This is from the docs of CLUSTER FAILOVER:
1. The replica tells the master to stop processing queries from clients.
2. The master replies to the replica with the current replication offset.
3. The replica waits for the replication offset to match on its side, to make sure it processed all the data from the master before it continues.
4. The replica starts a failover, obtains a new configuration epoch from the majority of the masters, and broadcasts the new configuration.
5. The old master receives the configuration update: unblocks its clients and starts replying with redirection messages so that they'll continue the chat with the new master.
And for FORCE:
If the FORCE option is given, the replica does not perform any handshake with the master, that may be not reachable, but instead just starts a failover ASAP starting from point 4. This is useful when we want to start a manual failover while the master is no longer reachable.
The primary first pauses writes, then waits for the replica to replicate everything and then sends CLUSTER FAILOVER FORCE.
Yeah, this seems OK to me, and it's faster.
The difference, I think, is that in one case the replica decides the offset is OK and starts the failover, while in the other the primary tells the replica that it can start the failover.
The CLUSTER FAILOVER one:
- Primary detects the SIGTERM, then picks the best replica and sends it CLUSTER FAILOVER in serverCron (which runs every 100ms).
- Replica receives the CLUSTER FAILOVER, tells the primary to pause writes (and to reply with its offset), and waits for the offsets to match (in clusterCron, every 100ms).
- Replica starts the failover.
Timeline: primary serverCron, primary sends CLUSTER FAILOVER, replica sends MFSTART, primary sends a PING, replica clusterCron and starts the failover. Roughly 100ms + a command + an MFSTART + a PING + 100ms.
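For illustration, the same flow driven from the outside by a client looks roughly like this (sketch only, since the proposal does this inside the primary; it assumes the valkey-py client and a placeholder replica address):

```python
# Illustration only: the first flow driven from a client instead of from
# inside the primary. Assumes the valkey-py client; the address is a placeholder.
import time

import valkey

replica = valkey.Valkey(host="replica-0.example", port=6379)

# Ask the chosen replica to start a coordinated manual failover. The replica
# then handshakes with the primary (MFSTART, offset sync) on its own.
replica.execute_command("CLUSTER", "FAILOVER")

# From the outside we can only poll until the replica reports itself as primary.
while replica.info("replication")["role"] != "master":
    time.sleep(0.1)
```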
The CLUSTER FAILOVER FORCE one:
- Primary detects the SIGTERM, pauses writes, then sends REPLCONF GETACK to all replicas and waits for the responses.
- Primary receives the REPLCONF ACK, checks replica->repl_ack_off against primary_repl_offset, and if they match, sends CLUSTER FAILOVER FORCE to that replica.
- Replica starts the failover.
Timeline: primary serverCron, primary sends REPLCONF GETACK, replica sends REPLCONF ACK, primary sends CLUSTER FAILOVER FORCE, replica clusterCron and starts the failover. Roughly 100ms + a command + a command + a command + 100ms.
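Approximated from the outside with existing commands, this second flow would look roughly like the sketch below (only to show the ordering; the real change lives in serverCron/clusterCron and checks the chosen replica's ack offset directly rather than using WAIT; client and addresses are placeholders):

```python
# Illustration only: the second flow approximated from a client with existing
# commands. Assumes the valkey-py client; addresses are placeholders.
import valkey

primary = valkey.Valkey(host="primary-0.example", port=6379)
replica = valkey.Valkey(host="replica-0.example", port=6379)

# 1. Pause writes on the primary so the replication offset stops moving.
primary.execute_command("CLIENT", "PAUSE", "10000", "WRITE")

# 2. Wait until at least one replica has acknowledged everything written so
#    far (WAIT returns how many replicas reached the primary's offset). The
#    in-server version would instead check replica->repl_ack_off for the
#    chosen replica specifically.
acked = primary.execute_command("WAIT", "1", "5000")

# 3. Once caught up, tell the replica to take over immediately, skipping the
#    handshake with the primary (starting from point 4 in the docs above).
if acked >= 1:
    replica.execute_command("CLUSTER", "FAILOVER", "FORCE")
```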
On shutdown, we already have a feature to pause writes and wait for replicas before shutting down. These two features will need to be combined. Let's discuss the details in a PR. :) Do you want to implement this?
Yeah, I can do it this week.
I don't have enough time to write the test right now. I guess you might want to take a look at it in advance, so here is the commit: https://github.com/enjoy-binbin/valkey/commit/9777d01084ea8c1f91e3c1c2b43123711392a472
I did some small manual testing locally and it seems to work. I will try to find time to finish the test code and the rest of it.
Don't worry. I hope we can have it for 8.2 so you have a few months to finish it. :smile_cat:
Any update on this? This could simplify the lives of folks like me who host Valkey clusters themselves.
Haven't found time to come back to this lately; will pick it up again next week.
@enjoy-binbin, @hwware Should we consider failover-on-shutdown for standalone and sentinel too? For that, we could use the regular FAILOVER command. In this comment https://github.com/valkey-io/valkey/issues/1355#issuecomment-2791404384, there is an example of a script that runs in the same container as Valkey and that can trigger a failover on shutdown for standalone mode.
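For standalone, the external equivalent with the regular FAILOVER command would be roughly like this (just a sketch with placeholder addresses, assuming the valkey-py client, not the exact script from the linked comment):

```python
# Sketch of the standalone equivalent: on shutdown, ask the primary itself to
# fail over to a chosen replica with the regular FAILOVER command. Assumes the
# valkey-py client; addresses are placeholders.
import valkey

primary = valkey.Valkey(host="primary-0.example", port=6379)

# Coordinated failover to a specific replica; abort if it takes over 5 seconds.
primary.execute_command(
    "FAILOVER", "TO", "replica-0.example", "6379", "TIMEOUT", "5000"
)
```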
I have no objections for standalone mode, because the failover mechanism is the same.
I thought we were abandoning Sentinel. I am OK with both though; as long as there is value in it, I think we can also support it.
OK, I think the same. I don't want to prioritize Sentinel, or standalone primary-replica without Sentinel, but if someone wants to contribute it, I think we can accept it.