Trigger manual failover on SIGTERM / shutdown to cluster primary
When a primary disappears, its slots are not served until an automatic failover happens. That takes roughly the cluster node timeout plus a few seconds, which is too long for us to not accept writes.
If the host machine is about to shut down for any reason, the processes typically receive a SIGTERM and have some time to shut down gracefully. In Kubernetes, this grace period is 30 seconds by default.
When a primary receives a SIGTERM or a SHUTDOWN, let it trigger a failover to one of its replicas as part of the graceful shutdown. This can shave off a good part of the unavailability: normally a replica has to detect the primary failure (within the node timeout) before initiating an election, whereas now it can initiate the election immediately, win it, and gossip the result.
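To make the intended flow concrete, here is a minimal standalone sketch, not Valkey's actual code; the helper names are hypothetical placeholders for the steps described above:

```c
/* Minimal standalone sketch of the intended shutdown flow; the helpers
 * below are hypothetical placeholders, not Valkey's internals. */
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

static volatile sig_atomic_t shutdown_requested = 0;

static void on_sigterm(int sig) {
    (void)sig;
    shutdown_requested = 1; /* handled from the main loop, as in Valkey */
}

/* Hypothetical placeholders for the steps discussed in this PR. */
static int i_am_cluster_primary(void) { return 1; }
static void wait_for_replicas_to_catch_up(void) { puts("flush and wait for replica offsets"); }
static void tell_chosen_replica_to_failover(void) { puts("ask the most caught-up replica to take over"); }

int main(void) {
    signal(SIGTERM, on_sigterm);
    while (!shutdown_requested) pause(); /* serve traffic until asked to stop */
    if (i_am_cluster_primary()) {
        wait_for_replicas_to_catch_up();   /* avoid full syncs after the switch */
        tell_chosen_replica_to_failover(); /* so its slots keep being served */
    }
    return 0; /* then finish the usual graceful shutdown and exit */
}
```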
This closes #939.
Codecov Report
:x: Patch coverage is 88.23529% with 8 lines in your changes missing coverage. Please review.
:white_check_mark: Project coverage is 71.09%. Comparing base (dfdcbfe) to head (3b18c45).
:warning: Report is 337 commits behind head on unstable.
| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/cluster_legacy.c | 85.00% | 6 Missing :warning: |
| src/replication.c | 92.59% | 2 Missing :warning: |
Additional details and impacted files
@@ Coverage Diff @@
## unstable #1091 +/- ##
============================================
+ Coverage 71.02% 71.09% +0.06%
============================================
Files 123 123
Lines 65683 65766 +83
============================================
+ Hits 46653 46754 +101
+ Misses 19030 19012 -18
| Files with missing lines | Coverage Δ | |
|---|---|---|
| src/config.c | 78.39% <ø> (ø) | |
| src/server.c | 87.57% <100.00%> (+0.02%) | :arrow_up: |
| src/server.h | 100.00% <ø> (ø) | |
| src/replication.c | 87.32% <92.59%> (+0.05%) | :arrow_up: |
| src/cluster_legacy.c | 86.08% <85.00%> (+0.05%) | :arrow_up: |
The PR description can be updated to explain the solution. Now it is just copy-pasted from the issue. :)
The issue description is good and very detailed, so I copied it. I will update it later.
I'm thinking that doing the failover in finishShutdown() may be too late. finishShutdown is only called once all replicas have a replication offset equal to the primary's (checked by isReadyToShutdown()), or after a timeout (10 seconds). If one replica is very slow, it will delay the failover. I think we can do the manual failover earlier.
Yeah, a failover as soon as possible is good, but isn't it true that the primary is only considered down after it actually exits? In that case, if a replica is slow and does not get the chance to catch up with the primary, and another replica triggers the failover, the slow replica will need a full sync when it does the reconfiguration.
I think we can send CLUSTER FAILOVER FORCE to the first replica whose repl_ack_off == primary_repl_offset. We can do it in isReadyToShutdown(), I think. (We can rename it to indicate that it does more than check whether we're ready.) Then we also wait for that replica to send its failover auth request and for the primary to vote before isReadyToShutdown() returns true.
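For illustration, the selection rule being suggested here looks roughly like this; a simplified sketch, with struct and field names as stand-ins, not Valkey's actual replica representation:

```c
/* Simplified sketch of "pick the first fully caught-up replica";
 * the struct below is illustrative, not Valkey's actual data model. */
#include <stdint.h>
#include <stdio.h>

typedef struct replica {
    const char *name;
    uint64_t repl_ack_off; /* last replication offset the replica acknowledged */
    struct replica *next;
} replica;

/* Return the first replica whose acked offset equals the primary's current
 * replication offset, or NULL if none has caught up yet. */
static replica *pick_failover_target(replica *head, uint64_t primary_repl_offset) {
    for (replica *r = head; r != NULL; r = r->next)
        if (r->repl_ack_off == primary_repl_offset) return r;
    return NULL;
}

int main(void) {
    replica slow = {"replica-b", 980, NULL};
    replica fast = {"replica-a", 1000, &slow};
    replica *target = pick_failover_target(&fast, 1000);
    printf("failover target: %s\n", target ? target->name : "(none yet)");
    return 0;
}
```

The chosen replica would then be told to take over, as suggested above.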
So let me sort it out again: you are suggesting that if one replica has already caught up with the offset, we should trigger the failover immediately?
I guess that also makes sense in this case.
> if a replica is slow and does not get the chance to catch up with the primary, and another replica triggers the failover, the slow replica will need a full sync when it does the reconfiguration.
I didn't think about this. The replica can't do psync to the new primary after failover? If it can't, then maybe you're right that the primary should wait for all replicas, at least for some time, to avoid full sync.
So, wait for all, then trigger manual failover. If you want, we can add another wait after that (after "finish shutdown"), so the primary can vote for the replica before exit. Wdyt?
Sorry for the late reply, I somehow missed this thread.
> I didn't think about this. The replica can't do psync to the new primary after failover? If it can't, then maybe you're right that the primary should wait for all replicas, at least for some time, to avoid full sync.
Yes, I think this may happen: if the primary does not flush its output buffer to the slow replica before the reconfiguration, the slow replica may use an old offset to PSYNC with the new primary, which will cause a full sync. The probability should be small, though, since the primary calls flushReplicasOutputBuffers to write as much as possible before shutdown.
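To make the full-sync risk concrete, here is a simplified sketch of the condition; the real check also compares replication IDs, and the numbers are made up:

```c
/* Simplified sketch: partial resync is only possible if the offset the
 * replica brings is still covered by the new primary's replication backlog.
 * The real check also involves replication IDs; values here are illustrative. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

static bool can_partial_resync(uint64_t replica_offset,
                               uint64_t backlog_start, uint64_t backlog_end) {
    return replica_offset >= backlog_start && replica_offset <= backlog_end;
}

int main(void) {
    /* A replica that missed the final writes before shutdown may fall outside
     * the backlog window and be forced into a full sync. */
    printf("slow replica:      %s\n", can_partial_resync(900, 1000, 5000) ? "partial resync" : "full sync");
    printf("caught-up replica: %s\n", can_partial_resync(4990, 1000, 5000) ? "partial resync" : "full sync");
    return 0;
}
```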
> So, wait for all, then trigger manual failover. If you want, we can add another wait after that (after "finish shutdown"), so the primary can vote for the replica before exit. Wdyt?
Waiting for the vote: I think both are OK. Even if we don't wait, I think the replica will have enough votes. If we really want to, we can even wait until the replica successfully becomes primary before exiting... Do you have a final decision? I will do whatever you think is right.
> Waiting for the vote: I think both are OK. Even if we don't wait, I think the replica will have enough votes. If we really want to, we can even wait until the replica successfully becomes primary before exiting... Do you have a final decision? I will do whatever you think is right.
I'm thinking if there are any corner cases, like if the cluster is too small to have quorum without the shutting down primary...
If it is simple, I prefer to let the primary wait and vote. Then we can avoid the server.cluster->mf_is_primary_failover variable. I don't like this variable and special case. :)
But if this implementation to wait for the vote will be too complex, then let's just skip the vote. I think it's also fine. Without this feature, we wait for automatic failover, which will also not have the vote from the already shutdown primary.
> But if this implementation to wait for the vote will be too complex, then let's just skip the vote. I think it's also fine. Without this feature, we wait for automatic failover, which will also not have the vote from the already shutdown primary.
I am going to skip the vote for now; I tried a bit and it did not seem easy or clean to finish. Maybe I'll have a better idea later; I will keep it in mind.
This is an interesting idea. I like the direction we are going in but I agree with @zuiderkwast that potential data loss is not appealing.
We can do both though IMO triggering a (graceful) failover as part of CLUSTER FORGET is more valuable than making it part of shutdown, because it is cleaner to forget a node prior to shutting it down in any production environment.
Today, we can't forget "myself" nor "my primary" (with the latter being a dynamic state). This adds operational complexity. Imagine that the admin could just send CLUSTER FORGET to any node in the cluster and the server would do the right thing: fail over the primaryship to one of its replicas, if applicable, and then broadcast the forget message to the cluster.
> We can do both though IMO triggering a (graceful) failover as part of `CLUSTER FORGET` is more valuable than making it part of shutdown, because it is cleaner to forget a node prior to shutting it down in any production environment.
@PingXie Yes, it's a good idea, but this PR is about the scenario where the machine is taken down outside the control of the Valkey admin. For example, in Kubernetes, when a worker is shut down, SIGTERM is sent to all processes and it waits for 30 seconds by default. When you shut down your laptop, I believe it's similar: each application gets SIGTERM and has some time to do a graceful shutdown.
CLUSTER FAILOVER REPLICAID node-id can also be sent to a primary now? It replicates it and the replica will do the failover, right?
No, we won't replicate this command.
New config? Maybe someone wants to disable it because they reboot faster than a failover and the node is still a primary when it comes back?
Yes, I think a new config may allow people to make a smoother transition if there are any problems.
I don't understand why you wrote "To avoid polluting the replication stream" in the top comment. What are we avoiding here?
I will remove it; it is old stuff, I guess. In an older version, the code had that issue.
We can add test with an old replica in the cluster. I have merged https://github.com/valkey-io/valkey/pull/1371 so now it is easy to add a test like that. WDYT?
Sounds good, I will take a look.
> It seems odd we'll commit more out of band data into the replication stream
Yeah, that is indeed not 100% clean. We are reusing the replication stream for non-replication purposes.
> if for any reason that replica doesn't take over none of the other ones will do it quickly.
I think this is fine, and it falls back to the current behavior: a new primary will be elected after cluster-node-timeout.
> can we not use a ping extension message to just broadcast to all the replicas that one of them should try to take over?
And mark itself failed? I think it is a good idea and it helps the cluster converge faster on this node's status too.
> which relies on the primary sending a PSYNC command with a special flag indicating that it has stepped down which can get rejected.
Where do we do this today?
> I don't really like that the primary explicitly picks a replica and then sends it a failover request through the replica buffer for all replicas. It seems odd we'll commit more out of band data into the replication stream and if for any reason that replica doesn't take over none of the other ones will do it quickly.
Wdym none of the others will do it quickly? Only one can do it quickly and only the primary can select it. If all replicas try to get elected quickly, none of them gets a majority. That's why there's a random delay in automatic failover, which is what we want to avoid here.
If we use the cluster bus, the primary sends it to the chosen replica only? We could use a new message and a cluster bus capability flag instead then.
IIRC the idea of using the replication buffer was that the replica will get to it only when the replication is complete. Maybe we were overthinking it?
> > It seems odd we'll commit more out of band data into the replication stream
> Yeah, that is indeed not 100% clean. We are reusing the replication stream for non-replication purposes.
@PingXie Like we do with SETSLOT? It was a similar discussion. :)
> @PingXie Like we do with SETSLOT? It was a similar discussion. :)
Not exactly. We do want all replicas to execute SETSLOT. The thing in common though is that there is no user data involved in both cases. I can give you that :)
Btw, my opinion isn't that strong. I don't really mind another mechanism; it just seems weird that we are doing something in yet another way.
> Where do we do this today?
https://github.com/valkey-io/valkey/blob/f85c93301ca63f070164b307c6a7371a9ab5ac7b/src/replication.c#L1047
> Wdym none of the others will do it quickly? Only one can do it quickly and only the primary can select it. If all replicas try to get elected quickly, none of them gets a majority. That's why there's a random delay in automatic failover, which is what we want to avoid here.
One of them will have the lowest delay, and yes it will have some random component, but the benefit is that there will be less complexity and only a single failure path. There is a belief at AWS that it's more important to have fewer failover pathways, and they should all reinforce each other since there is a lower chance of bugs there.
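For reference, the delay in question is the election delay a replica applies during automatic failover; per the cluster specification it looks roughly like the standalone illustration below (not Valkey's code):

```c
/* Election delay used in automatic failover, per the cluster spec: a fixed
 * delay, a random component, and a rank-based component, so replicas with
 * more data tend to start (and win) the election first. */
#include <stdio.h>
#include <stdlib.h>

static long election_delay_ms(int replica_rank) {
    return 500 + (rand() % 500) + replica_rank * 1000;
}

int main(void) {
    for (int rank = 0; rank < 3; rank++)
        printf("rank %d: ~%ld ms after FAIL is observed\n", rank, election_delay_ms(rank));
    return 0;
}
```

The question being debated is whether skipping only the failure-detection time (the node timeout) is enough, or whether this delay should be skipped too by having the primary pick one replica explicitly.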
> Not exactly. We do want all replicas to execute SETSLOT. The thing in common though is that there is no user data involved in both cases. I can give you that :)
Maybe it's just me, but I think it kind of is data related. It's whether or not the given node is able to serve data related to a given slot. I guess you could say this is giving ownership over all of the slots to a given replica. I think our handling of slot state is a bit of a mess, although that's not really related to this PR.
> IIRC the idea of using the replication buffer was that the replica will get to it only when the replication is complete. Maybe we were overthinking it?
Maybe I'm overthinking it :D
> One of them will have the lowest delay, and yes it will have some random component, but the benefit is that there will be less complexity and only a single failure path.
@madolson So with this solution, we just skip the time it takes for the primary to be marked as FAIL. That's about half of the time it takes to get a new primary. If we do that, I think this feature misses the point.
Internally, we have a small script that starts valkey-server inside the same container, so valkey-server is a child process of the script. This script catches the SIGTERM from the OS (or kubernetes) and performs the failover by sending commands. It is very fast.
The purpose is to catch a worker reboot (machine or infra level) that is out of the control of the Valkey admin (or Kubernetes operator). By any means necessary, we want to avoid those ~3 seconds of downtime. (I think we use a node timeout of 2 seconds or so.)
> There is a belief at AWS that it's more important to have fewer failover pathways, and they should all reinforce each other since there is a lower chance of bugs there.
It's simplicity at the price of worse availability. We can (and already do) handle this outside of valkey-server, but that's a more complex and non-standard solution that an official valkey-operator probably wouldn't use. It's cleaner to just start valkey-server as it is.
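For context, the external workaround mentioned above can look roughly like the wrapper below: it starts valkey-server as a child and, on SIGTERM, asks a replica to take over before passing the signal on. The paths, the port, and the use of CLUSTER FAILOVER FORCE via valkey-cli are illustrative assumptions, not our actual script.

```c
/* Sketch of an external wrapper: run valkey-server as a child; on SIGTERM,
 * trigger a failover on a replica (port 6380 here, illustrative) and only
 * then forward the signal so the primary shuts down gracefully. */
#include <signal.h>
#include <sys/wait.h>
#include <unistd.h>

static void run(char *const argv[]) { /* fork + exec + wait for a helper command */
    pid_t pid = fork();
    if (pid == 0) { execvp(argv[0], argv); _exit(127); }
    if (pid > 0) waitpid(pid, NULL, 0);
}

int main(void) {
    sigset_t set;
    sigemptyset(&set);
    sigaddset(&set, SIGTERM);
    sigaddset(&set, SIGCHLD);
    sigprocmask(SIG_BLOCK, &set, NULL); /* the child inherits this mask... */

    pid_t server = fork();
    if (server == 0) {
        sigprocmask(SIG_UNBLOCK, &set, NULL); /* ...so unblock it before exec */
        char *argv[] = {"valkey-server", "/etc/valkey/valkey.conf", NULL};
        execvp(argv[0], argv);
        _exit(127);
    }

    int sig = 0;
    sigwait(&set, &sig); /* wait for either SIGTERM or the server exiting */
    if (sig == SIGTERM) {
        char *failover[] = {"valkey-cli", "-p", "6380", "cluster", "failover", "force", NULL};
        run(failover);         /* ask the replica to take over first */
        kill(server, SIGTERM); /* then let the primary shut down */
    }
    waitpid(server, NULL, 0);
    return 0;
}
```

Having this logic inside valkey-server itself removes the need for such a wrapper, which is the point of this PR.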
Bump @valkey-io/core-team Can we have this in 8.1?
Discussed briefly in core team meeting. No more open questions, so once low level details are done we can merge it for 9.0.
@hpatro @madolson Do you want to complete your reviews, or can we merge it?
> can we merge it?
I have a few new open questions. Will also reset my approval next
When reviewing #2195, I'm thinking back on this feature again. Here, we added a new config auto-failover-on-shutdown. Why didn't we instead add a new option failover to the existing configs for shutdown-on-sigterm and shutdown-on-sigint and an argument to the SHUTDOWN command (SHUTDOWN [FAILOVER]) to trigger the failover on shutdown based on how the shutdown is triggered? This is how other shutdown behavior is selected.
This is not released yet so we can still change this.
Right, I totally forgot about that. I guess it might be the difference between active and passive? auto-failover-on-shutdown (and, for sure, a failover option on shutdown-on-sigterm) is a passive way, and SHUTDOWN FAILOVER is an active way.
And you are right, this is not released yet, so we can still change it for good.
@valkey-io/core-team thoughts?