Fix primary crash when processing dirty slots during shutdown wait / failover wait / client pause
We have an assert in propagateNow(). If the primary node receives a CLUSTER UPDATE that dirties some of its slots while it is waiting to shut down after SIGTERM, pausing for a manual failover, or paused by CLIENT PAUSE, the delKeysInSlot() call will trigger this assert and crash the primary.
To handle this case, we added a new server_del_keys_in_slot state, similar to client_pause_in_transaction, to track that the server itself is deleting keys from a dirty slot and to avoid the assert in propagateNow(). The dirty slots are still deleted in the end, so data consistency is not affected.
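A minimal sketch of the approach, assuming the new flag is consulted by the propagateNow() assert in the same way client_pause_in_transaction is (the exact helper names and assert shape in the tree may differ):

```c
/* Sketch only. Mark that the server itself is deleting keys from a dirty
 * slot, so the pause assert in propagateNow() does not mistake this
 * internal propagation for a client write issued during a pause. */

/* cluster_legacy.c: wrap the dirty-slot cleanup with the new state. */
server.server_del_keys_in_slot = 1;
delKeysInSlot(slot);            /* deletes the keys and propagates DEL to replicas/AOF */
server.server_del_keys_in_slot = 0;

/* server.c, propagateNow(): allow propagation while writes are paused if it
 * comes from a transaction started before the pause, or from the dirty-slot
 * deletion above. (Hypothetical assert shape, for illustration only.) */
serverAssert(!isPausedActions(PAUSE_ACTION_REPLICA) ||
             server.client_pause_in_transaction ||
             server.server_del_keys_in_slot);
```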
Codecov Report
:white_check_mark: All modified and coverable lines are covered by tests.
:white_check_mark: Project coverage is 70.73%. Comparing base (2743b7e) to head (1f12165).
:warning: Report is 678 commits behind head on unstable.
Additional details and impacted files
```diff
@@             Coverage Diff              @@
##           unstable    #1131      +/-   ##
============================================
+ Coverage     70.69%   70.73%   +0.03%
============================================
  Files           114      114
  Lines         63076    63078       +2
============================================
+ Hits          44594    44618      +24
+ Misses        18482    18460      -22
```
| Files with missing lines | Coverage Δ | |
|---|---|---|
| src/cluster_legacy.c | 86.28% <100.00%> (-0.03%) | :arrow_down: |
| src/networking.c | 88.60% <ø> (+0.04%) | :arrow_up: |
| src/server.c | 88.74% <100.00%> (ø) | |
| src/server.h | 100.00% <ø> (ø) | |
To be clear, is this what happens:

When a primary has lost a slot, it deletes all the keys in it and replicates DEL to its replicas? That is why some commands are sent to the replicas even when write commands have been paused.

I think it can also happen (losing a slot) while a primary is waiting for a manual failover, right?

> When a primary has lost a slot, it deletes all the keys in it and replicates DEL to its replicas? That is why some commands are sent to the replicas even when write commands have been paused.

Yes, that's the case.

> I think it can also happen (losing a slot) while a primary is waiting for a manual failover, right?

Right, yes, I guess it can also happen; I will try to add a test case to check.
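For context, the path that trips the assert looks roughly like this (simplified sketch; the real loop lives in delKeysInSlot() in cluster_legacy.c, and the key iterator below is hypothetical):

```c
/* Simplified: when the primary no longer owns a slot, every key in that
 * slot is deleted locally and a DEL is replicated to the replicas and the
 * AOF, even though no client issued a write. */
unsigned int deleted = 0;
robj *key;
while ((key = nextKeyInSlot(hashslot)) != NULL) {   /* hypothetical iterator */
    robj *argv[2];
    argv[0] = createStringObject("DEL", 3);
    argv[1] = key;
    dbDelete(&server.db[0], key);
    /* This propagation is what hits the propagateNow() assert when the
     * server is paused (shutdown wait, manual failover, CLIENT PAUSE). */
    propagateNow(0, argv, 2, PROPAGATE_AOF | PROPAGATE_REPL);
    decrRefCount(argv[0]);
    deleted++;
}
```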
@zuiderkwast Thanks, good catch with the manual failover case. I added a new test that confirms it.
I'm thinking of backporting this to 7.2; I added the label. Let me know if we don't want to backport it.

We don't seem to expose in INFO that the server may be in a paused state, or the purpose of the pause. Do we need that?

> I'm thinking of backporting this to 7.2; I added the label. Let me know if we don't want to backport it.
Has there been a report that this issue occurred on 7.2?
> Has there been a report that this issue occurred on 7.2?
Nope
> Has there been a report that this issue occurred on 7.2?
>
> Nope
Should we remove it from the backport? It's a bit painful to cherry-pick into 7.2, as it doesn't have the cluster split code.

> Should we remove it from the backport? It's a bit painful to cherry-pick into 7.2, as it doesn't have the cluster split code.

Ohh, I guess if there is too much conflict, we can drop the backport.