Fix primary crash when processing dirty slots during shutdown wait / failover wait / client pause
We have an assert in propagateNow(). If the primary node receives a CLUSTER UPDATE that dirties some of its slots while it is waiting to shut down after SIGTERM, pausing for a manual failover, or paused by CLIENT PAUSE, the delKeysInSlot() call will trigger this assert and crash the primary.
To handle this case, we added a new server_del_keys_in_slot state, similar to client_pause_in_transaction, to track that the server itself is deleting keys from a dirty slot and to avoid the assert in propagateNow(). The dirty slots are still deleted in the end, so data consistency is not affected.
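A minimal sketch of the approach, assuming the new flag is consulted by the propagateNow() assert in the same way client_pause_in_transaction is (the exact helper names and assert shape in the tree may differ):

```c
/* Sketch only. Mark that the server itself is deleting keys from a dirty
 * slot, so the pause assert in propagateNow() does not mistake this
 * internal propagation for a client write issued during a pause. */

/* cluster_legacy.c: wrap the dirty-slot cleanup with the new state. */
server.server_del_keys_in_slot = 1;
delKeysInSlot(slot);            /* deletes the keys and propagates DEL to replicas/AOF */
server.server_del_keys_in_slot = 0;

/* server.c, propagateNow(): allow propagation while writes are paused if it
 * comes from a transaction started before the pause, or from the dirty-slot
 * deletion above. (Hypothetical assert shape, for illustration only.) */
serverAssert(!isPausedActions(PAUSE_ACTION_REPLICA) ||
             server.client_pause_in_transaction ||
             server.server_del_keys_in_slot);
```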
Codecov Report
:white_check_mark: All modified and coverable lines are covered by tests.
:white_check_mark: Project coverage is 70.73%. Comparing base (2743b7e) to head (1f12165).
:warning: Report is 678 commits behind head on unstable.
Additional details and impacted files
```diff
@@             Coverage Diff              @@
##           unstable    #1131      +/-   ##
============================================
+ Coverage     70.69%   70.73%   +0.03%
============================================
  Files           114      114
  Lines         63076    63078       +2
============================================
+ Hits          44594    44618      +24
+ Misses        18482    18460      -22
```
| Files with missing lines | Coverage Δ | |
|---|---|---|
| src/cluster_legacy.c | 86.28% <100.00%> (-0.03%) | :arrow_down: |
| src/networking.c | 88.60% <ø> (+0.04%) | :arrow_up: |
| src/server.c | 88.74% <100.00%> (ø) | |
| src/server.h | 100.00% <ø> (ø) | |
To be clear, is this what happens:

When a primary has lost a slot, it deletes all the keys in it and replicates DEL to its replicas? That is why some commands are sent to the replicas even when write commands have been paused.

I think it can also happen (losing a slot) while a primary is waiting for a manual failover, right?

> When a primary has lost a slot, it deletes all the keys in it and replicates DEL to its replicas? That is why some commands are sent to the replicas even when write commands have been paused.

Yes, that's the case.

> I think it can also happen (losing a slot) while a primary is waiting for a manual failover, right?

Right, yes, I guess it can also happen; I will try to add a test case to check.
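For context, the path that trips the assert looks roughly like this (simplified sketch; the real loop lives in delKeysInSlot() in cluster_legacy.c, and the key iterator below is hypothetical):

```c
/* Simplified: when the primary no longer owns a slot, every key in that
 * slot is deleted locally and a DEL is replicated to the replicas and the
 * AOF, even though no client issued a write. */
unsigned int deleted = 0;
robj *key;
while ((key = nextKeyInSlot(hashslot)) != NULL) {   /* hypothetical iterator */
    robj *argv[2];
    argv[0] = createStringObject("DEL", 3);
    argv[1] = key;
    dbDelete(&server.db[0], key);
    /* This propagation is what hits the propagateNow() assert when the
     * server is paused (shutdown wait, manual failover, CLIENT PAUSE). */
    propagateNow(0, argv, 2, PROPAGATE_AOF | PROPAGATE_REPL);
    decrRefCount(argv[0]);
    deleted++;
}
```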
@zuiderkwast Thanks, good catch with the manual failover case. I added a new test that confirms it.
I'm thinking of backporting this to 7.2; I added the label. Let me know if we don't want to backport it.

We don't seem to expose in INFO that the server may be in a paused state, or the purpose of the pause. Do we need that?

> I'm thinking of backporting this to 7.2; I added the label. Let me know if we don't want to backport it.
Has there been a report that this issue occurred on 7.2?
> Has there been a report that this issue occurred on 7.2?
Nope
> Has there been a report that this issue occurred on 7.2?
>
> Nope
Should we remove it from the backport? It's a bit painful to cherry-pick into 7.2, as it doesn't have the cluster split code.

> Should we remove it from the backport? It's a bit painful to cherry-pick into 7.2, as it doesn't have the cluster split code.

Ohh, I guess if there is too much conflict, we can drop the backport.