Fix CLUSTER SETSLOT block and unblock error when all replicas are down
In the CLUSTER SETSLOT propagation logic, if the replicas are down, the
client blocks during command processing and later unblocks with the error
`NOREPLICAS Not enough good replicas to write`.
The reason is that some (or all) of the replicas are down, but `myself->num_replicas` counts every replica, so the client blocks and always times out.
We should only wait for the replicas that are online; otherwise the wait for propagation will always time out because there are never enough replicas to acknowledge. A code sketch of this idea follows below.
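As an illustration only, here is a minimal sketch of counting just the online replicas before waiting on propagation, assuming a wait count derived from `myself->num_replicas`. The helper name `countOnlineReplicas` and the exact flag checks are assumptions for illustration, not the actual change in `src/cluster_legacy.c`.

```c
/* Hypothetical sketch: compute how many replicas are actually reachable,
 * instead of using myself->num_replicas directly, so the propagation wait
 * does not include replicas that are already marked as failing/failed.
 * Field and flag names follow the cluster code conventions but are
 * assumptions here. */
static int countOnlineReplicas(clusterNode *node) {
    int online = 0;
    for (int i = 0; i < node->num_replicas; i++) {
        clusterNode *replica = node->replicas[i];
        /* Skip replicas flagged as (possibly) failed; they cannot
         * acknowledge the replicated slot state in time. */
        if (replica->flags & (CLUSTER_NODE_FAIL | CLUSTER_NODE_PFAIL)) continue;
        online++;
    }
    return online;
}

/* The propagation path would then wait for countOnlineReplicas(myself)
 * acknowledgements rather than myself->num_replicas. */
```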
Codecov Report
:white_check_mark: All modified and coverable lines are covered by tests.
:white_check_mark: Project coverage is 70.59%. Comparing base (109cc21) to head (7a41f69).
:warning: Report is 849 commits behind head on unstable.
Additional details and impacted files
| Coverage Diff | unstable | #879 | +/- |
|---|---|---|---|
| Coverage | 70.40% | 70.59% | +0.19% |
| Files | 112 | 112 | |
| Lines | 61465 | 61510 | +45 |
| Hits | 43275 | 43425 | +150 |
| Misses | 18190 | 18085 | -105 |
| Files with missing lines | Coverage Δ | |
|---|---|---|
| src/cluster_legacy.c | 85.51% <100.00%> (+0.17%) | :arrow_up: |
@enjoy-binbin I am not sure this is a bug. We require strong consistency on purpose when replicating the slot migration states. Dropping the synchronization/wait opens up race conditions and reduces reliability. Is there a scenario where such strong consistency is not desired?
I think it should wait for the online replicas, but you also have a good point here. So in this case, you are suggesting the user should fix the offline replica first and then perform the migration?
There may be a case where a replica remains in the failed state but we have not forgotten it, so it is always there.
We are stuck in this must-or-not situation. I just thought of a good point about why we should use the online replicas.
Migration is a kind of admin operation; if a replica has failed, we should know it before the migration. We can fix it or just leave it.
If we still perform the migration, we know the drawback, and we will wait for the online replicas.
Of course there might be some corner cases, but I think we can live with them.
I think you have a good point. The admin can easily check whether there are replicas that have been down for an extended period of time. If they decide to move forward anyway, we should not block it. If a replica failed right before the replication and did not get included in it, it would also be unlikely to win the election. This PR makes sense to me. Thanks!