valkey icon indicating copy to clipboard operation
valkey copied to clipboard

Fix CLUSTER SETSLOT block and unblock error when all replicas are down

Open enjoy-binbin opened this issue 1 year ago • 5 comments

In CLUSTER SETSLOT propagation logic, if the replicas are down, the client will get block during command processing and then unblock with NOREPLICAS Not enough good replicas to write.

The reason is that all replicas are down (or some are down), but myself->num_replicas is including all replicas, so the client will get block and always get timeout.

We should only wait for those normal replicas, otherwise the waiting propagation will always timeout since there are not enough replicas.

enjoy-binbin avatar Aug 08 '24 15:08 enjoy-binbin

Codecov Report

:white_check_mark: All modified and coverable lines are covered by tests. :white_check_mark: Project coverage is 70.59%. Comparing base (109cc21) to head (7a41f69). :warning: Report is 849 commits behind head on unstable.

Additional details and impacted files
@@             Coverage Diff              @@
##           unstable     #879      +/-   ##
============================================
+ Coverage     70.40%   70.59%   +0.19%     
============================================
  Files           112      112              
  Lines         61465    61510      +45     
============================================
+ Hits          43275    43425     +150     
+ Misses        18190    18085     -105     
Files with missing lines Coverage Δ
src/cluster_legacy.c 85.51% <100.00%> (+0.17%) :arrow_up:

... and 25 files with indirect coverage changes

:rocket: New features to boost your workflow:
  • :snowflake: Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • :package: JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

codecov[bot] avatar Aug 08 '24 22:08 codecov[bot]

@enjoy-binbin I am not sure why this is a bug? We require strong consistency on purpose when replicating the slot migration states. Dropping the synchronization/wait opens up race conditions and reduces reliability. Is there a scenario where such a strong consistency is not desired?

PingXie avatar Aug 13 '24 07:08 PingXie

i think it should wait the online replica, but you also got a good point here, so in this case, you are suggesting user should fix the offline replica first and then fix the migration?

there may a case that a replica is remain fail but we are not forgetting it, so it is always here.

enjoy-binbin avatar Aug 13 '24 07:08 enjoy-binbin

we are struck in this must or not situation. I just thought a good point about why we should use online replica.

Migrations is some kind of admin operation, if the replica is failed, we should know it before the migration. We can fix it or just leave it.

If we still performing the migration, we know the drawback, and we will wait the online replicas.

Of course there might be some conner case, but i think we can live with it.

enjoy-binbin avatar Aug 19 '24 16:08 enjoy-binbin

we are struck in this must or not situation. I just thought a good point about why we should use online replica.

Migrations is some kind of admin operation, if the replica is failed, we should know it before the migration. We can fix it or just leave it.

If we still performing the migration, we know the drawback, and we will wait the online replicas.

Of course there might be some conner case, but i think we can live with it.

I think you have a good point. The admin can easily check if there are replicas that are down for an extended period of time. If they decide to move forward anyways, we should not block it. If a replica failed right before the replication and didn't get included in the replication, it would also unlikely win the election. This PR makes sense to me. Thanks!

PingXie avatar Aug 20 '24 20:08 PingXie