valkey [BUG] cluster rebalance --cluster-weight <node>=0 fails with clusterManagerMoveSlot: NOREPLICAS error

Describe the bug

Testing with the 8.0.0-rc1 load, during a rebalance launched from valkey-cli to remove all the shards from a master, an error is seen if the master is configured with 'cluster-allow-replica-migration no'.

If 'cluster-allow-replica-migration yes', then the command succeeds.

This is different behaviour compared to valkey 7.2.6 where the command succeeds when 'cluster-allow-replica-migration no'

To reproduce

Create an 8 node cluster, 4 primaries, 4 replicas:

valkey-cli --cluster-yes --cluster create 127.0.0.1:6701 127.0.0.1:6702 127.0.0.1:6703 127.0.0.1:6704 127.0.0.1:6705 127.0.0.1:6706 127.0.0.1:6707 127.0.0.1:6708 --cluster-replicas 1

Pick one of the primaries, in my case, node 4, get the node id and set 'cluster-allow-replica-migration no'

valkey-cli -p 6704 cluster myid
valkey-cli -p 6704 config set cluster-allow-replica-migration no

Then rebalance the cluster so it removes all shards from one of the masters:

valkey-cli --cluster-yes --cluster rebalance 127.0.0.1:6701 --cluster-use-empty-masters --cluster-weight 5c8e38063de84c529df1aa23ee431efd4bc4b521=0
>>> Performing Cluster Check (using node 127.0.0.1:6701)
[OK] All nodes agree about slots configuration.
>>> Check for open slots...
>>> Check slots coverage...
[OK] All 16384 slots covered.
>>> Rebalancing across 4 nodes. Total weight = 3.00
Moving 1366 slots from 127.0.0.1:6704 to 127.0.0.1:6701
######################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################
Moving 1365 slots from 127.0.0.1:6704 to 127.0.0.1:6702
#####################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################
Moving 1365 slots from 127.0.0.1:6704 to 127.0.0.1:6703
####################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################*** clusterManagerMoveSlot: NOREPLICAS Not enough good replicas to write.

Despite the error, the cluster check appears ok I think:

$ valkey-cli --cluster check 127.0.0.1:6701
127.0.0.1:6701 (75247bf8...) -> 0 keys | 5462 slots | 1 replicas.
127.0.0.1:6702 (77c23d69...) -> 0 keys | 5461 slots | 1 replicas.
127.0.0.1:6704 (5c8e3806...) -> 0 keys | 0 slots | 0 replicas.
127.0.0.1:6703 (f8a227e7...) -> 0 keys | 5461 slots | 2 replicas.
[OK] 0 keys in 4 primaries.
0.00 keys per slot on average.
>>> Performing Cluster Check (using node 127.0.0.1:6701)
M: 75247bf89c9b3a38c863b50e9a2e67ec914df05a 127.0.0.1:6701
   slots:[0-4095],[12288-13653] (5462 slots) master
   1 additional replica(s)
M: 77c23d691836fba3ce1e6f4951479d634e13b8da 127.0.0.1:6702
   slots:[4096-8191],[13654-15018] (5461 slots) master
   1 additional replica(s)
S: d3e834062a061c5356b1c5c2c3ac8adeaae403bb 127.0.0.1:6708
   slots: (0 slots) slave
   replicates 75247bf89c9b3a38c863b50e9a2e67ec914df05a
S: 49f5594a5100af4704892e707e6576d6609ed645 127.0.0.1:6706
   slots: (0 slots) slave
   replicates 77c23d691836fba3ce1e6f4951479d634e13b8da
S: b524e9c3fc46e18925b5bd910177b67310c0d990 127.0.0.1:6705
   slots: (0 slots) slave
   replicates f8a227e729438df3d8a65b1d5a47eccab688374f
M: 5c8e38063de84c529df1aa23ee431efd4bc4b521 127.0.0.1:6704
   slots: (0 slots) master
M: f8a227e729438df3d8a65b1d5a47eccab688374f 127.0.0.1:6703
   slots:[8192-12287],[15019-16383] (5461 slots) master
   2 additional replica(s)
S: 17634ce39801d2275cc6f3fbd2eab5a50b40f79e 127.0.0.1:6707
   slots: (0 slots) slave
   replicates f8a227e729438df3d8a65b1d5a47eccab688374f
[OK] All nodes agree about slots configuration.
>>> Check for open slots...
>>> Check slots coverage...
[OK] All 16384 slots covered.

Expected behavior

I expect this should not error as the it works in valkey 7.2.6.

Additional information

All server logs are attached. logs.tar.gz

Aug 12 '24 20:08 jdork0

thank for the report. I think https://github.com/valkey-io/valkey/pull/879 can fix it. I will check it later

edit: i see the PR does not cover this case, i will handle it later

Aug 13 '24 00:08 enjoy-binbin

@PingXie in this case, the primary is cluster-allow-replica-migration no and its replica is cluster-allow-replica-migration yes, during the migration:

primary calling blockClientForReplicaAck, waiting its replica
its replica reconfiguring itself as a replica of other shards, and disconnect from the old primary
the old primary never got the chance to receive the ack, so it got a timeout and got a NOREPLICAS error

I can't think of a good way, do you have any ideas?

A tcl test that can reproduce this:

# Allocate slot 0 to the last primary and evenly distribute the remaining
# slots to the remaining primary.
proc my_slot_allocation {masters replicas} {
    set avg [expr double(16384) / [expr $masters-1]]
    set slot_start 1
    for {set j 0} {$j < $masters-1} {incr j} {
        set slot_end [expr int(ceil(($j + 1) * $avg) - 1)]
        R $j cluster addslotsrange $slot_start $slot_end
        set slot_start [expr $slot_end + 1]
    }
    R [expr $masters-1] cluster addslots 0
}

start_cluster 4 4 {tags {external:skip cluster} overrides {cluster-node-timeout 1000 cluster-migration-barrier 999}} {
    test "Simple test" {
        # Move the slot 0 from primary 3 to primary 0
        set addr "[srv 0 host]:[srv 0 port]"
        set myid [R 3 CLUSTER MYID]
        R 3 config set cluster-allow-replica-migration no
        R 3 debug log binbinzhubinbinzhu
        set code [catch {
            exec src/valkey-cli {*}[valkeycli_tls_config "./tests"] --cluster rebalance $addr --cluster-weight $myid=0
        } result]
        if {$code != 0} {
            fail "valkey-cli --cluster rebalance returns non-zero exit code, output below:\n$result"
        }
    }
} my_slot_allocation cluster_allocate_replicas ;# start_cluster

Aug 13 '24 04:08 enjoy-binbin

Right I don't think #879 would help with this issue.

I think one option could be waiting for n-1 replicas in this case instead, in anticipation of the potential replica loss (due to the migration). The reliability would be a bit lower but still better than not having the synchronization at all.

Aug 13 '24 07:08 PingXie

n-1 is a hard-coded number, if the primary got more replicas, it will still got the error here. One way is to update num_replicas when the replica is disconnected (we remember we are waiting for it), but this does not feel good either.

Aug 13 '24 07:08 enjoy-binbin

I did a bit more investigation and it is failing on the second call to clusterManagerSetSlot in clusterManagerMoveSlot where it sets it on the source node.

Would it be ok to add NOREPLICAS to the list of acceptable errors?

Aug 13 '24 11:08 jdork0

Would it be ok to add NOREPLICAS to the list of acceptable errors?

@jdork0 you are right. There is no harm in this case to ignore the error. In fact, I think this situation is essentially the same as the case where both the source primary and replicas become replicas of the target shard. The last CLUSTER SETSLOT NODE call will fail on the source primary with "Please use SETSLOT only with primaries" and we explicitly ignore it.

Aug 20 '24 20:08 PingXie

i am guessing we can't simply ignore the NOREPLICAS error here? otherwise we will lost the CLUSTER SETSLOT command in the source node view. (though it may be harmless, the dst node will gossip the message eventually)

ok, lets ignore the NOREPLICAS error in the source node side.

Aug 21 '24 06:08 enjoy-binbin

ok, lets ignore the NOREPLICAS error in the source node side.

Right. We should only ignore it on the source side.

Aug 22 '24 05:08 PingXie