
[NEW] Full Sync from Replica

Open · murphyjacob4 opened this issue on Oct 24 '25 · 3 comments

The problem/use-case that the feature addresses

Full synchronization causes a large burst in resource consumption on the primary node. Operators often prefer to consume resources on replicas rather than primaries, to avoid impacting the traffic served by the primary.

Description of the feature

Introduce a configuration: CONFIG SET repl-prefer-sync-from-replica yes

When the configuration is enabled, CLUSTER REPLICATE will first identify an eligible replica of the specified node (if one is available).
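
For illustration, this is roughly how an operator might enable the behavior, sketched with the valkey-py client. The config name is the proposal from this issue and does not exist in any released Valkey:

```python
# Hypothetical: repl-prefer-sync-from-replica is the config proposed
# in this issue; it is not implemented in any released Valkey.
import valkey

client = valkey.Valkey(host="127.0.0.1", port=6379)
client.config_set("repl-prefer-sync-from-replica", "yes")

# With the config enabled, CLUSTER REPLICATE would prefer an eligible
# replica of the target node as the full-sync source, if one exists.
client.execute_command("CLUSTER", "REPLICATE", "<target-node-id>")
```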

The new replica would simultaneously connect to the old primary and the existing replica. First, the new replica would send PSYNC to the existing replica, using a command like REPLCONF SNAPSHOT-ONLY during the replication handshake. If the existing replica does not support this capability, we would fall back to sending PSYNC to the primary. If it does support it, the existing replica would send the RDB snapshot to the new replica. Simultaneously, the new replica can PSYNC from the primary starting at the snapshot's offset (a la dual channel replication).
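
To make the handshake concrete, here is a rough Python sketch of the source-selection logic on the new replica. Everything here is hypothetical per the proposal: REPLCONF SNAPSHOT-ONLY, the snapshot-offset reporting, and the Conn stand-in are not existing Valkey protocol.

```python
# Sketch of the proposed handshake on the new (syncing) replica.
# REPLCONF SNAPSHOT-ONLY and the dual-connection flow are proposals
# from this issue; Conn is a stubbed stand-in for a replication link.

class Conn:
    """Stand-in for a replication connection (stubbed for the sketch)."""
    def send_command(self, *args: str) -> str: return "OK"
    def receive_rdb_snapshot(self) -> int: return 0   # returns snapshot offset
    def stream_backlog(self) -> None: pass            # apply ongoing changes

def full_sync(primary: Conn, source_replica: Conn) -> None:
    # Probe whether the existing replica can serve a snapshot-only sync.
    if source_replica.send_command("REPLCONF", "SNAPSHOT-ONLY") == "OK":
        # The existing replica streams its RDB snapshot and reports the
        # replication offset the snapshot corresponds to.
        offset = source_replica.receive_rdb_snapshot()
        # In parallel (a la dual-channel replication), PSYNC from the
        # primary at the snapshot offset to pick up the live stream.
        primary.send_command("PSYNC", "?", str(offset))
        primary.stream_backlog()
    else:
        # Fallback: classic full sync directly from the primary.
        primary.send_command("PSYNC", "?", "-1")
        primary.receive_rdb_snapshot()
        primary.stream_backlog()
```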

We could also support a non-dual-channel flavor, but we would need the primary to hold a pointer to the snapshot's offset in the repl-backlog, so that the PSYNC issued after the snapshot is loaded does not fail.
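
For that flavor, here is a toy sketch of what pinning the repl-backlog at the snapshot offset could look like. This is purely illustrative; the real backlog is a circular buffer in the C server and has no such API.

```python
# Hypothetical sketch: the primary pins the backlog at the snapshot
# offset so the bytes needed for the post-snapshot PSYNC are not
# evicted while the new replica loads the RDB from the other replica.

class ReplBacklog:
    """Toy stand-in for the primary's replication backlog."""

    def __init__(self) -> None:
        self.start_offset = 0          # oldest offset still retained
        self.pinned: set[int] = set()  # offsets that must stay servable

    def pin(self, offset: int) -> None:
        # Called when a new replica starts loading a snapshot taken at
        # `offset` from another replica.
        self.pinned.add(offset)

    def unpin(self, offset: int) -> None:
        # Called once that replica's PSYNC at `offset` has succeeded.
        self.pinned.discard(offset)

    def trim(self, target_offset: int) -> None:
        # Never trim past the oldest pinned offset, so a later
        # PSYNC at that offset can still be served from the backlog.
        limit = min(self.pinned, default=target_offset)
        self.start_offset = max(self.start_offset, min(target_offset, limit))
```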

Alternatives you've considered

CLUSTER REPLICATE could specify this directly rather than using a config: either CLUSTER REPLICATE <primary> SYNCFROM <replica>, CLUSTER REPLICATE <primary> SYNCFROMREPLICA, or even CLUSTER REPLICATE <replica> (although the last one would be a breaking change, I think). Or we could change it to a shard-based command: CLUSTER JOINSHARD <shardid> USING <replica>.

We could also support chained replication. But I feel that to do chained replication correctly, it shouldn't be a static configuration ("node A replicates to B, which replicates to C") but a dynamic one: "A is the primary of shard 1, B and C are replicas of shard 1, and the customer supplied a maximum chaining depth of 2, so we can do A->B->C; but if B dies, we can still reconfigure as A->C". Or, instead of a chaining depth, it could be a per-node replica maximum (1 gives A->B, B->C, C->D; 2 gives A->B, A->C, B->D; etc.), as sketched below.
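
As a toy illustration of the per-node replica maximum, a small assignment function that fills the replication tree breadth-first from the primary. Nothing like this exists in Valkey today; it just shows how rerunning the assignment on the surviving nodes yields the reconfigured chain (e.g. A->C if B dies).

```python
# Toy sketch: given a shard's primary, its replicas, and a per-node
# replica maximum (fan-out), assign each replica a sync source.

from collections import deque

def assign_sources(primary: str, replicas: list[str],
                   max_fanout: int) -> dict[str, str]:
    sources: dict[str, str] = {}
    slots = deque([primary])      # nodes with spare replica slots
    pending = deque(replicas)
    while pending:
        parent = slots[0]
        child = pending.popleft()
        sources[child] = parent
        slots.append(child)       # the child can serve replicas too
        if sum(1 for s in sources.values() if s == parent) >= max_fanout:
            slots.popleft()       # parent's slots are now full
    return sources

print(assign_sources("A", ["B", "C", "D"], 1))  # {'B': 'A', 'C': 'B', 'D': 'C'}
print(assign_sources("A", ["B", "C", "D"], 2))  # {'B': 'A', 'C': 'A', 'D': 'B'}
```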

Additional information

Atomic slot migration would also benefit from similar functionality.

murphyjacob4 (Oct 24 '25)

This is a great idea!

When we were running 6.2, we used to do upgrades by adding new replicas as sub-replicas and then triggering a failover. This worked in 6.2, but later versions detect and reject sub-replicas. I can imagine it might still be possible to add a sub-replica if it's not yet part of the cluster (hasn't yet sent CLUSTER MEET) and, only after the full sync, add it using CLUSTER MEET. It'd be interesting to know if this works.

zuiderkwast (Oct 24 '25)

> When we were running 6.2, we used to do upgrades by adding new replicas as sub-replicas and then triggering a failover. This worked in 6.2, but later versions detect and reject sub-replicas. I can imagine it might still be possible to add a sub-replica if it's not yet part of the cluster (hasn't yet sent CLUSTER MEET) and, only after the full sync, add it using CLUSTER MEET. It'd be interesting to know if this works.

I don't think that's possible. REPLICAOF is disabled in cluster-enabled mode, and CLUSTER REPLICATE doesn't allow sub-replicas. We could consider loosening those constraints, though, like @murphyjacob4 mentioned.

We need to think through the shard-id behavior. We would need to maintain a single shard-id across the shard.

When this question came up during the meetup, I was thinking that with a durability setup this would be even more beneficial, since the dataset is the same across all the nodes, so syncing from a replica can take load away from the primary nodes.

hpatro (Oct 24 '25)

Discussed this in the weekly meeting. There was a broad consensus about this feature being useful, but the specific design is still TBD. Nobody is specifically looking to implement it.

Some open questions we haven't yet settled:

  1. The full sync will always happen from the replica, but we could stream the ongoing changes from either the primary or the replica. If we stream them from the replica, we will need to support a second PSYNC from the primary. Someone could take ownership of just this piece.
  2. How to integrate this into cluster mode. It could either be a first-class component of the topology, or just a transient step to get a new node into the cluster without putting pressure on the primary.

We need a more concrete design as a next step.

madolson (Nov 03 '25)