valkey
Slot migration improvement
Overview
This PR makes Valkey cluster re-sharding more reliable and more automated, specifically when a primary fails during a slot migration. Such failures previously required extensive manual intervention and could lead to data loss or an inconsistent cluster state.
Enhancements
Automatic Failover Support in Empty Shards
The cluster now supports automatic failover in shards that do not own any slots, which is common during scaling operations. This improvement ensures high availability and resilience from the outset of shard expansion.
Replication of Slot Migration States
All CLUSTER SETSLOT commands are now executed on the replica nodes first, and only then on the primary. This keeps the slot migration state consistent within the shard, so the state is not lost if the primary fails. A new TIMEOUT parameter lets users specify how long, in milliseconds, to wait for replication to complete; the default is 2 seconds.
CLUSTER SETSLOT slot { IMPORTING node-id | MIGRATING node-id | NODE node-id | STABLE } [ TIMEOUT timeout ]
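As an illustration of the syntax above, here is a minimal Python sketch that assembles the argument list for one CLUSTER SETSLOT call, including the optional TIMEOUT. The helper name `build_setslot` and its parameters are hypothetical and not part of this PR; the sketch only builds the arguments and does not talk to a server.

```python
from typing import List, Optional

# Subcommands that take a node-id, per the syntax line above.
_NEEDS_NODE_ID = {"IMPORTING", "MIGRATING", "NODE"}


def build_setslot(slot: int, action: str, node_id: Optional[str] = None,
                  timeout_ms: Optional[int] = None) -> List[str]:
    """Hypothetical helper: return the arguments for one CLUSTER SETSLOT call."""
    action = action.upper()
    # A cluster has 16384 hash slots, numbered 0..16383.
    if not 0 <= slot <= 16383:
        raise ValueError("slot must be in 0..16383")

    args = ["CLUSTER", "SETSLOT", str(slot), action]
    if action in _NEEDS_NODE_ID:
        if not node_id:
            raise ValueError(f"{action} requires a node-id")
        args.append(node_id)
    elif action == "STABLE":
        if node_id is not None:
            raise ValueError("STABLE takes no node-id")
    else:
        raise ValueError(f"unknown subcommand: {action}")

    # TIMEOUT is optional; when omitted, the server default applies.
    if timeout_ms is not None:
        if timeout_ms < 0:
            raise ValueError("timeout must be non-negative")
        args += ["TIMEOUT", str(timeout_ms)]
    return args


if __name__ == "__main__":
    # "target-node-id" is a placeholder, not a real node ID.
    print(" ".join(build_setslot(42, "MIGRATING", "target-node-id", timeout_ms=5000)))
```

The resulting list can be passed to whatever client is in use; leaving `timeout_ms` unset omits the TIMEOUT clause entirely.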
Recovery of Logical Migration Links
The update automatically repairs the logical links between source and target nodes during failovers. This ensures that requests are correctly redirected to the new primary in the target shard after a primary failure, maintaining cluster integrity.
Enhanced Support for New Replicas
New replicas added to shards involved in slot migrations will now automatically inherit the slot's migration state as part of their initialization. This ensures that new replicas are immediately consistent with the rest of the shard.
Improved Logging for Slot Migrations
Additional logging has been implemented to provide operators with clearer insights into the slot migration processes and automatic recovery actions, aiding in monitoring and troubleshooting.
Additional Changes
cluster-allow-replica-migration
When cluster-allow-replica-migration is disabled, primary nodes that lose their last slot to another shard will no longer automatically become replicas of the receiving shard. Instead, they will remain in their own shards, which will now be empty, with no slots assigned to them.
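The rule described above can be written out as a toy decision function. This is only a model of the described behavior, not code from the PR; the names `Outcome` and `after_losing_last_slot` are hypothetical, and the enabled-case result is inferred from the wording "no longer automatically become replicas".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcome:
    role: str   # role of the node after it loses its last slot
    shard: str  # which shard the node belongs to afterwards


def after_losing_last_slot(allow_replica_migration: bool) -> Outcome:
    """Toy model: what happens to a primary whose last slot moved to another shard."""
    if allow_replica_migration:
        # Setting enabled: the node joins the shard that received the slot.
        return Outcome(role="replica", shard="receiving")
    # Setting disabled: the node stays primary of its own, now empty, shard.
    return Outcome(role="primary", shard="own")
```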
Fix #21