[RFC]: dynamic replica management
Changes proposed
Introduction
This RFC proposes a dynamic replica management system for Mooncake Store that adaptively adjusts replica configurations based on workload patterns.
Motivation
Mooncake Store currently uses a static replica configuration where each object has a fixed number of replicas determined at write time. This approach does not account for changing access patterns, leading to suboptimal performance and resource utilization. In addition, Mooncake Store currently reads from the first replica that reaches completion, which may not be the most efficient choice compared to reading a local replica. We therefore need a mechanism to dynamically adjust replicas based on object "hotness", and we may also need to optimize read replica selection.
This can bring the following benefits:
- Improved Read Performance: Hot data is preferentially given more memory replicas, so reads are more likely to be served from a local replica.
- Better Resource Utilization: Memory is allocated proportionally more to frequently accessed data, reducing waste.
- Optimized Read Replica Selection: By selecting the most efficient replica for read operations, we can further enhance performance.
This proposal introduces a comprehensive design including a hot data detection algorithm, a replica adjustment and placement strategy, and read replica selection optimization.
In Scope
- Hot data statistics and identification.
- Master-initiated replica adjustment (scale up/down).
- Optimized read replica selection based on locality.
- Logging, testing, and monitoring.
Out of Scope
- Data promotion/demotion across multiple storage tiers.
Proposal
Architecture
Architecture Description
Hot data detection algorithm - Access Frequency-Based Hot Data Detection (Basic Idea)
Description:
The algorithm tracks how many times each object is accessed within a sliding time window. Objects are classified as HOT, WARM, or COLD based on their access frequency and recency:
- HOT: Accessed frequently and recently.
- WARM: Moderate access frequency.
- COLD: Rarely accessed.
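Below is a minimal sketch of what such a sliding-window tracker could look like; the class name `AccessTracker`, the window size, and the thresholds are illustrative assumptions, not the final design:

```cpp
#include <chrono>
#include <cstddef>
#include <deque>
#include <mutex>
#include <string>
#include <unordered_map>

// Illustrative only: the class name, window size, and thresholds are assumptions.
enum class Temperature { HOT, WARM, COLD };

class AccessTracker {
 public:
  // Record one access to `key` at the current time.
  void RecordAccess(const std::string& key) {
    std::lock_guard<std::mutex> lock(mu_);
    const auto now = std::chrono::steady_clock::now();
    auto& window = accesses_[key];
    window.push_back(now);
    Evict(window, now);
  }

  // Classify an object by its access count within the sliding window.
  Temperature Classify(const std::string& key) {
    std::lock_guard<std::mutex> lock(mu_);
    auto it = accesses_.find(key);
    if (it == accesses_.end()) return Temperature::COLD;
    Evict(it->second, std::chrono::steady_clock::now());
    const std::size_t count = it->second.size();
    if (count >= kHotThreshold) return Temperature::HOT;
    if (count >= kWarmThreshold) return Temperature::WARM;
    return Temperature::COLD;
  }

 private:
  using TimePoint = std::chrono::steady_clock::time_point;

  // Drop accesses that have fallen out of the window.
  static void Evict(std::deque<TimePoint>& window, TimePoint now) {
    while (!window.empty() && now - window.front() > kWindow) {
      window.pop_front();
    }
  }

  static constexpr std::chrono::seconds kWindow{60};  // assumed window size
  static constexpr std::size_t kHotThreshold = 100;   // assumed thresholds
  static constexpr std::size_t kWarmThreshold = 10;

  std::mutex mu_;
  std::unordered_map<std::string, std::deque<TimePoint>> accesses_;
};
```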
Atomic Replica Primitives
Description:
To ensure consistency and reliability during replica management, we define two fundamental atomic operations on the Master: `ReplicaCopy` and `ReplicaMove`. These operations are designed as two-phase transactions to handle the long-running nature of data transfer while maintaining metadata consistency.
- `ReplicaCopy` (Scale Up)
  - Phase 1: Start (Allocation) - Master allocates a new replica descriptor on the target node. The replica is marked as `PROCESSING` and is not visible to readers.
  - Phase 2: End (Commit) - After successful data transfer, Master marks the replica as `COMPLETE`. It becomes immediately visible to readers.
- `ReplicaMove` (Migration)
  - Phase 1: Start (Allocation) - Master allocates a new replica descriptor on the target node (marked `PROCESSING`).
  - Phase 2: End (Atomic Swap) - After successful data transfer, Master atomically:
    - Marks the new replica as `COMPLETE`.
    - Removes the source replica from the metadata.
  - This ensures that at no point is the object unavailable or the replica count inconsistent (from the perspective of availability).
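To make the two-phase semantics concrete, here is a minimal sketch of the metadata transitions on the Master. Only the `PROCESSING`/`COMPLETE` states and the phase structure come from this RFC; the structs and function signatures are assumptions for illustration:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Minimal sketch of the two-phase metadata transitions on the Master.
// Only the PROCESSING/COMPLETE states come from the RFC; the structs and
// function signatures are assumptions.
enum class ReplicaState { PROCESSING, COMPLETE };

struct ReplicaDescriptor {
  std::string node_id;
  ReplicaState state;
};

struct ObjectMeta {
  std::vector<ReplicaDescriptor> replicas;  // readers only see COMPLETE replicas
};

// Phase 1 (shared by ReplicaCopy and ReplicaMove): allocate a descriptor on
// the target node, marked PROCESSING so it is invisible to readers.
std::size_t StartReplica(ObjectMeta& meta, const std::string& target_node) {
  meta.replicas.push_back({target_node, ReplicaState::PROCESSING});
  return meta.replicas.size() - 1;  // handle passed to the End phase
}

// Phase 2 of ReplicaCopy: commit the new replica after data transfer succeeds;
// it becomes visible to readers immediately.
void EndReplicaCopy(ObjectMeta& meta, std::size_t new_idx) {
  meta.replicas[new_idx].state = ReplicaState::COMPLETE;
}

// Phase 2 of ReplicaMove: atomically commit the new replica and remove the
// source replica, so the object is never unavailable and the visible replica
// count never decreases. In the real Master this would run under the
// metadata lock / as one transaction.
void EndReplicaMove(ObjectMeta& meta, std::size_t new_idx, std::size_t source_idx) {
  meta.replicas[new_idx].state = ReplicaState::COMPLETE;
  meta.replicas.erase(meta.replicas.begin() +
                      static_cast<std::ptrdiff_t>(source_idx));
}
```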
Replica Adjustment Strategy - Master Initiated Dynamic Replica Adjustment
Description:
The master service tracks object temperature and makes decisions on replica adjustments (e.g., scale up/down). The Master actively orchestrates the replication process. Clients register themselves as "Workers" capable of executing commands. The Master selects a suitable Worker and sends RPC commands to trigger replica transfers.
This approach allows for global optimization of resource usage and transfer scheduling, decoupled from user read requests. It requires a bi-directional communication channel (or Reverse RPC) where the Master can invoke methods on the Client.
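As a rough illustration of the decision step (the function name, thresholds, and target replica counts below are assumptions), the Master's background logic could map an object's temperature to a desired replica count and schedule `ReplicaCopy` or scale-down operations from the difference:

```cpp
#include <algorithm>
#include <cstddef>

// Sketch of a possible Master-side scaling decision: map temperature to a
// desired replica count bounded by configured limits. A positive delta means
// schedule ReplicaCopy operations; a negative delta means scale down.
enum class Temperature { HOT, WARM, COLD };

struct ScalingConfig {
  std::size_t min_replicas = 1;
  std::size_t max_replicas = 4;
};

int DesiredReplicaDelta(Temperature temp, std::size_t current_replicas,
                        const ScalingConfig& cfg) {
  std::size_t target = cfg.min_replicas;
  switch (temp) {
    case Temperature::HOT:
      target = cfg.max_replicas;
      break;
    case Temperature::WARM:
      target = std::min<std::size_t>(2, cfg.max_replicas);
      break;
    case Temperature::COLD:
      target = cfg.min_replicas;
      break;
  }
  return static_cast<int>(target) - static_cast<int>(current_replicas);
}
```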
External Replica Control APIs
Description:
To support external systems that require explicit control over replica placement (e.g., for locality-aware scheduling), we provide explicit Copy and Move APIs. These allow clients to manually trigger replication or migration to specific nodes, utilizing the same atomic primitives as the internal scaling logic.
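A hypothetical shape for these APIs is sketched below; the signatures and result type are assumptions, and the real interface would follow the existing Mooncake client conventions:

```cpp
#include <string>

// Hypothetical shape of the external control APIs; the actual signatures and
// error handling are not defined by this sketch.
struct ReplicaControlResult {
  bool ok = false;
  std::string message;
};

class ReplicaControlApi {
 public:
  virtual ~ReplicaControlApi() = default;

  // Copy: create an additional replica of `key` on `target_segment`,
  // using the ReplicaCopy two-phase primitive internally.
  virtual ReplicaControlResult Copy(const std::string& key,
                                    const std::string& target_segment) = 0;

  // Move: migrate a replica of `key` from `source_segment` to
  // `target_segment`, using the ReplicaMove primitive (atomic swap).
  virtual ReplicaControlResult Move(const std::string& key,
                                    const std::string& source_segment,
                                    const std::string& target_segment) = 0;
};
```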
Roadmap
Phase 1: Master-Worker Communication Infrastructure
- Implement `WorkerService` on the Client side to handle Master commands.
- Implement Worker Registry on the Master side (Client registration).
- Establish Reverse RPC mechanism (Master -> Client).
Phase 2: Replica Migration Primitives
- Implement `ReplicaCopy` and `ReplicaMove` operations on the Master.
- Implement data transfer logic on the Client (`WorkerService`).
- Implement `Copy` and `Move` APIs for external control.
Phase 3: Hot Data Detection & Basic Framework
- Implement `AccessTracker` and `HotDataDetector` in `MasterService`.
- Integrate access recording into `GetReplicaList`.
- Add metrics for access patterns and object temperature.
Phase 4: Dynamic Replica Increment (Scale Up)
- Implement Master-side scaling logic (background thread).
- Implement `ExecuteReplicaCopy` in `WorkerService`.
- Integrate hot data detection with scaling decisions.
- Basic load balancing for target node selection.
Phase 5: Replica Decrement & Optimization
- Implement cold data detection logic.
- Implement scale-down logic (garbage collection of excess replicas).
- Optimize source/target selection for transfers (topology awareness).
Other Feature Considerations
- Graceful shutdown for the client https://github.com/kvcache-ai/Mooncake/issues/607
Current Plan (Phase 1 And Phase 2)
Currently we need to support Phase 1 and Phase 2 to build the basic Master-Worker communication infrastructure and the replica migration primitives.
Design
- Add a new worker server on the client side to handle master commands.
- Add a worker registry on the master side to manage client registration and selection.
- Add new request handling logic on the master side to handle `ReplicaCopy` and `ReplicaMove` requests.
- Expose the `Copy` and `Move` APIs for external usage.
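A minimal sketch of the master-side worker registry described above (the fields, method names, and selection policy are assumptions):

```cpp
#include <algorithm>
#include <cstdint>
#include <mutex>
#include <optional>
#include <string>
#include <unordered_map>
#include <vector>

// Sketch of a possible master-side worker registry; not the final interface.
struct WorkerInfo {
  std::string worker_id;
  std::string rpc_address;               // endpoint the Master uses to reach the worker
  std::uint64_t last_heartbeat_ms = 0;   // used to skip dead workers
};

class WorkerRegistry {
 public:
  void Register(const WorkerInfo& info) {
    std::lock_guard<std::mutex> lock(mu_);
    workers_[info.worker_id] = info;
  }

  void Unregister(const std::string& worker_id) {
    std::lock_guard<std::mutex> lock(mu_);
    workers_.erase(worker_id);
  }

  // Naive target selection for ReplicaCopy/ReplicaMove: pick any registered
  // worker not in `exclude` (e.g. nodes that already hold a replica).
  // A real policy would also consider load and topology.
  std::optional<WorkerInfo> SelectTarget(const std::vector<std::string>& exclude) {
    std::lock_guard<std::mutex> lock(mu_);
    for (const auto& [id, info] : workers_) {
      if (std::find(exclude.begin(), exclude.end(), id) == exclude.end()) {
        return info;
      }
    }
    return std::nullopt;
  }

 private:
  std::mutex mu_;
  std::unordered_map<std::string, WorkerInfo> workers_;
};
```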
Any discussion and suggestions are welcome~
Hi, @zhongzhouTan-coder. Thanks for your contribution! We are developing some internal features that can benefit from this RFC. Can I help implement some functionalities of the RFC?
Hello, @nickyc975, welcome to join the development, and thanks for your support. Can we discuss more details on Slack?
Sure!
@zhongzhouTan-coder @nickyc975 We should consider this case as well. #607
Sure, this can be an optimization case such as graceful shutdown; we can consider it after Phase 1 and Phase 2 are complete.
For graceful shutdown, we can migrate data to another live node.
Regarding replica migration, @zhuxinjie-nz is currently implementing a similar mechanism:
In short, he is building a system where the master creates a task queue for each client. Clients periodically send RPC requests to the master to fetch tasks from their queue and execute them. In this implementation, a client pulls a task from the queue, then reads an object from a remote node and writes it to its local storage. After completing the write, the client reports back to the master, and the master updates the metadata accordingly.
Perhaps we could consider reusing this mechanism by allowing the task queue to support multiple types of tasks — for example, the replica addition or migration tasks described in this RFC.
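For illustration only, the polling pattern described above could look roughly like this on the client side; all names are hypothetical and do not reflect the actual interfaces in PR #1031:

```cpp
#include <chrono>
#include <string>
#include <thread>
#include <vector>

// Rough illustration of the task-queue polling pattern. All names are
// hypothetical and the RPC/transfer calls are stubbed out.
struct ReplicaTask {
  std::string task_id;
  std::string object_key;
  std::string source_node;  // node to read the object from
};

class TaskWorker {
 public:
  // Periodically fetch tasks from this client's queue on the master, execute
  // them locally, and report completion so the master can update metadata.
  void PollLoop() {
    while (running_) {
      for (const ReplicaTask& task : FetchTasks()) {
        const bool ok = ExecuteCopy(task);   // read remote, write locally
        ReportCompletion(task.task_id, ok);  // master commits or aborts
      }
      std::this_thread::sleep_for(std::chrono::seconds(1));
    }
  }

 private:
  // RPC and transfer placeholders, stubbed out for the sketch.
  std::vector<ReplicaTask> FetchTasks() { return {}; }
  bool ExecuteCopy(const ReplicaTask& /*task*/) { return true; }
  void ReportCompletion(const std::string& /*task_id*/, bool /*ok*/) {}

  bool running_ = true;
};
```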
@ykwd This is polling vs. push, and it is a good mechanism for replica migration to reuse; we do not need to implement a reverse RPC from the Master to call the Client. Can you provide a more detailed description of the task-queue-based client polling, so we can implement our move and copy on top of that implementation?
@zhuxinjie-nz Just updated this mechanism in this PR: https://github.com/kvcache-ai/Mooncake/pull/1031. Any comments or discussion are very welcome!
@ykwd Thanks, I will check the implementation and implement the copy/move feature with the task-queue-based client polling mechanism. I will also keep posting updates on progress and any problems encountered.