
[RFC]: dynamic replica management

Open zhongzhouTan-coder opened this issue 1 month ago • 11 comments

Changes proposed

Introduction

This RFC proposes a dynamic replica management system for Mooncake Store that adaptively adjusts replica configurations based on workload patterns.

Motivation

Mooncake Store currently uses a static replica configuration where each object has a fixed number of replicas determined at write time. This approach does not account for changing access patterns, leading to suboptimal performance and resource utilization. In addition, Mooncake Store currently serves reads from the first completed replica, which may not be the most efficient choice compared to a local replica. We therefore need a mechanism to dynamically adjust replicas based on object "hotness", and we may also need to optimize read replica selection.

This may introduce the following benefits:

  1. Improved Read Performance: Hot data is given more memory replicas and is therefore more likely to be served from a local replica on read.
  2. Better Resource Utilization: Memory is allocated proportionally to frequently accessed data, reducing waste.
  3. Optimized Read Replica Selection: By selecting the most efficient replica for read operations, we can further enhance performance.

This proposal introduces a comprehensive design including a hot data detection algorithm, a replica adjustment and placement strategy, and read replica selection optimization.

In Scope

  1. Hot data statistics and identification.
  2. Master-initiated replica adjustment (scale up/down).
  3. Optimized read replica selection based on locality.
  4. Logging, testing, and monitoring.

Out of Scope

  1. Data promotion/demotion across multiple storage tiers.

Proposal

Architecture

(Architecture diagram)

Architecture Description

Hot data detection algorithm - Access Frequency-Based Hot Data Detection (Basic Idea)

Description:

The algorithm tracks how many times each object is accessed within a sliding time window. Objects are classified as HOT, WARM, or COLD based on their access frequency and recency:

  • HOT: Accessed frequently
  • WARM: Moderate access patterns
  • COLD: Rarely accessed
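
To make the basic idea concrete, below is a minimal sketch of a sliding-window access tracker, assuming hypothetical names (AccessTracker, RecordAccess, Classify) and purely illustrative thresholds; the real window length and thresholds would be configurable and tuned separately.

```cpp
#include <chrono>
#include <deque>
#include <string>
#include <unordered_map>

// Sketch only: counts accesses per key inside a sliding time window and maps
// the count to HOT / WARM / COLD. Thresholds below are illustrative.
enum class Temperature { HOT, WARM, COLD };

class AccessTracker {
 public:
  explicit AccessTracker(std::chrono::seconds window) : window_(window) {}

  // Record one access, e.g. called from the GetReplicaList path.
  void RecordAccess(const std::string& key) {
    auto now = std::chrono::steady_clock::now();
    accesses_[key].push_back(now);
    Evict(key, now);
  }

  // Classify an object by how many accesses fall inside the window.
  Temperature Classify(const std::string& key) {
    auto now = std::chrono::steady_clock::now();
    Evict(key, now);
    auto it = accesses_.find(key);
    size_t count = (it == accesses_.end()) ? 0 : it->second.size();
    if (count >= kHotThreshold) return Temperature::HOT;
    if (count >= kWarmThreshold) return Temperature::WARM;
    return Temperature::COLD;
  }

 private:
  // Drop timestamps that have fallen out of the sliding window.
  void Evict(const std::string& key, std::chrono::steady_clock::time_point now) {
    auto it = accesses_.find(key);
    if (it == accesses_.end()) return;
    while (!it->second.empty() && now - it->second.front() > window_) {
      it->second.pop_front();
    }
  }

  static constexpr size_t kHotThreshold = 100;  // illustrative
  static constexpr size_t kWarmThreshold = 10;  // illustrative
  std::chrono::seconds window_;
  std::unordered_map<std::string,
                     std::deque<std::chrono::steady_clock::time_point>> accesses_;
};
```

A recency-aware variant (e.g. exponentially decayed counters) could replace the raw window count without changing this interface.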

Atomic Replica Primitives

Description:

To ensure consistency and reliability during replica management, we define two fundamental atomic operations on the Master: ReplicaCopy and ReplicaMove. These operations are designed as two-phase transactions to handle the long-running nature of data transfer while maintaining metadata consistency.

  1. ReplicaCopy (Scale Up)

    • Phase 1: Start (Allocation) - Master allocates a new replica descriptor on the target node. The replica is marked as PROCESSING and is not visible to readers.
    • Phase 2: End (Commit) - After successful data transfer, Master marks the replica as COMPLETE. It becomes immediately visible to readers.
  2. ReplicaMove (Migration)

    • Phase 1: Start (Allocation) - Master allocates a new replica descriptor on the target node (marked PROCESSING).
    • Phase 2: End (Atomic Swap) - After successful data transfer, Master atomically:
      • Marks the new replica as COMPLETE.
      • Removes the source replica from the metadata.
    • This ensures that at no point is the object unavailable or the replica count inconsistent (from the perspective of availability).
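
As a rough illustration of the two-phase primitives, the sketch below shows how the Master-side metadata updates could look; ReplicaStatus, ObjectMeta, and ReplicaManager are hypothetical stand-ins for the actual Mooncake metadata types. The important property is that the move commit marks the new replica COMPLETE and drops the source inside one critical section, so readers always see at least one COMPLETE replica.

```cpp
#include <algorithm>
#include <mutex>
#include <string>
#include <vector>

// Sketch only: readers are assumed to consider COMPLETE replicas exclusively.
enum class ReplicaStatus { PROCESSING, COMPLETE };

struct Replica {
  std::string node_id;
  ReplicaStatus status;
};

struct ObjectMeta {
  std::vector<Replica> replicas;
};

class ReplicaManager {
 public:
  // Phase 1 (Start): allocate a PROCESSING replica descriptor on the target
  // node; it is invisible to readers until committed.
  void StartTransfer(ObjectMeta& meta, const std::string& target_node) {
    std::lock_guard<std::mutex> lk(mu_);
    meta.replicas.push_back({target_node, ReplicaStatus::PROCESSING});
  }

  // Phase 2 (Commit) for ReplicaCopy: mark the new replica COMPLETE.
  void EndCopy(ObjectMeta& meta, const std::string& target_node) {
    std::lock_guard<std::mutex> lk(mu_);
    MarkComplete(meta, target_node);
  }

  // Phase 2 (Atomic Swap) for ReplicaMove: commit the target and remove the
  // source in one critical section, so availability never drops.
  void EndMove(ObjectMeta& meta, const std::string& target_node,
               const std::string& source_node) {
    std::lock_guard<std::mutex> lk(mu_);
    MarkComplete(meta, target_node);
    meta.replicas.erase(
        std::remove_if(meta.replicas.begin(), meta.replicas.end(),
                       [&](const Replica& r) { return r.node_id == source_node; }),
        meta.replicas.end());
  }

 private:
  static void MarkComplete(ObjectMeta& meta, const std::string& node) {
    for (auto& r : meta.replicas) {
      if (r.node_id == node) r.status = ReplicaStatus::COMPLETE;
    }
  }

  std::mutex mu_;
};
```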

Replica Adjustment Strategy - Master Initiated Dynamic Replica Adjustment

Description:

The master service tracks object temperature and makes decisions on replica adjustments (e.g., scale up/down). The Master actively orchestrates the replication process. Clients register themselves as "Workers" capable of executing commands. The Master selects a suitable Worker and sends RPC commands to trigger replica transfers.

This approach allows for global optimization of resource usage and transfer scheduling, decoupled from user read requests. It requires a bi-directional communication channel (or Reverse RPC) where the Master can invoke methods on the Client.
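
Below is a rough sketch of what one Master-side scaling pass might look like under this approach; Worker, ScaleUpOnce, and the selection logic are placeholders, and the actual reverse-RPC transport, error handling, and calls into the atomic primitives above are only indicated in comments.

```cpp
#include <string>
#include <vector>

// Sketch only: a registered client, as seen by the Master.
struct Worker {
  std::string node_id;

  // Reverse-RPC stub (Master -> Client). The real call would trigger a data
  // transfer in the client-side WorkerService; here it just pretends to succeed.
  bool ExecuteReplicaCopy(const std::string& /*key*/,
                          const std::string& /*source_node*/) {
    return true;  // placeholder for the actual transfer result
  }
};

// One pass of the background scaling logic: for each HOT object, pick a
// target worker and drive a two-phase ReplicaCopy.
void ScaleUpOnce(const std::vector<std::string>& hot_keys,
                 std::vector<Worker>& workers) {
  if (workers.empty()) return;
  for (const auto& key : hot_keys) {
    Worker& target = workers.front();  // placeholder for load-balanced selection
    const std::string source = "node-with-complete-replica";  // placeholder choice
    // Phase 1 (Start): allocate a PROCESSING replica descriptor for `key`
    // on target.node_id (see the atomic primitives above).
    if (target.ExecuteReplicaCopy(key, source)) {
      // Phase 2 (End): mark the new replica COMPLETE so readers can use it.
    } else {
      // Transfer failed: roll back the PROCESSING descriptor.
    }
  }
}
```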

External Replica Control APIs

Description:

To support external systems that require explicit control over replica placement (e.g., for locality-aware scheduling), we provide explicit Copy and Move APIs. These allow clients to manually trigger replication or migration to specific nodes, utilizing the same atomic primitives as the internal scaling logic.
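
The exact signatures will be settled during Phase 2; one possible shape, with hypothetical names, is:

```cpp
#include <string>

// Sketch only: both calls are thin wrappers over the ReplicaCopy / ReplicaMove
// primitives described above.
class ReplicaControlApi {
 public:
  virtual ~ReplicaControlApi() = default;

  // Create an additional replica of `key` on `target_node`.
  virtual bool Copy(const std::string& key, const std::string& target_node) = 0;

  // Move one replica of `key` from `source_node` to `target_node`.
  virtual bool Move(const std::string& key, const std::string& source_node,
                    const std::string& target_node) = 0;
};
```

For example, a locality-aware scheduler could call Move to place a replica on the node that is about to read it.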

Roadmap

Phase 1: Master-Worker Communication Infrastructure

  • Implement WorkerService on Client side to handle Master commands.
  • Implement Worker Registry on Master side (Client registration).
  • Establish Reverse RPC mechanism (Master -> Client).

Phase 2: Replica Migration Primitives

  • Implement ReplicaCopy and ReplicaMove operations on Master.
  • Implement data transfer logic on Client (WorkerService).
  • Implement Copy and Move APIs for external control.

Phase 3: Hot Data Detection & Basic Framework

  • Implement AccessTracker and HotDataDetector in MasterService.
  • Integrate access recording into GetReplicaList.
  • Add metrics for access patterns and object temperature.

Phase 4: Dynamic Replica Increment (Scale Up)

  • Implement Master-side scaling logic (background thread).
  • Implement ExecuteReplicaCopy in WorkerService.
  • Integrate hot data detection with scaling decisions.
  • Basic load balancing for target node selection.

Phase 5: Replica Decrement & Optimization

  • Implement cold data detection logic.
  • Implement scale-down logic (garbage collection of excess replicas).
  • Optimize source/target selection for transfers (topology awareness).

Other Feature Considerations

  • Graceful shutdown for the client https://github.com/kvcache-ai/Mooncake/issues/607

Current Plan (Phase 1 and Phase 2)

Currently we need to support Phase 1 and Phase 2 to build the basic Master-Worker communication infrastructure and the replica migration primitives.

Design

(Design diagram)
  1. Add a new worker server on the client side to handle Master commands.
  2. Add a worker registry on the master side to manage client registration and selection (see the sketch after this list).
  3. Add new request handling logic on the master side to handle ReplicaCopy and ReplicaMove requests.
  4. Expose the Copy and Move APIs for external usage.
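
As a concrete illustration of step 2, a minimal Master-side worker registry could look like the sketch below; the field set, heartbeat handling, and selection policy are hypothetical and still open for discussion.

```cpp
#include <chrono>
#include <mutex>
#include <string>
#include <unordered_map>

// Sketch only: tracks which clients have registered as workers and where the
// Master should send ReplicaCopy / ReplicaMove commands.
struct WorkerInfo {
  std::string rpc_address;  // endpoint for Master -> Client commands
  std::chrono::steady_clock::time_point last_heartbeat;
};

class WorkerRegistry {
 public:
  void Register(const std::string& node_id, const std::string& rpc_address) {
    std::lock_guard<std::mutex> lk(mu_);
    workers_[node_id] = {rpc_address, std::chrono::steady_clock::now()};
  }

  void Unregister(const std::string& node_id) {
    std::lock_guard<std::mutex> lk(mu_);
    workers_.erase(node_id);
  }

  size_t Size() const {
    std::lock_guard<std::mutex> lk(mu_);
    return workers_.size();
  }

 private:
  mutable std::mutex mu_;
  std::unordered_map<std::string, WorkerInfo> workers_;
};
```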

Before submitting a new issue...

  • [x] Make sure you already searched for relevant issues and read the documentation

zhongzhouTan-coder avatar Nov 24 '25 12:11 zhongzhouTan-coder

Any discussion and suggestions are welcome~

zhongzhouTan-coder avatar Nov 24 '25 12:11 zhongzhouTan-coder

Hi, @zhongzhouTan-coder. Thanks for your contribution! We are developing some internal features that can benefit from this RFC. Can I help implement some functionalities of the RFC?

nickyc975 avatar Nov 25 '25 02:11 nickyc975

Hello, @nickyc975, welcome aboard, and thanks for your support. Can we discuss more details on Slack?

zhongzhouTan-coder avatar Nov 25 '25 03:11 zhongzhouTan-coder


Sure!

nickyc975 avatar Nov 25 '25 03:11 nickyc975

@zhongzhouTan-coder @nickyc975 We should consider this case as well. #607

stmatengss avatar Nov 25 '25 05:11 stmatengss

Sure, this can be handled as an optimization, such as graceful shutdown; we can consider it after Phase 1 and Phase 2 are complete.

zhongzhouTan-coder avatar Nov 25 '25 06:11 zhongzhouTan-coder


For graceful shutdown, we can migrate data to another live node.

stmatengss avatar Nov 25 '25 10:11 stmatengss

Regarding replica migration, @zhuxinjie-nz is currently implementing a similar mechanism:

In short, he is building a system where the master creates a task queue for each client. Clients periodically send RPC requests to the master to fetch tasks from their queue and execute them. In this implementation, a client pulls a task from the queue, then reads an object from a remote node and writes it to its local storage. After completing the write, the client reports back to the master, and the master updates the metadata accordingly.

Perhaps we could consider reusing this mechanism by allowing the task queue to support multiple types of tasks — for example, the replica addition or migration tasks described in this RFC.
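
A minimal sketch of this pull-based mechanism, with hypothetical names (ReplicationTask, TaskQueueManager), could look like the following; the actual implementation is in the PR linked later in this thread and may differ.

```cpp
#include <deque>
#include <mutex>
#include <optional>
#include <string>
#include <unordered_map>

// Sketch only: the Master keeps one task queue per client; clients poll it
// over RPC instead of the Master pushing commands.
struct ReplicationTask {
  std::string key;          // object to replicate
  std::string source_node;  // node holding an existing replica to read from
};

class TaskQueueManager {
 public:
  // Master side: enqueue a task for a specific client.
  void Enqueue(const std::string& client_id, ReplicationTask task) {
    std::lock_guard<std::mutex> lk(mu_);
    queues_[client_id].push_back(std::move(task));
  }

  // Client side (via periodic RPC): fetch the next task from its own queue.
  std::optional<ReplicationTask> Poll(const std::string& client_id) {
    std::lock_guard<std::mutex> lk(mu_);
    auto& q = queues_[client_id];
    if (q.empty()) return std::nullopt;
    ReplicationTask task = std::move(q.front());
    q.pop_front();
    return task;
  }

  // Client reports completion; the Master would update object metadata here.
  void ReportDone(const std::string& /*client_id*/, const ReplicationTask& /*task*/) {}

 private:
  std::mutex mu_;
  std::unordered_map<std::string, std::deque<ReplicationTask>> queues_;
};
```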

ykwd avatar Nov 28 '25 08:11 ykwd

@ykwd This is polling vs. push, and it is good functionality for replica migration to reuse; we would not need to implement a reverse RPC from the master to call the client. Can you provide a more detailed description of the task-queue-based client polling, so we can implement our move and copy on top of that implementation?

zhongzhouTan-coder avatar Nov 28 '25 09:11 zhongzhouTan-coder


@zhuxinjie-nz Just updated this mechanism to this PR: https://github.com/kvcache-ai/Mooncake/pull/1031. Any comments or discussion are very welcome!

ykwd avatar Nov 28 '25 09:11 ykwd

@ykwd Thanks, I will check the implementation and build the copy/move feature on top of the task-queue-based client polling mechanism. I will also continue to post updates on the progress and any problems encountered.

zhongzhouTan-coder avatar Nov 28 '25 10:11 zhongzhouTan-coder