
[RFC]: Mooncake-Conductor: Design and Implementation of a Global Scheduler Module for KV-Cache-Centric Disaggregated Architecture

Open · Asher-XunZhang opened this issue 2 months ago · 8 comments

Changes proposed

1. Introduction

This proposal outlines the design and implementation of the "KVCache-centric Scheduling Algorithm" described in the associated research paper, and of the Mooncake-Conductor module that realizes it: the Global Scheduler for the Mooncake Disaggregated Architecture. The Conductor's core responsibility follows the paper's central thesis of being KVCache-centric: it intelligently assigns each user request to one Prefill Instance and one Decode Instance while managing the data layout of the distributed KVCache pool. The objective is to maximize system-level goodput while meeting time-to-first-token (TTFT) and time-between-tokens (TBT) SLOs.

2. Motivation and Background

The Mooncake Architecture addresses resource utilization bottlenecks and SLO assurance challenges in long-context, high-concurrency scenarios prevalent in traditional coupled architectures by constructing an independent, distributed KVCache pool (Mooncake Store). However, the high performance of this disaggregated architecture is contingent upon a "brain" capable of global optimized scheduling.

  • Current Limitations: The current collaboration between Mooncake and inference engines like vLLM and SGLang primarily involves framework-level adaptation for the storage pool. A dedicated, strongly-stateful, predictable global scheduler deeply integrated with Mooncake's design philosophy has not yet been implemented as an independent module.
  • Core Value: Implementing the Mooncake-Conductor module will enable:
    • Unified Scheduling Policy: Provide an official, standard scheduling implementation for the Mooncake architecture, reducing integration complexity for users.
    • Unlocking Architectural Potential: Fully leverage the performance and cost advantages of the disaggregated architecture through precise cache-awareness, load balancing, and predictive scheduling. Technical reports indicate this can improve throughput by up to 525% in simulated scenarios.
    • Ecosystem Development: A well-defined, extensible Conductor interface will facilitate the integration of more inference engines and hardware accelerators into the Mooncake ecosystem.

3. Detailed Design

3.1 Architectural Overview and Core Data Flow

Architectural Description

The Mooncake Conductor should work closely with the Mooncake Store, with inference frameworks such as vLLM and SGLang "sandwiched" between them, forming a three-layer architecture. This layering aligns with the design principles outlined in the Mooncake technical report.

The core concept of Mooncake is the KVCache-centric Disaggregated Architecture. The division of responsibilities among its key components is as follows:

• Top Layer (Scheduling Layer): The Mooncake Conductor acts as the global scheduler, serving as the "brain" of the system. It does not perform specific computations but is responsible for macroscopic resource orchestration and task dispatching.

• Middle Layer (Compute Layer): Inference frameworks such as vLLM and SGLang function as computational engines, "sandwiched" in the middle. They focus on efficiently executing Prefill or Decode computation tasks on individual nodes. These frameworks are agnostic to the global resource distribution, such as the state of other nodes or the global location of the KVCache.

• Bottom Layer (Storage and Data Plane Layer): The Mooncake Store (distributed KVCache storage) and the Mooncake Transfer Engine (high-performance data transfer engine, e.g., the RDMA-based Messenger) constitute the underlying storage and data movement infrastructure.

Consequently, these three layers form a cohesive "scheduling-compute-storage" sandwich structure.

Mooncake-Conductor

Mooncake-Conductor is deployed as an independent process or service, acting as the entry point for requests into the Mooncake cluster. Its core data flow is as follows:

flowchart TD

A[User Request] --> C(Conductor)
subgraph Conductor["Global Scheduler (Conductor)"]
    direction TB
    C --> P[Cache-aware Prefill Scheduler]
    P --> D[Load-balance Decoding Scheduler]
    K[KVCache Balance Scheduler]
end

P -- Assigns Prefill Instance to Request --> PF
D -- Assigns Decode Instance to Request --> DC

subgraph PF["Prefill Instance"]
    PF1[Load Reusable Prefix KVCache<br>from Mooncake Store]
    PF2[Perform Incremental Prefill Computation]
    PF3[Write Newly Generated Delta KVCache<br>back to Mooncake Store]
    PF1 --> PF2 --> PF3
end

subgraph MS[Mooncake Store]
    MS1[(Distributed KVCache Pool)]
end

subgraph DC["Decode Instance"]
    DC1[Asynchronously Load Full KVCache<br>from Mooncake Store to GPU]
    DC2[Perform Decoding<br>Generate Tokens]
    DC1 --> DC2
end

PF1 -- Get() Prefix KVCache --> MS1
PF3 -- Put() Delta KVCache --> MS1
MS1 -- Get() Full KVCache --> DC1
DC2 -- Stream Output --> U[Return Tokens to User]

K -- Background Async Management: <br>Hotspot Migration/Cache Eviction --> MS1

3.2 Core Scheduler Design and Algorithms

3.2.1 Cache-aware Prefill Scheduler

  • Core Objective: To select a Prefill Instance for a new request, aiming to minimize Time-To-First-Token (TTFT).
  • Core Algorithm: Cost Function-driven Multi-objective Optimization.
# Pseudocode Example (Python syntax): estimated TTFT cost function.
# Helper functions (query_global_prefix_tree, calculate_transfer_time, etc.)
# are placeholders for the Conductor's internal services.

def estimate_ttft_cost(prefill_instance, request):
    # 1. Cache Match Evaluation: length and location (HBM/DRAM/SSD) of the
    #    longest reusable prefix known to the global index.
    prefix_match_length, cache_location = query_global_prefix_tree(
        request.prompt, prefill_instance.associated_cache_blocks)
    cache_transfer_time = calculate_transfer_time(prefix_match_length, cache_location)

    # 2. Instance Load Evaluation
    queue_wait_time = estimate_queue_delay(prefill_instance.current_queue_depth)
    computation_time = estimate_prefill_compute_time(request.length - prefix_match_length)

    # 3. Aggregate Cost
    total_estimated_ttft = cache_transfer_time + queue_wait_time + computation_time

    # 4. Subtract an explicit cache-reuse bonus on top of the reduced compute time;
    #    alpha is a tunable weighting factor (configuration parameter).
    cache_benefit = alpha * prefix_match_length

    cost = total_estimated_ttft - cache_benefit
    return cost
  • Decision Mechanism: A multi-objective optimization over the candidate instances. The scheduler computes a cost for each available Prefill instance, primarily balancing two factors (a selection sketch follows this list):
    • Cache Locality: Assesses the presence of reusable KVCache (prefix cache) locally or in the associated Mooncake Store. Reusing cache skips redundant computation and can significantly shorten prefill time.
    • Instance Load (Challenge): Checks the instance's current queue depth and estimated computation time to avoid sending requests to already-busy instances, which would introduce queuing delays.
  • Inspirations: Draws on the global cache-state awareness and cost function of NVIDIA Dynamo's intelligent router, and on the hybrid routing strategy of the SGLang Router (cache-first when the system is balanced, load-first when it is imbalanced).
  • Challenge Analysis:
    • Asynchronous KVCache transfer times are difficult to predict; estimation errors may be significant in deployment and can lead to mis-scheduling if the prediction model lacks flexibility.
    • The scheduler should not blindly assign a request to the instance with the longest prefix match; the cost function must be accurate enough to weigh transfer efficiency against recomputation efficiency.
    • Requires a mechanism to record and update the mapping from each node to the KVCache blocks it stores.
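
To make the decision mechanism concrete, a minimal selection sketch is given below in the same pseudocode style. It reuses the estimate_ttft_cost function above; the candidate list and helper names are illustrative assumptions, not a fixed interface.

# Pseudocode sketch: pick the Prefill instance with the lowest estimated TTFT cost.

def select_prefill_instance(prefill_instances, request):
    scored = [(estimate_ttft_cost(instance, request), instance)
              for instance in prefill_instances]
    best_cost, best_instance = min(scored, key=lambda pair: pair[0])
    return best_instance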

3.2.2 Load-balance Decoding Scheduler

  • Core Objective: To select a Decode Instance for requests that have completed the prefill phase, aiming to optimize throughput while meeting TBT SLO requirements.
  • Core Algorithm: Centered on Load Balancing and SLO Compliance. The scheduler examines the current load of all Decode instances, such as each instance's batch size and queue length, and assigns new requests to the least-loaded instance to prevent TBT spikes caused by overload (a policy sketch follows this list).
    • Load Scoring: load_score = current_batch_size + k * waiting_queue_length (where k is a penalty coefficient).
    • SLO Compliance Check (Challenge): Predict the TBT of each Decode instance over a future time window, only assigning requests to instances predicted to meet their TBT SLO.
    • Early Rejection Mechanism: If predictions indicate no instance can satisfy the request's TBT SLO upon completion, reject the request outright before the prefill phase to avoid resource waste.
  • Inspirations: Incorporates ideas from Ant's AI Gateway regarding multi-objective trade-offs and SLO guarantee, and SGLang Router's dynamic load assessment and anti-hotspot mechanism (random selection from a low-load group).
  • Challenge Analysis:
    • Requires predicting the TBT for each Decode instance over a future time window.
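
A minimal sketch of this policy, in the same pseudocode style, is given below. The instance fields (current_batch_size, waiting_queue_length, recent_tbt_samples) and the simple sliding-window TBT estimate are illustrative assumptions; a production implementation would use the prediction models discussed in Section 4.

# Pseudocode sketch: load-balance Decode scheduling with a TBT SLO check.

K = 0.5  # penalty coefficient for queued (not yet batched) requests

def load_score(instance):
    return instance.current_batch_size + K * instance.waiting_queue_length

def predicted_tbt(instance):
    # Sliding-window average of recent per-token latencies, inflated by the
    # marginal load one additional request would add (a rough heuristic).
    window = instance.recent_tbt_samples[-32:]
    baseline = sum(window) / max(len(window), 1)
    return baseline * (1 + 1 / max(instance.current_batch_size, 1))

def select_decode_instance(decode_instances, request):
    # Keep only instances predicted to stay within the request's TBT SLO.
    feasible = [i for i in decode_instances if predicted_tbt(i) <= request.tbt_slo]
    if not feasible:
        return None  # early rejection: no instance is predicted to meet the SLO
    return min(feasible, key=load_score)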

3.2.3 KVCache Balance Scheduler

  • Core Objective: Functions as a background daemon managing the data distribution of KVCache within the Mooncake Store, preventing access hotspots and bottlenecks to enhance the overall efficiency and throughput of the cache system.
  • Core Algorithm: Heat-aware Hierarchical Storage and Automated Hotspot Migration.
    • Heat Quantification (Challenge): Compute a heat value for KVCache blocks based on factors such as access frequency, recency of access, and the SLO tier of the associated requests (a heat-score sketch follows this list).
    • Data Migration: Monitor access frequency to identify "hot" cache blocks (e.g., frequently accessed common prefixes) and potentially create replicas across multiple storage nodes for load distribution.
      • Hierarchical Storage:
        • Hot Data Promotion: Migrate or replicate high-frequency access KVCache blocks to faster storage tiers (e.g., GPU HBM).
        • Cold Data Demotion: Move low-frequency access data to cheaper storage (e.g., SSD), employing strategies like LRU/LFU.
      • Hotspot Migration: For globally hot data, automatically create replicas across multiple storage nodes to avoid single-point access bottlenecks.
  • Inspirations: References data heat identification and hierarchical storage strategies from traditional distributed storage systems, and the multi-tenancy and cost-aware caching concept from Ant's AI Gateway, applying differentiated policies based on the business priority of cached data.
  • Challenge Analysis:
    • Quantifying data heat.
    • Interacting with Mooncake-Store to carry out the resulting data migrations.
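
As a starting point for the heat quantification discussed above, the sketch below combines access frequency, recency, and an SLO/priority weight into a single score and maps it to a placement action. All field names, weights, and thresholds are illustrative assumptions.

# Pseudocode sketch: heat scoring and tier placement for a KVCache block.

import time

W_FREQ, W_RECENCY, W_PRIORITY = 1.0, 2.0, 0.5
HOT_THRESHOLD, COLD_THRESHOLD = 10.0, 1.0

def heat_score(block, now=None):
    now = now if now is not None else time.time()
    recency = 1.0 / (1.0 + now - block.last_access_ts)   # decays as the block ages
    return (W_FREQ * block.access_count_per_min
            + W_RECENCY * recency
            + W_PRIORITY * block.slo_tier)                # higher tier = more important

def placement_decision(block):
    score = heat_score(block)
    if score >= HOT_THRESHOLD:
        return "promote_or_replicate"   # move/copy toward HBM; replicate if globally hot
    if score <= COLD_THRESHOLD:
        return "demote"                 # push toward DRAM/SSD (LRU/LFU-style eviction)
    return "keep"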

3.3 External Interface Design (Preliminary)

  • Request Reception Interface (REST/gRPC): Receives inference requests, including the prompt and parameters.
  • Resource Discovery and Status Collection Interface: Interacts with the Prefill/Decode resource pools and Mooncake Store to obtain real-time node health status, load metrics (GPU utilization, queue length), and KVCache distribution information.
  • Scheduling Decision Dispatch Interface: Dispatches assignment decisions to the designated Prefill and Decode instances (a data-structure sketch follows this list).
  • Management API: Provides administrative functions such as cluster status queries and dynamic policy adjustments.
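
For illustration only, the sketch below shows plausible payloads for the status-collection and decision-dispatch interfaces. The concrete gRPC/REST schema is still to be defined as part of this RFC, and every field name here is an assumption.

# Pseudocode sketch: candidate payloads for the Conductor's external interfaces.

from dataclasses import dataclass, field

@dataclass
class InstanceStatus:
    instance_id: str
    role: str                      # "prefill" or "decode"
    gpu_utilization: float
    queue_length: int
    batch_size: int
    cached_block_hashes: list[str] = field(default_factory=list)

@dataclass
class SchedulingDecision:
    request_id: str
    prefill_instance_id: str
    decode_instance_id: str
    reusable_prefix_blocks: list[str] = field(default_factory=list)  # block hashes to Get() from Mooncake Store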

4. Architecture-Level Implementation Challenges and Mitigation Strategies

  1. Prediction Accuracy Challenge:
    • Challenge: Accurate prediction of TTFT and TBT is affected by network jitter, GPU computation variance, request characteristic uncertainty, etc. Prediction deviations may lead to scheduling errors.
    • Mitigation Strategies:
      • (Complex but relatively accurate) Employ lightweight machine learning models (e.g., gradient-boosted trees) or time-series forecasting algorithms, combining extensive historical monitoring data for offline training with online learning.
      • (Relatively simple but higher error) Use sliding-window averaging with prediction confidence-interval assessment, and adopt more conservative scheduling strategies when prediction uncertainty is high.
  2. State Consistency and Performance Overhead:
    • Challenge: The Conductor needs to maintain a global resource view. In large-scale clusters, frequent state synchronization can introduce significant network overhead (O(n²) pressure) and information latency, causing the scheduler to make decisions based on stale data.
    • Mitigation Strategies:
      • Adopt incremental updates and an event-driven state push mechanism, rather than full polling. Index updates are triggered locally only when KVCache blocks are created, migrated, or evicted, avoiding high-frequency full synchronization.
      • Design a global index based on a Block Radix Tree or similar structure, partitioning the global state by resource pool. The design philosophy aligns with ObjectMetadata in MasterService: record only the location of KVCache blocks within the physical storage pool and their key attributes (the Conductor records node location only, not Segment details). When a new request arrives, the Conductor queries this tree to quickly determine the common prefix between the request's prompt and the KVCache blocks in the cluster, and thus estimate the cache-match degree for each candidate Prefill node (a minimal index sketch follows this list).
      • Set a short validity window on state information and develop compensation mechanisms for decisions made on slightly stale state. For scheduling decisions, the index only needs brief eventual consistency rather than strong consistency (e.g., allowing ReplicaStatus to be Processing), which reduces synchronization complexity and overhead.
  3. Compatibility with Heterogeneous Inference Engines:
    • Challenge: How to provide a unified and efficient integration interface for inference engines with different architectures like vLLM and SGLang, considering the existing upstream service patterns (e.g., Proxy, P2P) implemented by these engines.
    • Mitigation Strategies:
      • The Conductor should be implemented as an RPC-based service, defining a set of standardized gRPC interfaces and data structures describing instance status, cache block information, task instructions, etc.
      • Provide client adapters for different engines, maintained collaboratively by the community.
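
Relating to the global index in challenge 2 above, here is a minimal sketch of a block-hash prefix index mapping a request's block-hash chain to the nodes holding matching prefixes. It uses a flat dictionary of hash chains rather than a full radix tree, and every name in it is an assumption.

# Pseudocode sketch: a simplified global prefix index keyed by block hash.

from collections import defaultdict

class GlobalPrefixIndex:
    def __init__(self):
        # block_hash -> set of node_ids currently holding that block
        self._locations = defaultdict(set)

    def on_block_stored(self, block_hash, node_id):
        self._locations[block_hash].add(node_id)

    def on_block_evicted(self, block_hash, node_id):
        self._locations[block_hash].discard(node_id)

    def longest_prefix_match(self, block_hashes):
        # Returns {node_id: number of consecutive leading blocks held by that node}.
        # block_hashes must be computed exactly as the inference engine computes
        # its prefix-cache keys (see the key-consistency discussion below).
        matches = defaultdict(int)
        live_nodes = None
        for depth, block_hash in enumerate(block_hashes, start=1):
            holders = self._locations.get(block_hash, set())
            live_nodes = set(holders) if live_nodes is None else live_nodes & holders
            if not live_nodes:
                break
            for node in live_nodes:
                matches[node] = depth
        return dict(matches)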

5. Expected Benefits and Evaluation Metrics

  • Performance Improvement: Aim for a significant increase in system goodput (target >30%) under typical long-context workloads compared to a baseline version without the Conductor's intelligent scheduling, while maintaining TTFT and TBT SLO attainment rates (P99) above 99.9%.
  • Resource Utilization: Improve the overall utilization of GPU compute resources and storage resources through intelligent KVCache layout and load balancing.
  • System Stability: Enhance cluster robustness during traffic peaks through predictive early rejection and overload protection.

6. Development Plan (Preliminary)

  1. Community Discussion and Design Finalization: Thoroughly discuss this RFC within the community to refine interfaces and algorithms.
  2. Core Service Framework and Interface Design: Implement the Conductor service framework and basic communication modules.
  3. Scheduler Implementation: Implement the various Schedulers (Prefill, Decoding, KVCache Balance).
  4. Integration Testing and Performance Tuning: Conduct integration testing with Mooncake Store and mainstream inference engines like vLLM and SGLang, followed by performance tuning under simulated and real-world loads.
  5. Ongoing Maintenance: Continuously add support for interfaces to different inference frameworks.

7. Comparison with and Lessons Learned from Other Scheduling Frameworks

Comparison

The comparison covers Mooncake Conductor, SGLang Router, NVIDIA Dynamo, Ant Group AI Gateway, and AIBrix across five dimensions.

Core Positioning
  • Mooncake Conductor: System-level global scheduler. KVCache-centric; physically disaggregates the Prefill and Decode stages and acts as the central coordinator of the Mooncake KVCache-centric architecture, managing compute-storage synergy.
  • SGLang Router: Lightweight routing layer. Focuses on request distribution and load balancing across multiple nodes, often decoupled from inference engines; a cache-aware load balancer for multi-worker environments, optimizing cross-request reuse for RadixAttention.
  • NVIDIA Dynamo: Cluster-level orchestration framework core. A KVCache-aware intelligent router for distributed GPU clusters, minimizing KVCache recomputation; acts as the "OS for the AI factory," handling resource scheduling and routing in distributed environments.
  • Ant Group AI Gateway: Enterprise-grade model gateway for multi-model, multi-tenant production environments, integrating traffic management, cost control, and intelligent routing.
  • AIBrix: vLLM's cloud-native control plane. Handles resource orchestration at the K8s level, focusing on AI service pipeline (DAG) orchestration and scheduling request flow across models/processing units.

Relationship with KV Cache
  • Mooncake Conductor: Direct management and scheduling. Directly manages the physically disaggregated distributed KVCache pool (Mooncake Store), making decisions on cache prefilling, migration, replication, and eviction.
  • SGLang Router: Indirect awareness and utilization. Estimates the match ratio of node-local KVCache via a global approximate prefix tree, guiding requests toward reuse.
  • NVIDIA Dynamo: Global awareness and routing. Tracks KVCache distribution cluster-wide via an intelligent router, routing new requests to nodes with high cache-reuse potential.
  • Ant Group AI Gateway: Business-layer cache awareness. Treats KVCache state as a key input for intelligent routing decisions.
  • AIBrix: No direct interaction. Scheduling occurs at the request level and does not delve into fine-grained KVCache management.

Relationship with Node Load
  • Mooncake Conductor: Deep integration and prediction. Considers KVCache location and dynamically monitors Prefill/Decode node load (queue depth, GPU utilization), employing prediction models to avoid overload and implement early rejection.
  • SGLang Router: Dynamic load balancing. Uses a hybrid strategy: prioritizes cache match when load is balanced, otherwise prioritizes the least loaded node; cache hits are the primary goal, load balance secondary.
  • NVIDIA Dynamo: Multi-objective cost function. Routing decisions balance the KVCache overlap score against real-time node load (GPU utilization, queue length), prioritizing KVCache hits over simply choosing the idlest node.
  • Ant Group AI Gateway: Multi-objective trade-off. Balances inference latency, throughput, API cost, and budget control in real time in its routing decisions, achieving load balance and cost optimization.
  • AIBrix: Workflow node load balancing. Schedules based on the current availability and performance of processing nodes within a pipeline.

Scheduling Scope
  • Mooncake Conductor: Cross-heterogeneous resource pools, end-to-end co-scheduling. Spans the Prefill cluster, Decode cluster, and distributed cache pool; selects a Prefill-Decode node pair for each request and manages KVCache transfer between them.
  • SGLang Router: Multi-node request distribution across a pool of homogeneous or heterogeneous worker nodes.
  • NVIDIA Dynamo: Cluster-level resource optimization across a large GPU cluster.
  • Ant Group AI Gateway: Gateway-level traffic governance, distributing and governing traffic among backend model instances or clusters.
  • AIBrix: Workflow/pipeline-level orchestration over a DAG composed of multiple models or services.

Target Outcome
  • Mooncake Conductor: Disaggregate Prefill/Decode via an independent KVCache storage layer, enabling elastic scaling of compute resources and efficient long-context support.
  • SGLang Router: Increase the cache reuse rate, significantly reducing recomputation, especially in shared-prefix scenarios.
  • NVIDIA Dynamo: Accelerate inference and save compute by avoiding KVCache recomputation.
  • Ant Group AI Gateway: Guarantee service SLOs, control costs, and achieve multi-tenant isolation.
  • AIBrix: Achieve cost-efficiency and fairness for large-scale vLLM deployments.

Lessons Learned

Based on the comparative analysis above, the algorithm implementation for Mooncake Conductor can draw the following actionable insights from the scheduling strategies of other components:

Each entry below names the inspirational framework, the core adoptable insight, and the corresponding Mooncake implementation consideration.

Cache-aware Prefill Scheduler
  • NVIDIA Dynamo. Insight: global cache-state awareness and cost function; maintain a global prefix tree, compute the overlap score between requests and node caches, and select the optimal node via a cost function incorporating real-time node load (GPU utilization, queue length). Mooncake consideration: must adapt to Mooncake's disaggregated architecture; the cost function needs a "cache transfer time" dimension, and the decision becomes selecting a "Prefill-Decode node pair" rather than a single node.
  • SGLang Router. Insight: hybrid routing and anti-hotspot strategy; dynamically switch priority based on system load (cache-first when balanced, load-first when imbalanced), and randomly select from a low-load group to avoid creating new hotspots. Mooncake consideration: applicable for initial routing among Prefill instance groups, enabling fast, lightweight load distribution.

KVCache Balance Scheduler
  • Traditional distributed storage. Insight: data-heat identification and tiered storage; use LRU/LFU and similar policies to identify hot data, automatically migrate it to fast storage (GPU HBM), and evict cold data to cheap storage (SSD). Mooncake consideration: the core challenge is quantifying "heat", which requires a multi-dimensional definition based on access frequency, request SLO tier, business priority, etc.
  • Ant AI Gateway. Insight: multi-tenancy and cost-aware caching; implement differentiated cache policies (TTL, replica count) for users/requests of varying importance, optimizing overall business value. Mooncake consideration: requires fine-grained cache quota and priority management within Mooncake Store.

Load-balance Decoding Scheduler
  • Ant AI Gateway. Insight: multi-objective trade-off and SLO guarantee; integrate latency, throughput, cost, and SLA considerations into routing decisions, implementing early rejection for requests likely to violate SLOs. Mooncake consideration: the core constraint for the Decoding scheduler is the TBT SLO, which can be integrated as a hard constraint.
  • SGLang Router. Insight: dynamic load assessment and smooth allocation; calculate a load score based on real-time task count and queue length, randomly selecting from the least-loaded group for smooth distribution. Mooncake consideration: requires finer-grained load monitoring and prediction within the Mooncake Decode cluster to avoid imbalances due to information lag.

This proposal details the design rationale, core algorithms, implementation challenges, and development plan. I kindly request that community maintainers and developers review it and share their feedback.


Asher-XunZhang · Oct 28 '25 11:10

The five aforementioned scheduling systems exhibit distinct design philosophies and primary focuses:

• The Mooncake Conductor assumes the most comprehensive role. It functions as a system-level master planner, orchestrating and deeply managing the three disaggregated resource pools: Prefill, Decode, and the KVCache storage.

• Both Ant Group's AI Gateway and the SGLang Router act as traffic ingress points, but with different emphases. Ant Group's AI Gateway is positioned as a mature enterprise-grade product, emphasizing multi-objective optimization across performance, cost, and Service Level Objectives (SLOs). In contrast, the SGLang Router focuses on the specific technical challenge of cache-aware load balancing, making it a lighter-weight, more specialized component.

• NVIDIA's Dynamo Smart Router and AIBrix's intelligent routing gateway both serve larger-scale, distributed inference systems. Dynamo's strength lies in its modular design and strong global cache-state awareness. AIBrix, by contrast, emphasizes deep integration with the Kubernetes ecosystem and optimizes GPU resource allocation through mechanisms like task prioritization and GPU-affinity scheduling.

Asher-XunZhang · Oct 28 '25 11:10

Hi @Asher-XunZhang. Thank you for your comprehensive proposal! I understand that building the Conductor is a large-scale effort, but also a highly meaningful one. I'd like to ask a few questions regarding the roadmap:

  1. Which inference engines will Mooncake-Conductor integrate with first?
  2. What modifications will the current Mooncake-Store architecture require to support the Conductor's functionality?
  3. Since the Conductor includes many potential features, what components should be included in a minimal, runnable PoC (for example, service discovery and a 1P1D disaggregated router)?

Thanks again for this RFC.

chestnut-Q · Oct 29 '25 03:10

Thank you for your recognition of the RFC proposal and the valuable questions raised! Below, I will address your three questions one by one, based on my previous RFC:

  1. I intend to prioritize end-to-end integration with vLLM to form a demonstrable closed loop before gradually expanding the ecosystem. The reasons are as follows:
    • vLLM currently lacks its own global scheduling and routing mechanism, whereas SGLang already has the SGLang Router. We will therefore first build a version of Conductor based on the SGLang Router implementation and the algorithms described in the paper (while also drawing on scheduling ideas from other schedulers), adapt it to vLLM, and evaluate its benefits.
    • vLLM natively exposes APIs such as /metrics for obtaining metrics from deployed nodes, which is highly beneficial for the Conductor layer.
    • After this, we will adapt other general-purpose inference engines such as SGLang and TensorRT-LLM, which either have their own global schedulers or are less universally deployed. An Adapter layer will sit in the middle, following the same design philosophy as the transport component, so that inference engine developers can integrate against mature ecosystem interfaces.
  2. My goal is to avoid disrupting the existing structure and design philosophy of Mooncake-Store as much as possible, striving to add rather than delete or modify. Currently, the necessary modifications to the Mooncake-Store architecture involve two main points:
    • For the Cache-aware Prefill Scheduler: A lightweight metadata indexing service (Radix Tree) will be added to the Master Service to enable quick filtering of P-nodes with high hit rates in the Scheduler. The current challenge is that, under the existing design, key calculation in the resource pool depends on intermediate steps at the inference engine layer: encoding based on the deployed model's vocabulary (consistency in model encoding; in vLLM this can be obtained by calling the /tokenize API of any P-node in a cluster deploying the same model), partitioning into minimal units according to the configured block size (consistency in block partitioning; currently no elegant solution), and hashing each block (consistency in hash calculation; currently no elegant solution). Under the current framework design, the inference engine layer only becomes involved after the Conductor has assigned the request, whereas computing the global cache-hit distribution must happen before the inference engine stage. This makes it difficult to keep the computation consistent with the prefix-caching scheme of Prefill Instances that have not yet been assigned (a key-hashing sketch follows this list).
    • For the KVCache Balance Scheduler: A set of monitoring APIs needs to be exposed so that Conductor can query the overall status of Mooncake-Store in real time, such as storage-tier (HBM/DRAM/SSD) usage, network throughput, and per-block access frequency. This provides the decision-making basis for Conductor's KVCache Balance Scheduler to perform operations like hot-spot migration and cold-data eviction. Alternatively, this Scheduler's policies could be implemented within the Master Service:
      • The advantage is a simpler design, fewer call layers, and no need to implement an efficient control-plane communication protocol (e.g., gRPC-based) between Conductor and Mooncake-Store; hot-spot migration and other KVCache scheduling behaviors could be handled directly by the Master Service.
      • The disadvantage is unclear functional module boundaries and tight coupling between scheduling and storage.
  3. I believe that, at this stage, a minimum viable prototype of Mooncake Conductor should focus on validating its core scheduling value: whether intelligent "1 Prefill + 1 Decode" (1P1D) instance pairing and cache awareness can significantly improve system efficiency with minimal overhead. This PoC deliberately omits advanced features such as the complex KVCache Balance Scheduler and predictive early rejection. Once the PoC succeeds, we will iterate toward more precise cache-aware and load-balancing algorithms, advanced cache management, and other complete functionality.
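
To illustrate the key-consistency problem described in point 2, here is one way the Conductor could reproduce an engine-style chained block hash from the tokenized prompt, assuming the engine's tokenizer is reachable (e.g., via vLLM's /tokenize endpoint) and the block size is known. The exact hashing scheme would still have to be matched to each engine's prefix-caching implementation; this is a sketch, not a proposed final format.

# Pseudocode sketch: chained block hashing over token IDs.

import hashlib

BLOCK_SIZE = 16  # illustrative; must equal the engine's configured KV block size

def block_hashes(token_ids, block_size=BLOCK_SIZE):
    hashes, parent = [], b""
    full_len = len(token_ids) - len(token_ids) % block_size  # full blocks only
    for start in range(0, full_len, block_size):
        block = token_ids[start:start + block_size]
        payload = parent + b"|" + ",".join(map(str, block)).encode()
        digest = hashlib.sha256(payload).hexdigest()
        hashes.append(digest)
        parent = digest.encode()
    return hashes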

I look forward to further suggestions and discussions!

Asher-XunZhang · Oct 29 '25 06:10

Great! I think the main challenge is aligning the KVCache states between the Conductor and the inference engines, since there's currently no external interface for controlling KVCache at the engine level.

Perhaps we could start with a PoC PR in this repo first, and then open a roadmap issue to invite more contributors to join. BTW, I'd prefer to create a new repo for better project management, but not right now.

chestnut-Q · Oct 30 '25 03:10

I'm glad to continue discussing this topic! On the problem of adapting different inference engines, I envision that we can follow the Transport abstraction in the Mooncake-Transfer-Engine component, so that the initial implementation provides only the vLLM adaptation as an example and other frameworks are adapted later. Based on these ideas, I have drafted a preliminary architecture:

flowchart LR
    Input["Request Arrives<br>Contains Prompt and SLO"]
    
    subgraph Proxy["Reverse Proxy Core"]
        direction TB
        RequestParser["Request Parser & Queue"]
        SchedulerRouter["Scheduler Router"]
        DynamicWeightManager["DynamicWeightManager"]
        DecisionEngine["Decision Engine"]
        
        RequestParser --> SchedulerRouter
        SchedulerRouter --> CacheAwareScheduler
        SchedulerRouter --> LoadBalanceScheduler
        SchedulerRouter --> OtherSchedulers["Other Schedulers"]
        
        CacheAwareScheduler -->|Metrics| DynamicWeightManager
        LoadBalanceScheduler -->|Metrics| DynamicWeightManager
        OtherSchedulers -->|Metrics| DynamicWeightManager
        
        DynamicWeightManager --> DecisionEngine
    end
    
    subgraph CacheAwareScheduler["Cache-Aware Scheduler (KVEventManager)"]
        direction LR
        KVEventController["KV Event Manager"]
        ZeroMQReceiver["ZeroMQ Subscriber<br>(Listens KVEvents)"]
        Indexer["Indexer<br>(RadixTree Data Structure)"]
        
        KVEventController -->|Get| ZeroMQReceiver
        KVEventController -->|Updates| Indexer
    end
    
    subgraph ExternalComponents["External Components"]
        direction TB
        subgraph vLLMInstances["vLLM Instances"]
            vLLMZMQPublisher["zmq Publisher"]
        end
        
        subgraph MooncakeStore["Mooncake Store"]
            MooncakeStoreZMQPublisher["zmq Publisher"]
        end
        
        ZeroMQReceiver -->|Subscribe KVEvents| vLLMZMQPublisher
        ZeroMQReceiver -->|Subscribe KVEvents| MooncakeStoreZMQPublisher
    end
    
    subgraph FinalDecision["Instance"]
        direction TB
        DecisionEngine --> vLLMInstances1["vLLM Instances 1"]
        DecisionEngine --> vLLMInstances2["vLLM Instances 2"]
    end
    
    Input --> RequestParser
    
    classDef proxy fill:#e3f2fd,stroke:#1e88e5
    classDef scheduler fill:#e8f5e8,stroke:#43a047
    classDef external fill:#fff8e1,stroke:#ffa000
    
    class Proxy proxy
    class CacheAwareScheduler,LoadBalanceScheduler,OtherSchedulers scheduler
    class ExternalComponents external
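
As a rough illustration of the Cache-Aware Scheduler box above, the snippet below subscribes to a ZeroMQ publisher and feeds KV events into a prefix index (reusing the GlobalPrefixIndex sketch from the RFC above). The endpoint, topic filter, and message schema are assumptions; the actual KV event format exposed by vLLM or Mooncake Store would need to be adapted.

# Pseudocode sketch: a KV-event subscriber that keeps the global index fresh.

import json
import zmq

def run_kv_event_subscriber(index, endpoint="tcp://127.0.0.1:5557"):
    ctx = zmq.Context.instance()
    sock = ctx.socket(zmq.SUB)
    sock.connect(endpoint)
    sock.setsockopt_string(zmq.SUBSCRIBE, "")  # subscribe to all topics

    while True:
        # Assumed event schema: {"type": ..., "block_hash": ..., "node_id": ...}
        event = json.loads(sock.recv())
        if event["type"] == "block_stored":
            index.on_block_stored(event["block_hash"], event["node_id"])
        elif event["type"] == "block_evicted":
            index.on_block_evicted(event["block_hash"], event["node_id"])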

Asher-XunZhang · Oct 30 '25 07:10

There are lots of potential algorithms for the Global Scheduler. The most important thing is providing a way to look up the cache (i.e. query_global_prefix_tree); I suggest Mooncake provide a FullLookup feature like the PR in LMCache.

chickeyton · Oct 31 '25 02:10

Looks cool! I will implement this idea in the Global Cache Hit Distribution Collector module.

Asher-XunZhang · Oct 31 '25 09:10

I completely agree. This is also part of the aforementioned Mooncake Store modifications. Contributions are welcome!

chestnut-Q · Oct 31 '25 11:10

Currently, there are no histograms or other visualizations for metrics such as KVCache hit rate. We believe it is necessary to add a module to Conductor that manages node and Mooncake metrics, exposes them in a Prometheus-compatible format, and supports common visualization tooling.
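
For reference, a minimal sketch of such a metrics module using the prometheus_client library is shown below; the metric names, labels, and buckets are placeholders rather than a proposed final schema.

# Pseudocode sketch: Conductor-side metrics exposed in Prometheus format.

from prometheus_client import Gauge, Histogram, start_http_server

PREFIX_HIT_RATIO = Histogram(
    "conductor_prefix_cache_hit_ratio",
    "Per-request fraction of prompt blocks served from reused KVCache",
    buckets=[0.0, 0.1, 0.25, 0.5, 0.75, 0.9, 1.0],
)
DECODE_LOAD = Gauge(
    "conductor_decode_instance_load",
    "Current load score per Decode instance",
    ["instance_id"],
)

def record_request(matched_blocks, total_blocks):
    PREFIX_HIT_RATIO.observe(matched_blocks / max(total_blocks, 1))

if __name__ == "__main__":
    start_http_server(9400)  # scrape target for Prometheus / Grafana dashboards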

Liziqi-77 · Dec 03 '25 03:12