[RoadMap] Mooncake Store V2
This is a roadmap for Mooncake Store V2! If you're interested in contributing to any item, please let us know!
If any items you're interested in are missing from the roadmap, your suggestions and input are highly encouraged! Please don't hesitate to comment in this thread, submit a feature request, or open an RFC.
Mooncake Store V2
- LMCache Adaptor
- [ ] KVCache Reuse between prefill nodes
- [x] HTTP notification protocols (KVAdmitMsg/KVEvictMsg) with LMCache Controller @xiaguan (see the sketch after this list)
- [x] Mooncake-based BufferAllocator in LMCache layer to allocate MemoryObj (zero-copy) https://github.com/LMCache/LMCache/pull/642
- SGLang Adaptor https://github.com/sgl-project/sglang/pull/7211
- HiCache support https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/mem_cache/storage/mooncake_store/README.md
- Mooncake Client
- [ ] Client-side Buffer to reduce metadata footprint
- [x] Batch Put/Get interfaces for layer-wise KVCache transfer. @zhaoyongke @xinranwang17 #380
- [x] Soft-pin mechanism to pin KVCache in the indicated node/layers. #587
- [x] Flexible APIs: removeAll, migrate #355
- [x] Implementing the basic metrics system on the client side. #733 #738
- [ ] User-friendly configuration method (using a global YAML config or avoiding configuration entirely)
- Mooncake Master
- [x] Lease designs to avoid conflicts between get/remove #374
- [x] High availability service with extra protocols #587 @ykwd
- [x] KVCache eviction mechanism #374 #287
- [x] 3FS backend for KVCache persistence. #437 #610 #690 @SgtPepperr
- Multi-layer Storage
- [ ] VRAM storage support (GDR-based zero-copy)
- [x] GPU-based VRAM pool @XucSh #710
- [ ] SSD storage support #968
- Benchmark and CI/CD
- [ ] KVCache workload emulation benchmark
- [ ] Stress benchmark for high concurrency
- Revise Website Documentation
- [x] Create the Python binding documentation.
- [x] Create the Mooncake Master documentation.
- [x] Create documentation to assist users in integrating with LMCache. @XucSh #385
- Deployment
- [ ] User-friendly configuration method (using a global YAML config or avoiding configuration entirely)
- [x] Native K8s Helm chart for running the store (including P, D, and master). https://github.com/sgl-project/rbg/pull/75
- [x] Integrating etcd service into mooncake store
- Fault-tolerance and High Availability (HA)
- [x] Master failover: multiple backup instances, electing a new leader when the old one fails #451
- [ ] Master failover: kv metadata persistency #760
- [x] Client failover: tolerate client crashes and network partitions #501
- [x] Ensure correctness of get and put operations #374 #778 #993
- [x] Move stable HA-features (that do not depend on etcd) to non-HA mode #845
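For the LMCache Adaptor notification item above, here is a minimal sketch of what the HTTP KVAdmitMsg/KVEvictMsg notifications could look like. The endpoint URL, the message fields, and the `notify` helper are assumptions for illustration, not the actual LMCache Controller API:

```python
import json
import urllib.request

# Hypothetical controller endpoint; the real KVAdmitMsg/KVEvictMsg
# definitions live in the LMCache Controller, and the fields here are
# illustrative only.
CONTROLLER_URL = "http://lmcache-controller:9000/kv_events"

def notify(event: str, key: str, instance_id: str) -> None:
    """Send a KVAdmitMsg-style ('admit') or KVEvictMsg-style ('evict') event."""
    msg = {"type": event, "key": key, "instance_id": instance_id}
    req = urllib.request.Request(
        CONTROLLER_URL,
        data=json.dumps(msg).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        resp.read()  # the controller acknowledges the notification

# After a successful put into the store:
#   notify("admit", key="prefix-hash-abc123", instance_id="prefill-0")
# When the store evicts the entry:
#   notify("evict", key="prefix-hash-abc123", instance_id="prefill-0")
```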
Great! BTW, is cache-aware scheduler in the roadmap?
Good idea! Is the Cache-aware scheduler on the master side or another implementation?
Both: the Mooncake master provides key location information for a global scheduler. I'll provide a separate RFC later.
- KVCache Reuse between prefill nodes
@stmatengss Hi, I wanna try it, thanks~
Cool! Hope to see your PR soon!
High availability service with extra protocols / Integrating etcd service into mooncake store
Hi, I wanna try the High Availability feature. Since this feature may depend on an etcd server, these two items could be solved in the same PR.
- SGLang Adaptor
We'll take this @huangtingwei9988
Hi @ykwd @stmatengss, in the master's high availability design, is there a plan to implement persistence of various meta information?
Yes. Supporting KV metadata persistence is on our roadmap.
Hi, may I ask whether there has been any discussion or documentation concerning this approach? For example, would the metadata be stored in etcd? @ykwd
@SpecterCipher There’s no official documentation yet. In the meantime, we’ve had some informal discussions about the design. Using etcd to store KV metadata is not preferred due to performance concerns. Instead, we’re considering having each client store the metadata for the KVs it holds. This would make metadata persistence independent of etcd or any other external service. However, this approach hasn’t been finalized yet.
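To make that direction concrete, below is a minimal sketch assuming each client keeps a local snapshot file of the metadata for the KVs it holds; `KVMeta`, `ClientMetaStore`, and `report_to_master` are all hypothetical names, not Mooncake's actual API:

```python
import json
from dataclasses import dataclass, asdict
from pathlib import Path

# Hypothetical record for one KV object held by this client; the field
# names are illustrative, not Mooncake's actual metadata schema.
@dataclass
class KVMeta:
    key: str           # cache key (e.g. hash of the token prefix)
    size: int          # object size in bytes
    replica_addr: str  # where the replica lives on this client

class ClientMetaStore:
    """Persists metadata for locally held KVs, independent of etcd."""

    def __init__(self, path: str = "client_meta.json"):
        self._path = Path(path)
        self._meta: dict[str, KVMeta] = {}

    def record_put(self, meta: KVMeta) -> None:
        self._meta[meta.key] = meta
        self._flush()

    def record_remove(self, key: str) -> None:
        self._meta.pop(key, None)
        self._flush()

    def _flush(self) -> None:
        # Write-then-rename so a crash mid-flush leaves the old snapshot intact.
        tmp = self._path.with_suffix(".tmp")
        tmp.write_text(json.dumps({k: asdict(v) for k, v in self._meta.items()}))
        tmp.replace(self._path)

    def report_to_master(self) -> list[KVMeta]:
        # After a master failover, each client replays its snapshot so the
        # new leader can rebuild the global key -> location index.
        return list(self._meta.values())
```

The write-then-rename flush keeps each snapshot crash-consistent without any external coordination service, which matches the stated goal of staying independent of etcd.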
Type A
This is a one-to-one PD-disaggregated inference framework. Under this framework, the KVCache generated by GPU-a in the prefill node can only be transmitted to GPU-a in the decode node; in other words, GPU-a in the decode node can only access the memory of GPU-a in the prefill node.
Type B
This is a PD inference framework with a KVCache store. In this framework, the KVCache generated by the GPUs in the prefill node must first be transmitted to the KVCache store, and the framework preferentially serves that KVCache to idle decode GPUs based on how busy the GPUs in the decode node are.
May I ask which type of inference framework Mooncake belongs to?
In a P4D4 setup, is it the case that the cache of prefill node 0 can only be accessed by decode node 0, rather than by any arbitrary decode node, i.e., the cache is not shared across the entire cluster? Is my understanding correct?
I hope my question hasn't caused you too much trouble. We students are fascinated by the wisdom of the mooncake authors and have developed a great interest in it.
@tchaikov @misterwilliam @simpx @karya0
@Alan-D-Chen Both types are currently supported. The sglang + hicache + mooncake stack is Type A.
The vllm + lmcache + mooncake stack, as we have implemented it, is Type B.
Type A can also be achieved via vllm + mooncake_connector; related issue: https://github.com/kvcache-ai/Mooncake/pull/865
My understanding is that in the P4D4 scenario, neither Type A nor Type B is one-to-one. Instead, a proxy decides which decode node handles a request after prefill-1 finishes; for example, after prefill-1 completes, decode-2 might take over. In short, prefill and decode are two worker pools, and the proxy schedules between them.
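To illustrate the two-worker-pool scheduling described above, here is a minimal sketch of a proxy dispatching across independent prefill and decode pools. The `Proxy` and `Node` classes and the load metric are hypothetical, not any framework's actual scheduler:

```python
import itertools

# Hypothetical node handle; "busyness" stands in for whatever load metric
# (queue depth, KVCache pressure) a real proxy would track.
class Node:
    def __init__(self, name: str):
        self.name = name
        self.busyness = 0

    def __repr__(self) -> str:
        return f"{self.name}(load={self.busyness})"

class Proxy:
    """Two worker pools, prefill and decode. Any decode node may serve a
    request once its KVCache is in the shared store (Type B)."""

    def __init__(self, n_prefill: int = 4, n_decode: int = 4):
        self.prefill = [Node(f"prefill-{i}") for i in range(n_prefill)]
        self.decode = [Node(f"decode-{i}") for i in range(n_decode)]
        self._rr = itertools.cycle(range(n_prefill))

    def schedule(self) -> tuple:
        p = self.prefill[next(self._rr)]                # round-robin prefill
        d = min(self.decode, key=lambda n: n.busyness)  # least-loaded decode
        p.busyness += 1
        d.busyness += 1
        return p, d

proxy = Proxy()
# prefill-0 may be followed by any decode node, e.g. decode-2 if it is idle.
print(proxy.schedule())
```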