[RoadMap] Mooncake Store V2

Open stmatengss opened this issue 7 months ago • 15 comments

This is a roadmap for Mooncake Store V2! If you're interested in contributing to any item, please let us know!

If any items you’re interested in are missing from the roadmap, your suggestions and input are highly encouraged! Please don’t hesitate to comment in this thread, submit a feature request, or establish an RFC.


Mooncake Store V2

  1. LMCache Adaptor
    • [ ] KVCache Reuse between prefill nodes
    • [x] HTTP notification protocols (KVAdmitMsg/KVEvictMsg) with LMCache Controller @xiaguan
    • [x] Mooncake-based BufferAllocator in LMCache layer to allocate MemoryObj (zero-copy) https://github.com/LMCache/LMCache/pull/642
  2. SGLang Adaptor https://github.com/sgl-project/sglang/pull/7211
  3. HiCache support https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/mem_cache/storage/mooncake_store/README.md
  4. Mooncake Client
    • [ ] Client-side Buffer to reduce metadata footprint
    • [x] Batch Put/Get interfaces for layer-wise. @zhaoyongke @xinranwang17 #380
    • [x] Soft-pin mechanism to pin KVCache in the indicated node/layers. #587
    • [x] Flexible APIs: removeAll, migrate #355
    • [x] Implementing the basic metrics system on the client side. #733 #738
    • [ ] User-friendly configuration method (using a global YAML config or avoiding configuration altogether)
  5. Mooncake Master
    • [x] Lease designs to avoid conflicts between get/remove #374
    • [x] High availability service with extra protocols #587 @ykwd
    • [x] KVCache eviction mechanism #374 #287
    • [x] 3FS backend for KVCache persistence. #437 #610 #690 @SgtPepperr
  6. Multi-layer Storage
    • [ ] VRAM storage support (GDR-based zero-copy)
    • [x] GPU-based VRAM pool @XucSh #710
    • [ ] SSD storage support #968
  7. Benchmark and CI/CD
    • [ ] KVCache workload emulation benchmark
    • [ ] Stress benchmark for high concurrency
  8. Revise Website Documentation
    • [x] Create the Python binding documentation.
    • [x] Create the Mooncake Master documentation.
    • [x] Create documentation to assist users in integrating with LMCache. @XucSh #385
  9. Deployment
    • [ ] User-friendly configuration method (using a global YAML config or avoiding configuration altogether)
    • [x] Native K8s Helm chart for running the store (including P, D and master). https://github.com/sgl-project/rbg/pull/75
    • [x] Integrating etcd service into mooncake store
  10. Fault-tolerance and High Availability (HA)
    • [x] Master failover: multiple backup instances; elect a new leader when the old one fails #451
    • [ ] Master failover: KV metadata persistence #760
    • [x] Client failover: tolerate client crashes and network partitions #501
    • [x] Ensure correctness of get and put operations #374 #778 #993
    • [x] Move stable HA-features (that do not depend on etcd) to non-HA mode #845
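
As a rough illustration of the "lease designs to avoid conflicts between get/remove" item, the sketch below shows one way a lease could work. This is an assumption for discussion, not Mooncake's actual implementation: the `LeaseTable` class, its method names, and the TTL value are all hypothetical. A get grants a short lease on the key, and a remove is refused until that lease has expired.

```python
import time


class LeaseTable:
    """Toy metadata table (hypothetical): a get grants a lease on the key,
    and a remove is refused until that lease has expired."""

    def __init__(self, lease_ttl_s=0.2, clock=time.monotonic):
        self.lease_ttl_s = lease_ttl_s
        self.clock = clock
        self.lease_expiry = {}  # key -> time at which the latest lease ends

    def put(self, key):
        # A newly stored key has no outstanding lease.
        self.lease_expiry[key] = float("-inf")

    def get(self, key):
        if key not in self.lease_expiry:
            return False
        # Renew the lease so the reader has time to finish its transfer.
        self.lease_expiry[key] = self.clock() + self.lease_ttl_s
        return True

    def remove(self, key):
        if key not in self.lease_expiry:
            return False
        if self.clock() < self.lease_expiry[key]:
            return False  # a reader may still be using the object
        del self.lease_expiry[key]
        return True
```

Passing the clock in as a parameter keeps the sketch easy to test with a fake time source.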

stmatengss avatar May 19 '25 16:05 stmatengss

Great! BTW, is a cache-aware scheduler on the roadmap?

zhaoyongke avatar May 20 '25 01:05 zhaoyongke

Good idea! Would the cache-aware scheduler live on the master side, or be a separate implementation?

stmatengss avatar May 20 '25 02:05 stmatengss

Both. The Mooncake master provides key location information to the global scheduler. I'll post a separate RFC later.
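
To make the idea concrete, here is a minimal sketch of how a global scheduler might use key location information. It is only an illustration under stated assumptions: `pick_node`, the shape of `key_locations` (key mapped to the nodes holding it), and the "most cached keys wins" policy are all hypothetical and are not taken from Mooncake or the planned RFC.

```python
def pick_node(request_keys, key_locations, nodes):
    """Return the node that already holds the most of the request's KVCache keys.

    request_keys:  KVCache keys the request would like to reuse.
    key_locations: hypothetical master view, key -> iterable of node names.
    nodes:         candidate nodes, in tie-breaking order.
    """
    hits = {node: 0 for node in nodes}
    for key in request_keys:
        for node in key_locations.get(key, ()):
            if node in hits:
                hits[node] += 1
    # max() returns the first maximal element, so ties follow the order of `nodes`.
    return max(nodes, key=lambda node: hits[node])
```

A real scheduler would presumably also weigh node load and transfer cost; this sketch only counts cache hits.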

zhaoyongke avatar May 20 '25 02:05 zhaoyongke

support cache aware scheduler

zhaoyongke avatar May 20 '25 03:05 zhaoyongke

  • KVCache Reuse between prefill nodes

@stmatengss Hi, I wanna try it, thanks~

hzh0425 avatar May 21 '25 12:05 hzh0425

Cool! Hope to see your PR soon!

stmatengss avatar May 21 '25 14:05 stmatengss

  • High availability service with extra protocols
  • Integrating etcd service into mooncake store

Hi, I'd like to try the high availability feature. Since it may depend on an etcd server, these two items could be handled in the same PR.

ykwd avatar May 23 '25 06:05 ykwd

  1. SGLang Adaptor

We'll take this @huangtingwei9988

zhaoyongke avatar Jun 10 '25 09:06 zhaoyongke

Hi @ykwd @stmatengss, in the master's high availability design, is there a plan to implement persistence of various meta information?

wwq2333 avatar Aug 07 '25 01:08 wwq2333

Yes. Supporting KV metadata persistence is on our roadmap.

ykwd avatar Aug 07 '25 02:08 ykwd

Hi, may I ask whether there has been any discussion or documentation concerning this approach? For example, would the metadata be stored in etcd? @ykwd

SpecterCipher avatar Aug 11 '25 14:08 SpecterCipher

@SpecterCipher There’s no official documentation yet. In the meantime, we’ve had some informal discussions about the design. Using etcd to store KV metadata is not preferred due to performance concerns. Instead, we’re considering having each client store the metadata for the KVs it holds. This would make metadata persistence independent of etcd or any other external service. However, this approach hasn’t been finalized yet.
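
Since the design is not finalized, the following is only a sketch of the idea under stated assumptions: if each client keeps the metadata for the KVs it holds, a restarted master could rebuild its index from what the clients report. The function name `rebuild_master_index` and the report format (client ID mapped to a list of keys) are hypothetical.

```python
def rebuild_master_index(client_reports):
    """Merge per-client key lists into a key -> sorted list of holders index.

    client_reports: hypothetical format, client id -> iterable of keys that
    the client holds and has persisted metadata for.
    """
    index = {}
    for client_id, keys in client_reports.items():
        for key in keys:
            index.setdefault(key, set()).add(client_id)
    # Sort the holders so the rebuilt index is deterministic.
    return {key: sorted(holders) for key, holders in index.items()}
```

For example, `rebuild_master_index({"client-a": ["k1", "k2"], "client-b": ["k2"]})` records `k2` as held by both clients, with no dependency on etcd or another external store.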

ykwd avatar Aug 12 '25 04:08 ykwd

Image

Type A

This is a one-to-one PD-disaggregated inference framework. Here, the KVCache generated by GPU-a in the prefill node can only be transmitted to GPU-a in the decode node; in other words, GPU-a in the decode node can only access the memory of GPU-a in the prefill node.

Image

Type B

This is a PD inference framework with a KVCache store. Here, the KVCache generated by the GPUs in the prefill node must first be transmitted to the KVCache store. The framework then hands the KVCache from the store to decode GPUs, giving priority to idle ones based on how busy each GPU in the decode node is.

May I ask which type of inference framework Mooncake belongs to?

In a P4D4 deployment, is it correct that the cache of prefill node 0 can only be accessed by decode node 0, that an arbitrary decode node cannot directly access it, and that the cache is therefore not "shared across the entire cluster"?

Alan-D-Chen avatar Oct 14 '25 10:10 Alan-D-Chen

I hope my question hasn't caused you too much trouble. We students are fascinated by the wisdom of the mooncake authors and have developed a great interest in it.

@tchaikov @misterwilliam @simpx @karya0

Alan-D-Chen avatar Oct 14 '25 10:10 Alan-D-Chen

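@Alan-D-Chen Both types are currently supported. The sglang + hicache + mooncake setup is Type A.

The vllm + lmcache + mooncake setup, as we implemented it, is Type B.

Type A can also be achieved with vllm + mooncake_connector; see the related https://github.com/kvcache-ai/Mooncake/pull/865

My understanding is that in a P4D4 scenario, neither Type A nor Type B is a one-to-one mapping. Instead, a proxy decides which decode instance handles the request after prefill-1 finishes; for example, a request processed by prefill-1 may then be handled by decode-2. Put simply, these are two worker pools, and the proxy does the scheduling.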
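
A minimal sketch of the "two worker pools plus a proxy" picture, assuming a simple least-loaded policy. The `Proxy` class, its methods, and the policy are hypothetical illustrations, not the actual vllm or sglang proxy.

```python
class Proxy:
    """Toy proxy (hypothetical): chooses a prefill worker and a decode worker
    independently, each from its own pool, by current load."""

    def __init__(self, prefill_workers, decode_workers):
        self.prefill_workers = list(prefill_workers)
        self.decode_workers = list(decode_workers)
        # Number of in-flight requests per worker.
        self.load = {w: 0 for w in self.prefill_workers + self.decode_workers}

    def _least_loaded(self, pool):
        # min() returns the first minimal element, so ties follow pool order.
        return min(pool, key=lambda w: self.load[w])

    def route(self):
        """Assign a new request to one worker from each pool."""
        prefill = self._least_loaded(self.prefill_workers)
        decode = self._least_loaded(self.decode_workers)
        self.load[prefill] += 1
        self.load[decode] += 1
        return prefill, decode

    def finish(self, worker):
        """Record that a worker completed one request."""
        self.load[worker] -= 1
```

Because the two choices are independent, a request prefilled on `prefill-0` can end up on `decode-1`, which is the sense in which P4D4 is not a one-to-one pairing.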

tianlang-wq avatar Oct 24 '25 02:10 tianlang-wq