[RoadMap] Mooncake Store V2
This is a roadmap for Mooncake Store V2! If you're interested in contributing to any item, please let us know!
If any items you're interested in are missing from the roadmap, your suggestions and input are highly encouraged! Please don't hesitate to comment in this thread, submit a feature request, or open an RFC.
Mooncake Store V2
- LMCache Adaptor
- [ ] KVCache Reuse between prefill nodes
- [x] HTTP notification protocols (KVAdmitMsg/KVEvictMsg) with LMCache Controller @xiaguan (see the sketch after this list)
- [x] Mooncake-based BufferAllocator in LMCache layer to allocate MemoryObj (zero-copy) https://github.com/LMCache/LMCache/pull/642
- SGLang Adaptor https://github.com/sgl-project/sglang/pull/7211
- HiCache support https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/mem_cache/storage/mooncake_store/README.md
- Mooncake Client
- [ ] Client-side Buffer to reduce metadata footprint
- [x] Batch Put/Get interfaces for layer-wise KVCache transfer. @zhaoyongke @xinranwang17 #380
- [x] Soft-pin mechanism to pin KVCache in the indicated node/layers. #587
- [x] Flexible APIs: removeAll, migrate #355
- [x] Implementing the basic metrics system on the client side. #733 #738
- [ ] User-friendly configuration method (using a global YAML config or avoiding configuration entirely)
- Mooncake Master
- [x] Lease designs to avoid conflicts between get/remove #374
- [x] High availability service with extra protocols #587 @ykwd
- [x] KVCache eviction mechanism #374 #287
- [x] 3FS backend for KVCache persistence. #437 #610 #690 @SgtPepperr
- Multi-layer Storage
- [ ] VRAM storage support (GDR-based zero-copy)
- [x] GPU-based VRAM pool @XucSh #710
- [ ] SSD storage support #968
- Benchmark and CI/CD
- [ ] KVCache workload emulation benchmark
- [ ] Stress benchmark for high concurrency
- Revise Website Documentation
- [x] Create the Python binding documentation.
- [x] Create the Mooncake Master documentation.
- [x] Create documentation to assist users in integrating with LMCache. @XucSh #385
- Deployment
- [ ] User-friendly configuration method (using a global YAML config or avoiding configuration entirely)
- [x] Native K8s Helm chart for running the store (including P, D, and master). https://github.com/sgl-project/rbg/pull/75
- [x] Integrating etcd service into mooncake store
- Fault-tolerance and High Availability (HA)
- [x] Master failover: multiple backup instances, electing a new leader when the old one fails #451
- [ ] Master failover: kv metadata persistency #760
- [x] Client failover: tolerate client crashes and network partitions #501
- [x] Ensure correctness of get and put operations #374 #778 #993
- [x] Move stable HA-features (that do not depend on etcd) to non-HA mode #845
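For the LMCache Adaptor notification item above, here is a minimal sketch of what the HTTP KVAdmitMsg/KVEvictMsg notifications could look like. The endpoint URL, the message fields, and the `notify` helper are assumptions for illustration, not the actual LMCache Controller API:

```python
import json
import urllib.request

# Hypothetical controller endpoint; the real KVAdmitMsg/KVEvictMsg
# definitions live in the LMCache Controller, and the fields here are
# illustrative only.
CONTROLLER_URL = "http://lmcache-controller:9000/kv_events"

def notify(event: str, key: str, instance_id: str) -> None:
    """Send a KVAdmitMsg-style ('admit') or KVEvictMsg-style ('evict') event."""
    msg = {"type": event, "key": key, "instance_id": instance_id}
    req = urllib.request.Request(
        CONTROLLER_URL,
        data=json.dumps(msg).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        resp.read()  # the controller acknowledges the notification

# After a successful put into the store:
#   notify("admit", key="prefix-hash-abc123", instance_id="prefill-0")
# When the store evicts the entry:
#   notify("evict", key="prefix-hash-abc123", instance_id="prefill-0")
```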
Great! BTW, is cache-aware scheduler in the roadmap?
Good idea! Is the Cache-aware scheduler on the master side or another implementation?
Both: the Mooncake master provides key location information for a global scheduler. I'll provide a separate RFC later.
- KVCache Reuse between prefill nodes
@stmatengss Hi, I wanna try it, thanks~
Cool! Hope to see your PR soon!
High availability service with extra protocols / Integrating etcd service into mooncake store
Hi, I wanna try the High Availability feature. Since this feature may depend on an etcd server, these two items could be solved in the same PR.
- SGLang Adaptor
We'll take this @huangtingwei9988
Hi @ykwd @stmatengss, in the master's high availability design, is there a plan to implement persistence of various meta information?
Yes. Supporting KV metadata persistence is on our roadmap.
Hi, may I ask whether there has been any discussion or documentation concerning this approach? For example, would the metadata be stored in etcd? @ykwd
@SpecterCipher There’s no official documentation yet. In the meantime, we’ve had some informal discussions about the design. Using etcd to store KV metadata is not preferred due to performance concerns. Instead, we’re considering having each client store the metadata for the KVs it holds. This would make metadata persistence independent of etcd or any other external service. However, this approach hasn’t been finalized yet.
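To make that direction concrete, below is a minimal sketch assuming each client keeps a local snapshot file of the metadata for the KVs it holds; `KVMeta`, `ClientMetaStore`, and `report_to_master` are all hypothetical names, not Mooncake's actual API:

```python
import json
from dataclasses import dataclass, asdict
from pathlib import Path

# Hypothetical record for one KV object held by this client; the field
# names are illustrative, not Mooncake's actual metadata schema.
@dataclass
class KVMeta:
    key: str           # cache key (e.g. hash of the token prefix)
    size: int          # object size in bytes
    replica_addr: str  # where the replica lives on this client

class ClientMetaStore:
    """Persists metadata for locally held KVs, independent of etcd."""

    def __init__(self, path: str = "client_meta.json"):
        self._path = Path(path)
        self._meta: dict[str, KVMeta] = {}

    def record_put(self, meta: KVMeta) -> None:
        self._meta[meta.key] = meta
        self._flush()

    def record_remove(self, key: str) -> None:
        self._meta.pop(key, None)
        self._flush()

    def _flush(self) -> None:
        # Write-then-rename so a crash mid-flush leaves the old snapshot intact.
        tmp = self._path.with_suffix(".tmp")
        tmp.write_text(json.dumps({k: asdict(v) for k, v in self._meta.items()}))
        tmp.replace(self._path)

    def report_to_master(self) -> list[KVMeta]:
        # After a master failover, each client replays its snapshot so the
        # new leader can rebuild the global key -> location index.
        return list(self._meta.values())
```

The write-then-rename flush keeps each snapshot crash-consistent without any external coordination service, which matches the stated goal of staying independent of etcd.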
Type A
This is a one-to-one PD-disaggregated inference framework. Under this framework, the KVCache generated by GPU-a in the prefill node can only be transmitted to GPU-a in the decode node; in other words, GPU-a in the decode node can only access the memory of GPU-a in the prefill node.
Type B
This is a PD inference framework with a KVCache store. In this framework, the KVCache generated by the GPUs in the prefill node must first be transmitted to the KVCache store, and the framework preferentially serves that KVCache to idle decode GPUs based on how busy the GPUs in the decode node are.
May I ask which type of inference framework Mooncake belongs to?
In a P4D4 setup, is it the case that the cache of prefill node 0 can only be accessed by decode node 0, rather than by any arbitrary decode node, i.e., the cache is not shared across the entire cluster? Is my understanding correct?
I hope my question hasn't caused you too much trouble. We students are fascinated by the wisdom of the mooncake authors and have developed a great interest in it.
@tchaikov @misterwilliam @simpx @karya0
@Alan-D-Chen Both types are currently supported. The sglang + hicache + mooncake stack is Type A.
The vllm + lmcache + mooncake stack, as we have implemented it, is Type B.
Type A can also be achieved via vllm + mooncake_connector; related issue: https://github.com/kvcache-ai/Mooncake/pull/865
My understanding is that in the P4D4 scenario, neither Type A nor Type B is one-to-one. Instead, a proxy decides which decode node handles a request after prefill-1 finishes; for example, after prefill-1 completes, decode-2 might take over. In short, prefill and decode are two worker pools, and the proxy schedules between them.
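To illustrate the two-worker-pool scheduling described above, here is a minimal sketch of a proxy dispatching across independent prefill and decode pools. The `Proxy` and `Node` classes and the load metric are hypothetical, not any framework's actual scheduler:

```python
import itertools

# Hypothetical node handle; "busyness" stands in for whatever load metric
# (queue depth, KVCache pressure) a real proxy would track.
class Node:
    def __init__(self, name: str):
        self.name = name
        self.busyness = 0

    def __repr__(self) -> str:
        return f"{self.name}(load={self.busyness})"

class Proxy:
    """Two worker pools, prefill and decode. Any decode node may serve a
    request once its KVCache is in the shared store (Type B)."""

    def __init__(self, n_prefill: int = 4, n_decode: int = 4):
        self.prefill = [Node(f"prefill-{i}") for i in range(n_prefill)]
        self.decode = [Node(f"decode-{i}") for i in range(n_decode)]
        self._rr = itertools.cycle(range(n_prefill))

    def schedule(self) -> tuple:
        p = self.prefill[next(self._rr)]                # round-robin prefill
        d = min(self.decode, key=lambda n: n.busyness)  # least-loaded decode
        p.busyness += 1
        d.busyness += 1
        return p, d

proxy = Proxy()
# prefill-0 may be followed by any decode node, e.g. decode-2 if it is idle.
print(proxy.schedule())
```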