[Feature] Support LMCache
Motivation
LMDeploy has incredible throughput on smaller GPUs; it far outperforms vLLM with a much simpler setup (IMO). Because of this it is a favourite for smaller GPUs, exactly the kind of hardware where offloading the KV cache to CPU / NVMe would be beneficial.
Supporting LMCache would be great for this (and the name matches!).
https://docs.lmcache.ai/
vLLM and other inference servers already support this (or have implemented their own equivalent).
Can we get this added?
Related resources
https://docs.vllm.ai/en/stable/examples/others/lmcache.html?h=lmcache
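For context, the linked example wires LMCache into vLLM roughly like this (a sketch based on that page; the connector name, environment variables and model are taken from the example and may differ by version):

```python
# Sketch of the vLLM-side hookup from the linked example (not LMDeploy code).
# Connector name and env vars follow that page and may change across versions.
import os

from vllm import LLM, SamplingParams
from vllm.config import KVTransferConfig

os.environ["LMCACHE_CHUNK_SIZE"] = "256"          # KV chunk granularity (tokens)
os.environ["LMCACHE_LOCAL_CPU"] = "True"          # enable the CPU-RAM backend
os.environ["LMCACHE_MAX_LOCAL_CPU_SIZE"] = "5.0"  # GB of host memory for offloaded KV

llm = LLM(
    model="mistralai/Mistral-7B-Instruct-v0.2",   # example model from the doc
    kv_transfer_config=KVTransferConfig(
        kv_connector="LMCacheConnectorV1", kv_role="kv_both"
    ),
    gpu_memory_utilization=0.8,
)
print(llm.generate(["Hello, how are you?"], SamplingParams(max_tokens=32)))
```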
Additional context
No response
Thank you for your positive feedback on LMDeploy. We will look into LMCache and update you as soon as we have news.
Could you elaborate a bit more on why support for LMCache is necessary? From my point of view, LMDeploy already has:
- a built-in KV cache management system with prefix caching and a number of advanced memory-management features (see the sketch after this list).
- a unified KV Cache migration abstraction with Mooncake and DLSlime backends to support Prefill/Decode Disaggregation.
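For reference, a minimal sketch of the knobs the built-in path already exposes (the model name is just an example; please verify the exact field names against the docs for your version):

```python
# Minimal sketch of LMDeploy's existing KV-cache configuration;
# field names follow the current docs, verify for your version.
from lmdeploy import pipeline, TurbomindEngineConfig

pipe = pipeline(
    "internlm/internlm2_5-7b-chat",     # example model
    backend_config=TurbomindEngineConfig(
        enable_prefix_caching=True,     # reuse KV blocks of shared prefixes
        cache_max_entry_count=0.8,      # fraction of free VRAM reserved for KV cache
        quant_policy=8,                 # optional: int8 KV cache to stretch VRAM
    ),
)
print(pipe(["Hi, please introduce yourself."]))
```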
Is there anything I am missing?
Does the existing system offer the ability to offload layers to CPU and have the KV cache shared efficiently? When I last tried, it didn't.
In my opinion, the reason LMDeploy does not provide KV cache offloading to CPU memory is its poor performance. I do not think LMCache is the only way to implement it; it would be much simpler to implement directly in LMDeploy, but I cannot find any real use for it given the poor performance.
The poor performance is relative though. Take the RTX 6000: if you have a 110 GB model that you want to use with its 96 GB of VRAM, offloading that final amount is negligible relative to the raw processing power of the GPU itself.
Also, for MoE models this would actually be very useful if the active experts can remain loaded on the GPU: the shuttling takes time, but the actual processing still happens on the GPU, right? I think llama.cpp does this with the --n-cpu-moe option.
That is about the layers, not the KV cache. The speed of offloading KV can then be improved across multiple GPUs by a layer like LMCache, can't it? My understanding is that most of the inefficiency comes from shuttling it back and forth, leading to single-digit tok/s values.
Plus, when DDR6 becomes a thing, I think you would want this ready for it. If LMCache isn't the way, then implementing this in LMDeploy itself would be great, but it does seem like LMCache already does this, and that's why vLLM is using it.
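To put rough numbers on that shuttling cost (back-of-envelope only; the model shape and PCIe bandwidth below are assumptions, not measurements of LMDeploy):

```python
# Back-of-envelope transfer cost for CPU-offloaded KV cache; every number here
# is an assumption (GQA 70B-class shape, fp16 KV, ~25 GB/s effective PCIe 4.0 x16).
layers, kv_heads, head_dim, elem_bytes = 80, 8, 128, 2
kv_bytes_per_token = 2 * layers * kv_heads * head_dim * elem_bytes  # K and V
context_len = 4096
total_bytes = kv_bytes_per_token * context_len
pcie_bw = 25e9  # bytes/s

print(f"{kv_bytes_per_token / 1024:.0f} KiB per token, "
      f"{total_bytes / 1e9:.2f} GB for {context_len} tokens, "
      f"{total_bytes / pcie_bw * 1e3:.0f} ms to move it once over PCIe")
# ~320 KiB/token, ~1.34 GB, ~54 ms: tolerable as a one-off prefix-cache fetch,
# but shuttling the whole context every decode step caps you below ~20 tok/s
# before any compute happens.
```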
As far as I know, LMDeploy has already implemented a wide range of overlapping optimizations in both its PyTorch and Turbomind backends. Honestly, I can’t imagine how much performance would suffer if we introduced host-device synchronization into the mix.
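To make that concern concrete, any offload path has to hide device-to-host copies behind the decode stream and only synchronize when an evicted block is actually reused; a purely illustrative PyTorch sketch of that pattern (requires CUDA, and none of these names are LMDeploy APIs):

```python
# Purely illustrative: async, pinned-memory device-to-host copy on a side stream,
# so the default (decode) stream only blocks when the block is needed again.
import torch

copy_stream = torch.cuda.Stream()
kv_block_gpu = torch.randn(64, 8, 128, 128, device="cuda", dtype=torch.float16)
kv_block_cpu = torch.empty(kv_block_gpu.shape, dtype=kv_block_gpu.dtype,
                           device="cpu", pin_memory=True)

copy_stream.wait_stream(torch.cuda.current_stream())     # KV block must be produced first
with torch.cuda.stream(copy_stream):
    kv_block_cpu.copy_(kv_block_gpu, non_blocking=True)   # async device-to-host copy

# ... decode kernels keep running on the default stream ...

torch.cuda.current_stream().wait_stream(copy_stream)      # synchronize only when reused
```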
When comparing LMDeploy with other inference frameworks like llama.cpp, I believe we might be aiming at slightly different goals (feel free to correct me if I’m wrong). LMDeploy is focused on being production-ready and delivering industry-leading performance, whereas frameworks like llama.cpp tend to prioritize ease of use and are heavily optimized for personal or low-resource environments. So, keeping performance as a top priority for LMDeploy seems like the right approach.
That said, I’m not categorically opposed to integrating LMCache. In fact, I’d be glad to see someone from the community with deep knowledge of LMCache step up and propose a well-thought-out integration into LMDeploy’s existing PD (prefill-decode) separation abstraction. My main concern lies with cache offloading as it stands today—its maturity level could very well introduce performance regressions that outweigh the benefits, especially considering how much the LMDeploy community values performance above all else.