[Feature]: MLA Support
🚀 The feature, motivation and pitch
DeepSeek-V2 introduces MLA (Multi-head Latent Attention), which uses low-rank joint compression of keys and values to remove the inference-time key-value cache bottleneck, enabling efficient inference.
Can vLLM support MLA for accelerated inference?
```bibtex
@misc{deepseek-v2,
  author = {DeepSeek-AI},
  title  = {DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model},
  year   = {2024},
  note   = {GitHub repository},
  url    = {https://github.com/deepseek-ai/deepseek-v2}
}
```
https://huggingface.co/deepseek-ai/DeepSeek-V2-Chat
https://github.com/deepseek-ai/DeepSeek-V2/blob/main/deepseek-v2-tech-report.pdf
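For context, below is a minimal PyTorch sketch of the low-rank key-value joint compression idea behind MLA. The dimensions, module names, and structure are illustrative assumptions (no RoPE decoupling, no query compression) and do not reflect the exact DeepSeek-V2 configuration or any vLLM implementation:

```python
# Simplified illustration of MLA-style KV compression: cache a small shared
# latent per token instead of full per-head keys and values.
import torch
import torch.nn as nn


class SimplifiedMLACache(nn.Module):
    def __init__(self, hidden_size=5120, num_heads=32, head_dim=128, kv_lora_rank=512):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = head_dim
        # Down-projection: hidden state -> shared compressed KV latent c_KV.
        self.kv_a_proj = nn.Linear(hidden_size, kv_lora_rank, bias=False)
        # Up-projection: c_KV -> per-head keys and values, applied at attention time.
        self.kv_b_proj = nn.Linear(kv_lora_rank, num_heads * head_dim * 2, bias=False)

    def compress(self, hidden_states):
        # Only this latent (kv_lora_rank floats per token) needs to be cached,
        # instead of num_heads * head_dim * 2 floats per token for standard MHA.
        return self.kv_a_proj(hidden_states)

    def expand(self, c_kv):
        # Reconstruct full multi-head K/V from the cached latent.
        bsz, seq_len, _ = c_kv.shape
        kv = self.kv_b_proj(c_kv)
        kv = kv.view(bsz, seq_len, self.num_heads, 2 * self.head_dim)
        k, v = kv.split(self.head_dim, dim=-1)
        return k, v


if __name__ == "__main__":
    mla = SimplifiedMLACache()
    x = torch.randn(1, 16, 5120)      # [batch, seq, hidden]
    c_kv = mla.compress(x)            # cached latent: [1, 16, 512]
    k, v = mla.expand(c_kv)           # each [1, 16, 32, 128]
    full_kv_floats = 2 * 32 * 128     # per-token KV cache size for standard MHA
    print(f"per-token cache: {c_kv.shape[-1]} vs {full_kv_floats} floats")
```

With these illustrative sizes, the per-token cache shrinks from 8192 floats (full multi-head K/V) to 512 floats (the compressed latent), which is the memory saving that motivates MLA support for inference.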
Alternatives
No response
Additional context
No response
mark
ref https://github.com/vllm-project/vllm/pull/4650#issuecomment-2297051077