[Feature]: MLA Support
🚀 The feature, motivation and pitch
DeepSeek-V2 introduces MLA (Multi-head Latent Attention), which uses low-rank joint compression of keys and values to remove the inference-time key-value cache bottleneck, enabling efficient inference.
Can vLLM support MLA for accelerated inference?
```bibtex
@misc{deepseek-v2,
  author = {DeepSeek-AI},
  title  = {DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model},
  year   = {2024},
  note   = {GitHub repository},
  url    = {https://github.com/deepseek-ai/deepseek-v2}
}
```
https://huggingface.co/deepseek-ai/DeepSeek-V2-Chat
https://github.com/deepseek-ai/DeepSeek-V2/blob/main/deepseek-v2-tech-report.pdf
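For context, below is a minimal PyTorch sketch of the low-rank key-value joint compression idea behind MLA. The dimensions, module names, and structure are illustrative assumptions (no RoPE decoupling, no query compression) and do not reflect the exact DeepSeek-V2 configuration or any vLLM implementation:

```python
# Simplified illustration of MLA-style KV compression: cache a small shared
# latent per token instead of full per-head keys and values.
import torch
import torch.nn as nn


class SimplifiedMLACache(nn.Module):
    def __init__(self, hidden_size=5120, num_heads=32, head_dim=128, kv_lora_rank=512):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = head_dim
        # Down-projection: hidden state -> shared compressed KV latent c_KV.
        self.kv_a_proj = nn.Linear(hidden_size, kv_lora_rank, bias=False)
        # Up-projection: c_KV -> per-head keys and values, applied at attention time.
        self.kv_b_proj = nn.Linear(kv_lora_rank, num_heads * head_dim * 2, bias=False)

    def compress(self, hidden_states):
        # Only this latent (kv_lora_rank floats per token) needs to be cached,
        # instead of num_heads * head_dim * 2 floats per token for standard MHA.
        return self.kv_a_proj(hidden_states)

    def expand(self, c_kv):
        # Reconstruct full multi-head K/V from the cached latent.
        bsz, seq_len, _ = c_kv.shape
        kv = self.kv_b_proj(c_kv)
        kv = kv.view(bsz, seq_len, self.num_heads, 2 * self.head_dim)
        k, v = kv.split(self.head_dim, dim=-1)
        return k, v


if __name__ == "__main__":
    mla = SimplifiedMLACache()
    x = torch.randn(1, 16, 5120)      # [batch, seq, hidden]
    c_kv = mla.compress(x)            # cached latent: [1, 16, 512]
    k, v = mla.expand(c_kv)           # each [1, 16, 32, 128]
    full_kv_floats = 2 * 32 * 128     # per-token KV cache size for standard MHA
    print(f"per-token cache: {c_kv.shape[-1]} vs {full_kv_floats} floats")
```

With these illustrative sizes, the per-token cache shrinks from 8192 floats (full multi-head K/V) to 512 floats (the compressed latent), which is the memory saving that motivates MLA support for inference.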
Alternatives
No response
Additional context
No response
mark
ref https://github.com/vllm-project/vllm/pull/4650#issuecomment-2297051077