
[Bug]: The MLA implementation brings no speedup

Open foamliu opened this issue 11 months ago • 0 comments

Is there an existing issue?

  • [X] I have searched, and there is no existing issue.

Describe the bug

MLA (multi-head latent attention) was implemented to speed up inference, but because the data written to the KV cache is larger than in the baseline (Llama), it not only brings no benefit, it actually uses more GPU memory and runs slower than the baseline (Llama).
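To put the report in numbers, here is a back-of-envelope per-token, per-layer KV-cache comparison. It is a sketch only: the dimensions are assumptions drawn from typical DeepSeek-V3-style MLA configs and a Llama-style GQA baseline, and fp16 storage is assumed.

```python
# Rough per-token, per-layer KV-cache element counts.
# All dimensions below are illustrative assumptions, not measured values.
BYTES_PER_ELEM = 2  # fp16

# Naive MLA cache (full up-projected per-head K and V are stored).
n_heads = 128
qk_head_dim = 128 + 64          # qk_nope_head_dim + qk_rope_head_dim
v_head_dim = 128
mla_naive = n_heads * (qk_head_dim + v_head_dim)   # 40960 elems

# Llama-style GQA baseline: 8 shared KV heads of dim 128, K and V.
n_kv_heads, head_dim = 8, 128
llama_gqa = n_kv_heads * head_dim * 2              # 2048 elems

# What MLA is designed to cache: the compressed latent plus the RoPE key.
kv_lora_rank, rope_dim = 512, 64
mla_compressed = kv_lora_rank + rope_dim           # 576 elems

for name, elems in [("naive MLA", mla_naive),
                    ("Llama GQA", llama_gqa),
                    ("compressed MLA", mla_compressed)]:
    print(f"{name}: {elems * BYTES_PER_ELEM} bytes/token/layer")
```

Under these assumptions the naive cache is roughly 20x larger than the Llama GQA baseline, while the compressed latent would be several times smaller than the baseline, which matches the behavior described above.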

To Reproduce

Benchmark inference speed.

Expected behavior

Faster inference than the baseline.

Screenshots

Below is the MLA implementation from the official DeepSeekV3 repo on Hugging Face. As it shows, the amount of data written to the KV cache is larger than in the baseline (Llama). [screenshot]
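The shape-level difference can be sketched as follows. This is a plain-Python illustration of the two caching strategies, not the actual HF code; the dimensions are assumptions in the DeepSeek-V3 style.

```python
# Shape bookkeeping only: contrast what the naive implementation caches
# (up-projected per-head K/V) with what MLA is designed to cache (the
# shared compressed latent plus the decoupled RoPE key). All dims are
# illustrative assumptions.

def naive_cached_shapes(bsz, seq, n_heads=128, qk_head_dim=192, v_head_dim=128):
    """Naive path: full per-head key and value states go into the cache."""
    key_states = (bsz, n_heads, seq, qk_head_dim)     # nope + rope dims
    value_states = (bsz, n_heads, seq, v_head_dim)
    return key_states, value_states

def latent_cached_shape(bsz, seq, kv_lora_rank=512, rope_dim=64):
    """Intended path: only the compressed latent (+ RoPE key) is cached."""
    return (bsz, seq, kv_lora_rank + rope_dim)

def numel(shape):
    n = 1
    for d in shape:
        n *= d
    return n

k, v = naive_cached_shapes(1, 1)
ratio = (numel(k) + numel(v)) / numel(latent_cached_shape(1, 1))
print(f"naive cache is ~{ratio:.0f}x the intended latent cache per token")
```

With these assumed dimensions the naive path stores roughly 70x more per token than the latent path, which is why caching the up-projected tensors forfeits MLA's memory advantage.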

Environment

- OS: Ubuntu 22.04
- Pytorch: torch 2.4.0
- CUDA: CUDA 12.1
- Device: A800

Additional context

[screenshot]

foamliu · Jan 01 '25