[Feature] Support DeepSeek-V2 Model
### Motivation

@lvhan028 @grimoire @lzhangzz Do you have plans to support the DeepSeek-V2 model?

### Related resources

_No response_

### Additional context

_No response_
Support of DeepSeek-V2 might take a long time:

- The MoE module has not been optimized yet. I am working on a fused kernel like vLLM's (see the sketch after this list).
- The model has so many weights that loading it and distributing it across ranks takes too much time. I need to find a new way to load the TP (tensor-parallel) model, which might lead to another refactor.
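To illustrate what a fused kernel replaces, here is a naive top-k MoE forward pass. This is a toy sketch with made-up shapes, not LMDeploy or vLLM code; the point is that the Python loop over experts launches many small GEMMs, which a fused kernel collapses into grouped computation over sorted token-expert pairs.

```python
import torch
import torch.nn.functional as F

def naive_moe(x, gate_w, expert_ws, top_k=2):
    # x: [n_tokens, hidden]; gate_w: [hidden, n_experts]
    # expert_ws: list of [hidden, hidden] toy expert weights
    probs = F.softmax(x @ gate_w, dim=-1)        # router probabilities
    topk_p, topk_i = probs.topk(top_k, dim=-1)   # per-token expert choices
    out = torch.zeros_like(x)
    for e, w in enumerate(expert_ws):            # per-expert loop: many small GEMMs
        tok, slot = (topk_i == e).nonzero(as_tuple=True)
        if tok.numel():                          # weight each expert output by its gate prob
            out[tok] += topk_p[tok, slot, None] * (x[tok] @ w)
    return out

# toy usage
x = torch.randn(16, 64)
y = naive_moe(x, torch.randn(64, 8), [torch.randn(64, 64) for _ in range(8)])
```

A fused implementation instead sorts tokens by assigned expert and runs the expert GEMMs in one batched launch, avoiding the per-expert kernel-launch and gather/scatter overhead above.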
In addition, DeepSeek-V2 uses MLA (Multi-head Latent Attention) instead of traditional MHA or GQA, so achieving the best performance will also take some time.
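For readers unfamiliar with MLA, here is a minimal sketch of the compression idea from the DeepSeek-V2 paper. The dimensions are illustrative, not DeepSeek-V2's actual configuration, and this omits the decoupled RoPE dimensions the real model adds:

```python
import torch

hidden, n_heads, head_dim, kv_rank = 4096, 32, 128, 512  # toy sizes

w_dkv = torch.randn(hidden, kv_rank)             # shared KV down-projection
w_uk = torch.randn(kv_rank, n_heads * head_dim)  # per-head K up-projection
w_uv = torch.randn(kv_rank, n_heads * head_dim)  # per-head V up-projection

h = torch.randn(hidden)                          # one token's hidden state
c_kv = h @ w_dkv                                 # cache this: kv_rank floats
k = (c_kv @ w_uk).view(n_heads, head_dim)        # rebuilt at attention time
v = (c_kv @ w_uv).view(n_heads, head_dim)

# MHA would cache k and v directly: 2 * 32 * 128 = 8192 floats per token,
# vs. 512 for the latent. Exploiting that saving inside a paged KV cache
# is what requires new attention kernels.
```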
Hi Genius Li @lzhangzz, DeepSeek-V2 has become the new SOTA model thanks to its excellent performance and effectiveness. Does TurboMind plan to support it? I know that supporting it in TurboMind is far more complex than in PyTorch, but if possible, the performance would be very impressive.
Mark! I am looking forward to the DeepSeek-V2 model being integrated into the TurboMind engine. The performance of DeepSeek-V2 is excellent!
@Vincent131499 We conducted an internal survey on MLA. Implementing it according to the method in the paper requires kernel modifications, which is a significant amount of engineering work. If you are interested, you can refer to the blog for the derivation process.
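For reference, the core of that derivation in condensed form (my own paraphrase, not quoted from the blog; $W^{UQ}$ and $W^{UK}$ denote a head's query/key up-projection matrices, $c^Q_t$ and $c^{KV}_s$ the query and KV latents, and RoPE is ignored):

```math
q_t^\top k_s = \left(W^{UQ} c^Q_t\right)^\top \left(W^{UK} c^{KV}_s\right) = \left(c^Q_t\right)^\top \left(\left(W^{UQ}\right)^\top W^{UK}\right) c^{KV}_s
```

So $(W^{UQ})^\top W^{UK}$ can be absorbed into the query projection and attention can operate directly on the cached latents $c^{KV}$. Supporting this cache layout, plus the decoupled RoPE dimensions (position rotation does not commute with the absorbed low-rank projections), is what forces the kernel changes.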