- MLA implementation, hinted at by https://kexue.fm/archives/10091. K and V share the same cache blocks.
- Support shared KV in paged attention to reduce smem usage.
- `q_a_proj`, `kv_a_proj_with_mqa` in the attention layer, `gate`...
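To make the shared-cache idea concrete, here is a minimal sketch of a paged cache whose K and V views alias one block pool. The class, method names, and latent layout are illustrative assumptions, not lmdeploy's actual cache API:

```python
import torch


class SharedKVPagedCache:
    """Illustrative paged cache where K and V share the same blocks (MLA-style)."""

    def __init__(self, num_blocks: int, block_size: int, latent_dim: int,
                 dtype=torch.float16):
        # Single block pool: with MLA the compressed per-token latent serves as
        # both key and value, so no separate V pool is allocated.
        self.blocks = torch.zeros(num_blocks, block_size, latent_dim, dtype=dtype)

    def fill(self, block_ids: torch.Tensor, slot_ids: torch.Tensor,
             latent_kv: torch.Tensor):
        # Scatter each token's compressed KV latent into its (block, slot) position.
        self.blocks[block_ids, slot_ids] = latent_kv

    @property
    def k_cache(self) -> torch.Tensor:
        return self.blocks

    @property
    def v_cache(self) -> torch.Tensor:
        # Same storage as k_cache: the paged-attention kernel can load each
        # block once for both K and V, which is what cuts smem usage.
        return self.blocks
```

Because both views alias one pool, cache memory and per-block loads inside the kernel drop to roughly half of a layout with separate K and V pools.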
Optimize TP model loading.

## requirement

- [x] #1520
With long contexts, moving logits to the host is time consuming. This PR stops outputting full logits when no request requires `return_logits`. `exp` is expensive in CUDA (`ex2.approx.f32`)....
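A minimal sketch of the logits gating, assuming hypothetical request objects carrying a `return_logits` flag (the names are illustrative, not the engine's actual interface):

```python
from typing import Optional, Sequence

import torch


def maybe_output_logits(logits: torch.Tensor,
                        requests: Sequence) -> Optional[torch.Tensor]:
    """Copy full logits to the host only when some request asked for them."""
    if not any(getattr(req, 'return_logits', False) for req in requests):
        # Common path: skip the expensive device-to-host transfer entirely.
        return None
    # Non-blocking copy so the transfer can overlap with later GPU work.
    return logits.to('cpu', non_blocking=True)
```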
It is hard to switch kernel implementations in PyTorch Engine, and patching transformers models makes it difficult for us to carry out more aggressive optimizations. This PR plans to...
Triton 3 has moved the CUDA fast-math location. This PR supports fast `expf` in paged attention with Triton 3.0.

> [!NOTE]
> Non-CUDA backends might not work.

The fill kv...
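A hedged sketch of how the kernel could pick its exp under either Triton version. The only assumption is that Triton 2.x exposed `tl.math.fast_expf` while Triton 3.0 relocated the CUDA fast-math helpers; rather than guessing the new module path, the sketch falls back to the always-available `tl.exp`:

```python
import triton
import triton.language as tl
from packaging import version

IS_TRITON3 = version.parse(triton.__version__) >= version.parse('3.0.0')

if IS_TRITON3:
    # Triton 3 moved the CUDA fast-math helpers; point this at the new
    # location once it is wired up, and use plain tl.exp until then.
    fast_expf = tl.exp
else:
    # Older Triton: use the approximate expf (ex2.approx.f32 path) if present.
    _math = getattr(tl, 'math', None)
    fast_expf = getattr(_math, 'fast_expf', tl.exp)
```

The selected `fast_expf` would then be referenced from inside the `@triton.jit` paged-attention kernel in place of a direct `tl.exp` call.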
fix https://github.com/InternLM/lmdeploy/issues/2544