q yao

Results: 34 issues by q yao

- MLA implementation hinted by https://kexue.fm/archives/10091. K and V share the same cache blocks (see the sketch after this list).
- Support shared KV in paged attention to reduce smem usage.
- `q_a_proj`, `kv_a_proj_with_mqa` in the attention layer, `gate`...
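
A minimal sketch of the shared-cache idea, assuming MLA stores one low-rank latent per token that serves as both K and V. The function and tensor names are illustrative, not lmdeploy's actual API:

```python
import torch


def write_shared_kv_cache(
    kv_latent: torch.Tensor,     # [num_tokens, kv_lora_rank], e.g. from kv_a_proj_with_mqa
    cache: torch.Tensor,         # [num_blocks, block_size, kv_lora_rank]
    slot_mapping: torch.Tensor,  # [num_tokens], flat cache slot per token
) -> None:
    """Write one compressed latent per token; K and V both derive from it."""
    block_size = cache.size(1)
    block_ids = slot_mapping // block_size
    offsets = slot_mapping % block_size
    # A single write serves both K and V: the attention kernel up-projects
    # the latent (or absorbs that projection into the q/o weights), so no
    # separate value cache is needed and shared-memory pressure drops.
    cache[block_ids, offsets] = kv_latent
```

Because K and V never diverge in the cache, a paged-attention kernel can load each block once and reuse it for both operands.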

Optimize TP model loading.

## requirement

- [x] #1520
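
A guess at the mechanism, for illustration only: with tensor parallelism, each rank can narrow checkpoint tensors to its own shard instead of materializing full weights on every rank. `load_tp_shard` is a hypothetical helper, not lmdeploy's loader:

```python
import torch


def load_tp_shard(full_weight: torch.Tensor, rank: int,
                  world_size: int, dim: int = 0) -> torch.Tensor:
    """Slice out this rank's shard along `dim` (assumes an even split)."""
    shard_size = full_weight.size(dim) // world_size
    # narrow() is a view, so peak memory per rank stays at one full tensor
    # per checkpoint file rather than a full copy of the whole model.
    return full_weight.narrow(dim, rank * shard_size, shard_size).contiguous()
```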

With long contexts, moving logits to the host is time-consuming. This PR stops outputting the full logits when no request requires `return_logits`. ![XgIBy2ZTet](https://github.com/InternLM/lmdeploy/assets/1239736/ea6db3c4-88ca-443b-a889-f349e3b153fd) `exp` is expensive in CUDA (the fast path lowers to `ex2.approx.f32`)...
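
On the `exp` point: a standard trick is to rewrite exp(x) as 2^(x * log2(e)), since the base-2 exponential maps to the single approximate PTX instruction `ex2.approx.f32` on NVIDIA GPUs. A Triton sketch (illustrative, not the PR's code):

```python
import triton
import triton.language as tl


@triton.jit
def fast_exp(x):
    # exp(x) == 2 ** (x * log2(e)); on NVIDIA GPUs tl.exp2 compiles down
    # to the approximate ex2.approx.f32 instruction, far cheaper than a
    # full-precision exp expansion.
    LOG2_E: tl.constexpr = 1.4426950408889634
    return tl.exp2(x * LOG2_E)
```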

It is hard to switch kernel implementations in the PyTorch Engine, and patching transformers models makes it difficult for us to carry out more aggressive optimizations. This PR plans to...
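
One way to make kernels swappable without patching model code is an explicit dispatch registry. A hypothetical sketch; none of these names come from lmdeploy:

```python
from typing import Callable, Dict, Tuple

_KERNELS: Dict[Tuple[str, str], Callable] = {}


def register_kernel(op: str, backend: str) -> Callable:
    """Register an implementation of `op` for one backend."""
    def decorator(fn: Callable) -> Callable:
        _KERNELS[(op, backend)] = fn
        return fn
    return decorator


def get_kernel(op: str, backend: str) -> Callable:
    # Models look kernels up by name instead of importing a fixed function,
    # so a backend can be swapped without touching model definitions.
    return _KERNELS[(op, backend)]


@register_kernel('paged_attention', 'cuda')
def paged_attention_cuda(q, k_cache, v_cache, block_table):
    ...  # the actual triton/cuda implementation would live here
```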

Triton 3 has moved the location of the CUDA fast-math functions. This PR supports fast expf in paged attention with Triton 3.0.

> [!NOTE]
> Non-CUDA backends might not work.

The fill kv...
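
A minimal sketch of the version gate this implies, assuming the Triton 2.x location was `triton.language.math` and the 3.0 location is `triton.language.extra.cuda.libdevice` (verify against the Triton release you target):

```python
import triton
from packaging import version

# Import fast_expf from wherever the installed Triton keeps its libdevice
# bindings; Triton 3.0 moved them out of triton.language.math.
if version.parse(triton.__version__) >= version.parse('3.0.0'):
    from triton.language.extra.cuda.libdevice import fast_expf
else:
    from triton.language.math import fast_expf
```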

fix https://github.com/InternLM/lmdeploy/issues/2544
