- MLA implementation, hinted at by https://kexue.fm/archives/10091. K and V share the same cache blocks.
- Support shared KV in paged attention to reduce smem usage.
- `q_a_proj`, `kv_a_proj_with_mqa` in the attention layer, `gate`...
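To make the shared-cache idea concrete, here is a minimal sketch of a paged cache whose K and V views alias one block pool. The class, method names, and latent layout are illustrative assumptions, not lmdeploy's actual cache API:

```python
import torch


class SharedKVPagedCache:
    """Illustrative paged cache where K and V share the same blocks (MLA-style)."""

    def __init__(self, num_blocks: int, block_size: int, latent_dim: int,
                 dtype=torch.float16):
        # Single block pool: with MLA the compressed per-token latent serves as
        # both key and value, so no separate V pool is allocated.
        self.blocks = torch.zeros(num_blocks, block_size, latent_dim, dtype=dtype)

    def fill(self, block_ids: torch.Tensor, slot_ids: torch.Tensor,
             latent_kv: torch.Tensor):
        # Scatter each token's compressed KV latent into its (block, slot) position.
        self.blocks[block_ids, slot_ids] = latent_kv

    @property
    def k_cache(self) -> torch.Tensor:
        return self.blocks

    @property
    def v_cache(self) -> torch.Tensor:
        # Same storage as k_cache: the paged-attention kernel can load each
        # block once for both K and V, which is what cuts smem usage.
        return self.blocks
```

Because both views alias one pool, cache memory and per-block loads inside the kernel drop to roughly half of a layout with separate K and V pools.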
Optimize TP model loading.

## requirement

- [x] #1520
With long contexts, moving logits to the host is time consuming. This PR stops outputting full logits when no request requires `return_logits`. `exp` is expensive in CUDA (`ex2.approx.f32`)....
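A minimal sketch of the logits gating, assuming hypothetical request objects carrying a `return_logits` flag (the names are illustrative, not the engine's actual interface):

```python
from typing import Optional, Sequence

import torch


def maybe_output_logits(logits: torch.Tensor,
                        requests: Sequence) -> Optional[torch.Tensor]:
    """Copy full logits to the host only when some request asked for them."""
    if not any(getattr(req, 'return_logits', False) for req in requests):
        # Common path: skip the expensive device-to-host transfer entirely.
        return None
    # Non-blocking copy so the transfer can overlap with later GPU work.
    return logits.to('cpu', non_blocking=True)
```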
It is hard to switch kernel implementations in PyTorch Engine, and patching transformers models makes it difficult for us to carry out more aggressive optimizations. This PR plans to...
Triton 3 has moved the CUDA fast-math location. This PR supports fast `expf` in paged attention with Triton 3.0.

> [!NOTE]
> Non-CUDA backends might not work.

The fill kv...
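A hedged sketch of how the kernel could pick its exp under either Triton version. The only assumption is that Triton 2.x exposed `tl.math.fast_expf` while Triton 3.0 relocated the CUDA fast-math helpers; rather than guessing the new module path, the sketch falls back to the always-available `tl.exp`:

```python
import triton
import triton.language as tl
from packaging import version

IS_TRITON3 = version.parse(triton.__version__) >= version.parse('3.0.0')

if IS_TRITON3:
    # Triton 3 moved the CUDA fast-math helpers; point this at the new
    # location once it is wired up, and use plain tl.exp until then.
    fast_expf = tl.exp
else:
    # Older Triton: use the approximate expf (ex2.approx.f32 path) if present.
    _math = getattr(tl, 'math', None)
    fast_expf = getattr(_math, 'fast_expf', tl.exp)
```

The selected `fast_expf` would then be referenced from inside the `@triton.jit` paged-attention kernel in place of a direct `tl.exp` call.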
fix https://github.com/InternLM/lmdeploy/issues/2544