# Ke Bao

Results: 9 issues

## Motivation

https://github.com/InternLM/lmdeploy/issues/1407

## Modification

- [x] Turbomind change
- [ ] Add a CLI option after https://github.com/InternLM/lmdeploy/pull/1429 is merged
- [x] Benchmark and evaluation
- [x] Compatibility testing with AWQ, online...

enhancement

### Motivation

Prefix caching is supported in many projects such as vLLM, SGLang, and rtp-llm. The Torch engine is going to support this feature in https://github.com/InternLM/lmdeploy/pull/1393. So we raise this issue...

### Checklist

- [X] 1. I have searched related issues but cannot get the expected help.
- [X] 2. The bug has not been fixed in the latest version.

###...

## Motivation

Update the documentation for prefix caching.

## Modification

- Add `turbomind_config.md` to the index
- Add a prefix cache introduction

documentation

### Motivation

@lvhan028 @grimoire @lzhangzz Do you have a plan to support the [DeepSeek-V2](https://github.com/deepseek-ai/DeepSeek-V2) model?

### Related resources

_No response_

### Additional context

_No response_

## Motivation

Optimize memory access for MLA/GQA/MQA decoding.

## Modification

One block handles `BLOCK_H` q heads that share a k/v head. Inspired by https://github.com/InternLM/lmdeploy/pull/1649.
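A rough sketch of the head-to-block mapping that the modification describes, assuming GQA-style grouping where `group_size` q heads read the same k/v head; the names and sizes below are illustrative, not the PR's kernel code:

```python
# Illustrative GQA head grouping (all sizes made up for this sketch).
num_q_heads, num_kv_heads, BLOCK_H = 32, 4, 8
group_size = num_q_heads // num_kv_heads  # q heads per shared k/v head

for block_id in range(num_q_heads // BLOCK_H):
    q_heads = list(range(block_id * BLOCK_H, (block_id + 1) * BLOCK_H))
    kv_heads = {h // group_size for h in q_heads}
    # With BLOCK_H <= group_size, each block touches exactly one k/v head,
    # so its k/v tiles are loaded once and reused by all BLOCK_H q heads.
    print(block_id, q_heads, kv_heads)
```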

high priority
performance

## Motivation

https://github.com/InternLM/lmdeploy/issues/1942

## Modification

Log prefix cache statistics.

## Checklist

1. Pre-commit or other linting tools are used to fix the potential lint issues.
2. The modification is covered...

## Motivation

For draft decode in speculative decoding, the max batch size should be `req_to_token_pool.size * topk`, since each request in the pool can expand into `topk` draft candidates.
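A tiny numeric illustration of that bound, with made-up values standing in for `req_to_token_pool.size` and `topk`:

```python
# Illustrative only: each pooled request can branch into `topk` draft
# candidates, so draft decode may see pool_size * topk sequences at once.
pool_size, topk = 64, 4
max_draft_batch_size = pool_size * topk  # 256
```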

[Here](https://github.com/flashinfer-ai/flashinfer/blob/e19926bf09b521e42e577f71c294f54e4f6a7a72/flashinfer/cute_dsl/blockscaled_gemm.py#L2665-L2689) we create the `sfa_tensor` with shape `(l, rm, rk, atom_m_0, atom_m_1, atom_k)` and order `(3, 4, 1, 5, 2, 0)`. So the stride should be `(atom_k*rk*atom_m_1, atom_k*rk*atom_m_1*l, atom_k, atom_k*rk*atom_m_1*l*rm,...
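The visible stride entries are consistent with the generalized column-major rule, where the dimension with order 0 is innermost (stride 1) and each higher-ordered dimension's stride is the product of the sizes of all lower-ordered dimensions. A minimal sketch under that assumption (the function name is mine, not flashinfer's API):

```python
def strides_from_order(shape, order):
    """Generalized column-major strides: order 0 is the innermost dim."""
    strides = [0] * len(shape)
    stride = 1
    for dim in sorted(range(len(shape)), key=order.__getitem__):
        strides[dim] = stride
        stride *= shape[dim]
    return tuple(strides)

# Plug in small concrete sizes for (l, rm, rk, atom_m_0, atom_m_1, atom_k).
l, rm, rk, atom_m_0, atom_m_1, atom_k = 2, 3, 4, 5, 6, 7
print(strides_from_order((l, rm, rk, atom_m_0, atom_m_1, atom_k),
                         (3, 4, 1, 5, 2, 0)))
# -> (168, 336, 7, 1008, 28, 1), i.e.
#    (atom_k*rk*atom_m_1, atom_k*rk*atom_m_1*l, atom_k,
#     atom_k*rk*atom_m_1*l*rm, atom_k*rk, 1),
# whose first four entries match the tuple quoted above.
```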

question
cute-dsl