Ke Bao

Results: 60 comments of Ke Bao

@grimoire > When do we need general cache? For example, with seq1 `xxxxyyyyzzzz`, seq2 `yyyyzzzz`, and 4 tokens per block, a general cache means seq2 may reuse the last 2 cached blocks of...

[vllm](https://github.com/vllm-project/vllm/issues/2614#issue-2101770432) didn't adopt the radix tree implementation because it is harder to maintain: > ## Major benefits of this design over a KV block Trie > - Sometimes, caching is not...
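
A minimal sketch of the difference being discussed, assuming a plain dict as the block store, 4 tokens per block, and illustrative helper names (none of this is taken from either codebase): a general cache keys each block by its own tokens, so seq2 can reuse seq1's last two blocks, while a prefix-keyed scheme like the hash-based design vLLM chose only hits when the full prefix matches.

```python
from typing import Dict, List, Tuple

BLOCK = 4  # tokens per block, matching the example above
Block = Tuple[str, ...]

def split_blocks(tokens: List[str]) -> List[Block]:
    """Split a sequence into full blocks of BLOCK tokens."""
    full = len(tokens) // BLOCK * BLOCK
    return [tuple(tokens[i:i + BLOCK]) for i in range(0, full, BLOCK)]

def cache_general(cache: Dict[Block, str], tokens: List[str]) -> None:
    """General cache: key each block by its own tokens only."""
    for b in split_blocks(tokens):
        cache[b] = "kv-block"

def cache_prefix(cache: Dict[Block, str], tokens: List[str]) -> None:
    """Prefix cache: the key covers the whole prefix up to and including the block."""
    prefix: Block = ()
    for b in split_blocks(tokens):
        prefix += b
        cache[prefix] = "kv-block"

def hits_general(cache: Dict[Block, str], tokens: List[str]) -> int:
    return sum(b in cache for b in split_blocks(tokens))

def hits_prefix(cache: Dict[Block, str], tokens: List[str]) -> int:
    hits, prefix = 0, ()
    for b in split_blocks(tokens):
        prefix += b
        if prefix not in cache:
            break
        hits += 1
    return hits

seq1, seq2 = list("xxxxyyyyzzzz"), list("yyyyzzzz")

general: Dict[Block, str] = {}
cache_general(general, seq1)
print(hits_general(general, seq2))  # 2 -> seq2 reuses seq1's last two blocks

prefixed: Dict[Block, str] = {}
cache_prefix(prefixed, seq1)
print(hits_prefix(prefixed, seq2))  # 0 -> no shared prefix, nothing reused
```

Both the flat per-prefix keys and a radix tree capture only prefix reuse; the linked issue's case for the flat keys is mainly that they are simpler to maintain and evict.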

Tested on A100-80G: DeepSeek-V2-Lite

```
============ Serving Benchmark Result ============
Backend:                        sglang
Traffic request rate:           128.0
Successful requests:            5000
Benchmark duration (s):         238.01
Total input tokens:             1187865
Total generated tokens:         ...
```

> Nice work! TLDR: Reuse from L2 to block. Is it right? @ispobock @zhyncs The previous version reused the shared k/v head from the L2 cache; this version reuses it from SMEM.
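
For context, a small PyTorch sketch of why the shared k/v head is worth staging in fast memory (the head counts and shapes are illustrative, not the kernel's actual configuration): in grouped-query attention each k/v head serves `num_q_heads // num_kv_heads` query heads, so its data is re-read once per query head in the group, and the kernel change above moves that reuse from L2 into SMEM.

```python
import torch

# Illustrative GQA decode shapes, not the kernel's real configuration.
num_q_heads, num_kv_heads, head_dim, seq_len = 32, 8, 128, 1024
group = num_q_heads // num_kv_heads  # 4 query heads share each k/v head

q = torch.randn(num_q_heads, 1, head_dim)         # one decode-step query
k = torch.randn(num_kv_heads, seq_len, head_dim)  # cached keys
v = torch.randn(num_kv_heads, seq_len, head_dim)  # cached values
out = torch.empty(num_q_heads, 1, head_dim)

for kv_head in range(num_kv_heads):
    # The shared k/v head is loaded once and reused by `group` query heads;
    # in the CUDA kernel this reuse now happens from SMEM instead of L2.
    k_h, v_h = k[kv_head], v[kv_head]
    for g in range(group):
        q_head = kv_head * group + g
        scores = (q[q_head] @ k_h.T) / head_dim ** 0.5
        out[q_head] = scores.softmax(dim=-1) @ v_h
```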

> I don't see any noticeable improvement in llama3 8b FP8. @81549361 Did you add `--disable-flashinfer` for both branches on llama3?

@microwish Yeah, we profiled first and found that the decoding kernel took most of the time. Then we checked the kernel with ncu and got some directions for...
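
A small sketch of the first step, using `torch.profiler` on a stand-in workload (the linear layer below is only a placeholder for the real decoding pass): sorting kernels by GPU time is enough to see which kernel dominates before inspecting it in detail with ncu.

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Stand-in workload; replace with the real decoding pass.
model = torch.nn.Linear(4096, 4096).cuda().half()
x = torch.randn(8, 4096, device="cuda", dtype=torch.half)

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    with torch.no_grad():
        for _ in range(20):
            model(x)

# The kernel at the top of this table is the one to dig into with ncu.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```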

@zhyncs Try specifying `--dtype` as `float16` for the T4. ref: https://github.com/state-spaces/mamba/issues/361#issuecomment-2181263738
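
The underlying reason, as a short sketch (the check itself, independent of any particular launcher's flag handling): the T4 is a Turing (sm_75) GPU without bfloat16 support, so a bf16 default has to be overridden to float16 there.

```python
import torch

# Fall back to float16 on GPUs without bfloat16 support (e.g. T4 / sm_75).
dtype = torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16
print(f"Using {dtype} on {torch.cuda.get_device_name(0)}")
```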

@lvhan028 @AllentDan Could you help review this PR? Do you have any suggestions for the API changes mentioned in https://github.com/InternLM/lmdeploy/pull/2018#discussion_r1677097726?

@lvhan028 @josephrocca The `prefix_cached_tokens` field is added to `usage` (see the following example). Please help check and review. server:

```
lmdeploy serve api_server /workdir/llama2_13b_chat --server-name 0.0.0.0 --server-port 23333 --tp 1...
```

@josephrocca Added `usage` to the final stream response, please check again.
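
A minimal client-side sketch of reading the new field, assuming the api_server launched above and its OpenAI-compatible `/v1/chat/completions` route (the model name and everything in the response besides `usage` are illustrative):

```python
import requests

resp = requests.post(
    "http://0.0.0.0:23333/v1/chat/completions",
    json={
        "model": "llama2_13b_chat",  # illustrative; use the name the server reports
        "messages": [{"role": "user", "content": "Hello"}],
    },
)
usage = resp.json()["usage"]
# With this change, usage carries prefix_cached_tokens alongside the usual
# prompt_tokens / completion_tokens / total_tokens counters.
print(usage.get("prefix_cached_tokens"))
```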