Ke Bao

60 comments by Ke Bao

I evaluated KV Cache INT8 on the llama2 and llama3 models and got the following results:

| dataset | metrics | llama2-13b-chat | llama2-13b-chat-kvint8 | llama3-8b | llama3-8b-kvint8 | llama3-80b...
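
For context, a minimal sketch of how online KV cache INT8 can be enabled in LMDeploy (assuming lmdeploy >= 0.4.0, where `quant_policy=8` selects 8-bit and `quant_policy=4` selects 4-bit online KV quantization; the model path and prompt are placeholders):

```python
# Sketch: enabling online KV cache INT8 quantization via the Turbomind engine.
# Assumes lmdeploy >= 0.4.0 semantics: quant_policy=8 -> INT8 KV cache,
# quant_policy=4 -> INT4, 0 -> no KV quantization. Model path is a placeholder.
from lmdeploy import pipeline, TurbomindEngineConfig

engine_config = TurbomindEngineConfig(
    quant_policy=8,             # online KV cache INT8
    cache_max_entry_count=0.8,  # fraction of free GPU memory for the KV cache
)
pipe = pipeline('meta-llama/Llama-2-13b-chat-hf', backend_config=engine_config)
print(pipe(['Hi, please introduce yourself.']))
```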

> OpenCompass team said WiC and WSC can be neglected

OK, got it.

Benchmarked with the method mentioned in https://github.com/InternLM/lmdeploy/issues/1407#issuecomment-2046702194. Settings:

```
engine: Turbomind
model: llama2-13B-chat
num prompts: 1000
```

Using the [LMDeploy benchmark script](https://github.com/InternLM/lmdeploy/blob/main/benchmark/profile_restful_api.py) (used in https://github.com/InternLM/lmdeploy/pull/1429#issuecomment-2063156779):

w/o prefix caching:

```
concurrency: 128
elapsed_time:...
```
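
A hedged sketch of how the two configurations for such an A/B run can be expressed: LMDeploy's Turbomind engine config exposes an `enable_prefix_caching` flag, so the same workload can be repeated with caching off and on. The model path and cache fraction below are illustrative placeholders; the benchmark script itself drives the served API rather than this pipeline object.

```python
# Sketch: building the "w/o prefix caching" and "with prefix caching" engine
# configurations for an A/B benchmark. enable_prefix_caching is a real
# TurbomindEngineConfig field; the model path is a placeholder.
from lmdeploy import pipeline, TurbomindEngineConfig

for prefix_caching in (False, True):
    cfg = TurbomindEngineConfig(
        enable_prefix_caching=prefix_caching,
        cache_max_entry_count=0.8,
    )
    pipe = pipeline('meta-llama/Llama-2-13b-chat-hf', backend_config=cfg)
    # ... run the same 1000-prompt workload against each configuration ...
```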

Evaluation result for Internlm2-7b with prefix caching:

```
dataset                                 version    metric         mode      internlm2-7b-turbomind
--------------------------------------  ---------  -------------  ------  ------------------------
--------- 考试 Exam ---------           -          -              -       -
ceval                                   -          naive_average  gen     ...
```

> We plan to release v0.4.0 next Tuesday, mainly focusing on new VLMs support and kv4/8 quantization and inference. Regarding the prefix caching of both engines, I would like to...

> There are definitely conflicts due to https://github.com/InternLM/lmdeploy/pull/1458

Got it. It seems there is no big conflict with this feature.

The evaluation result for `Turbomind prefix caching` + `AWQ` + `online kv cache int4` + `tp2`:

```
dataset                                 version    metric         mode    internlm2-chat-7b-4bits-turbomind
--------------------------------------  ---------  -------------  ------  ----------------------------------
--------- 考试 Exam...
```
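
A hedged sketch of how this combination of features can be requested through the Turbomind config (assuming lmdeploy >= 0.4.0; `model_format`, `quant_policy`, `tp`, and `enable_prefix_caching` are real `TurbomindEngineConfig` fields, while the model path is an illustrative placeholder):

```python
# Sketch: Turbomind prefix caching + AWQ weights + online KV cache INT4 + tp2.
# Field names are real TurbomindEngineConfig fields (lmdeploy >= 0.4.0);
# the model path is a placeholder for an AWQ-quantized checkpoint.
from lmdeploy import pipeline, TurbomindEngineConfig

cfg = TurbomindEngineConfig(
    model_format='awq',          # 4-bit AWQ-quantized weights
    quant_policy=4,              # online KV cache INT4 (8 would mean INT8)
    tp=2,                        # tensor parallelism across 2 GPUs
    enable_prefix_caching=True,  # reuse KV blocks of shared prompt prefixes
)
pipe = pipeline('internlm/internlm2-chat-7b-4bits', backend_config=cfg)
```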

> Also, in the current implementation, (re)-computation of shared blocks are not shared (even though the memory blocks are shared and may be re-written multiple times)

@lzhangzz In the current implementation,...

> once a cache block is evicted, sharing that block seems problematic.

@lzhangzz In the current setting, only blocks with `use_count = 1` (only the block trie holds the `use_count`)...
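
To make the `use_count` discussion concrete, here is a minimal, hypothetical sketch of a block trie with reference counting in the style described above. It is not LMDeploy's actual code; names like `TrieNode` and `match_prefix` are invented for illustration. The trie itself holds one reference per cached block, so a block whose `use_count` is 1 is referenced by nobody but the trie and is a safe eviction candidate; blocks matched by a request are pinned by incrementing the count, and their already-computed KV can then be reused instead of recomputed.

```python
# Hypothetical sketch of prefix-caching bookkeeping: a trie over block-sized
# token chunks, with use_count tracking references. The trie's own reference
# keeps use_count >= 1; only nodes with use_count == 1 and no children may be
# evicted. This is illustrative, not LMDeploy's actual implementation.
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class TrieNode:
    block_id: int = -1     # KV cache block backing this node
    use_count: int = 1     # the trie itself holds one reference
    children: Dict[Tuple[int, ...], 'TrieNode'] = field(default_factory=dict)


class BlockTrie:
    def __init__(self, block_size: int):
        self.block_size = block_size
        self.root = TrieNode()

    def match_prefix(self, tokens: List[int]) -> List[TrieNode]:
        """Walk the trie over block-sized chunks, pinning each matched node.

        The returned nodes' block_ids identify KV blocks whose computation
        can be skipped for this request.
        """
        node, matched = self.root, []
        for i in range(0, len(tokens) - self.block_size + 1, self.block_size):
            key = tuple(tokens[i:i + self.block_size])
            child = node.children.get(key)
            if child is None:
                break
            child.use_count += 1   # pin: the request now shares this block
            matched.append(child)
            node = child
        return matched

    def release(self, nodes: List[TrieNode]) -> None:
        """Drop a finished request's references to its matched blocks."""
        for node in nodes:
            node.use_count -= 1

    def evictable(self, node: TrieNode) -> bool:
        # Safe to evict only when the trie is the sole holder and no cached
        # child block depends on this one as its prefix.
        return node.use_count == 1 and not node.children
```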