Ke Bao

60 comments by Ke Bao

I evaluated KV Cache INT8 on the llama2 and llama3 models and got the following results:

| dataset | metrics | llama2-13b-chat | llama2-13b-chat-kvint8 | llama3-8b | llama3-8b-kvint8 | llama3-80b...
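
For context, a minimal sketch of how online KV cache INT8 can be enabled in LMDeploy (assuming lmdeploy >= 0.4.0, where `quant_policy=8` selects 8-bit and `quant_policy=4` selects 4-bit online KV quantization; the model path and prompt are placeholders):

```python
# Sketch: enabling online KV cache INT8 quantization via the Turbomind engine.
# Assumes lmdeploy >= 0.4.0 semantics: quant_policy=8 -> INT8 KV cache,
# quant_policy=4 -> INT4, 0 -> no KV quantization. Model path is a placeholder.
from lmdeploy import pipeline, TurbomindEngineConfig

engine_config = TurbomindEngineConfig(
    quant_policy=8,             # online KV cache INT8
    cache_max_entry_count=0.8,  # fraction of free GPU memory for the KV cache
)
pipe = pipeline('meta-llama/Llama-2-13b-chat-hf', backend_config=engine_config)
print(pipe(['Hi, please introduce yourself.']))
```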

> OpenCompass team said WiC and WSC can be neglected

OK, got it.

Benchmarked with the method mentioned in https://github.com/InternLM/lmdeploy/issues/1407#issuecomment-2046702194. Settings:

```
engine: Turbomind
model: llama2-13B-chat
num prompts: 1000
```

Using the [LMDeploy benchmark script](https://github.com/InternLM/lmdeploy/blob/main/benchmark/profile_restful_api.py) (used in https://github.com/InternLM/lmdeploy/pull/1429#issuecomment-2063156779):

w/o prefix caching:

```
concurrency: 128
elapsed_time:...
```
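
A hedged sketch of how the two configurations for such an A/B run can be expressed: LMDeploy's Turbomind engine config exposes an `enable_prefix_caching` flag, so the same workload can be repeated with caching off and on. The model path and cache fraction below are illustrative placeholders; the benchmark script itself drives the served API rather than this pipeline object.

```python
# Sketch: building the "w/o prefix caching" and "with prefix caching" engine
# configurations for an A/B benchmark. enable_prefix_caching is a real
# TurbomindEngineConfig field; the model path is a placeholder.
from lmdeploy import pipeline, TurbomindEngineConfig

for prefix_caching in (False, True):
    cfg = TurbomindEngineConfig(
        enable_prefix_caching=prefix_caching,
        cache_max_entry_count=0.8,
    )
    pipe = pipeline('meta-llama/Llama-2-13b-chat-hf', backend_config=cfg)
    # ... run the same 1000-prompt workload against each configuration ...
```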

Evaluation result for Internlm2-7b with prefix caching:

```
dataset                                 version    metric         mode      internlm2-7b-turbomind
--------------------------------------  ---------  -------------  ------  ------------------------
--------- 考试 Exam ---------           -          -              -       -
ceval                                   -          naive_average  gen     ...
```

> We plan to release v0.4.0 next Tuesday, mainly focusing on new VLMs support and kv4/8 quantization and inference. Regarding the prefix caching of both engines, I would like to...

> There are definitely conflicts due to https://github.com/InternLM/lmdeploy/pull/1458

Got it. It seems there is no big conflict with this feature.

The evaluation result for `Turbomind prefix caching` + `AWQ` + `online kv cache int4` + `tp2`:

```
dataset                                 version    metric         mode    internlm2-chat-7b-4bits-turbomind
--------------------------------------  ---------  -------------  ------  ----------------------------------
--------- 考试 Exam...
```
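
A hedged sketch of how this combination of features can be requested through the Turbomind config (assuming lmdeploy >= 0.4.0; `model_format`, `quant_policy`, `tp`, and `enable_prefix_caching` are real `TurbomindEngineConfig` fields, while the model path is an illustrative placeholder):

```python
# Sketch: Turbomind prefix caching + AWQ weights + online KV cache INT4 + tp2.
# Field names are real TurbomindEngineConfig fields (lmdeploy >= 0.4.0);
# the model path is a placeholder for an AWQ-quantized checkpoint.
from lmdeploy import pipeline, TurbomindEngineConfig

cfg = TurbomindEngineConfig(
    model_format='awq',          # 4-bit AWQ-quantized weights
    quant_policy=4,              # online KV cache INT4 (8 would mean INT8)
    tp=2,                        # tensor parallelism across 2 GPUs
    enable_prefix_caching=True,  # reuse KV blocks of shared prompt prefixes
)
pipe = pipeline('internlm/internlm2-chat-7b-4bits', backend_config=cfg)
```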

> Also, in the current implementation, (re)-computation of shared blocks are not shared (even though the memory blocks are shared and may be re-written multiple times)

@lzhangzz In the current implementation,...

> once a cache block is evicted, sharing that block seems problematic.

@lzhangzz In the current setting, only blocks with `use_count = 1` (only the block trie holds the `use_count`)...
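
To make the `use_count` discussion concrete, here is a minimal, hypothetical sketch of a block trie with reference counting in the style described above. It is not LMDeploy's actual code; names like `TrieNode` and `match_prefix` are invented for illustration. The trie itself holds one reference per cached block, so a block whose `use_count` is 1 is referenced by nobody but the trie and is a safe eviction candidate; blocks matched by a request are pinned by incrementing the count, and their already-computed KV can then be reused instead of recomputed.

```python
# Hypothetical sketch of prefix-caching bookkeeping: a trie over block-sized
# token chunks, with use_count tracking references. The trie's own reference
# keeps use_count >= 1; only nodes with use_count == 1 and no children may be
# evicted. This is illustrative, not LMDeploy's actual implementation.
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class TrieNode:
    block_id: int = -1     # KV cache block backing this node
    use_count: int = 1     # the trie itself holds one reference
    children: Dict[Tuple[int, ...], 'TrieNode'] = field(default_factory=dict)


class BlockTrie:
    def __init__(self, block_size: int):
        self.block_size = block_size
        self.root = TrieNode()

    def match_prefix(self, tokens: List[int]) -> List[TrieNode]:
        """Walk the trie over block-sized chunks, pinning each matched node.

        The returned nodes' block_ids identify KV blocks whose computation
        can be skipped for this request.
        """
        node, matched = self.root, []
        for i in range(0, len(tokens) - self.block_size + 1, self.block_size):
            key = tuple(tokens[i:i + self.block_size])
            child = node.children.get(key)
            if child is None:
                break
            child.use_count += 1   # pin: the request now shares this block
            matched.append(child)
            node = child
        return matched

    def release(self, nodes: List[TrieNode]) -> None:
        """Drop a finished request's references to its matched blocks."""
        for node in nodes:
            node.use_count -= 1

    def evictable(self, node: TrieNode) -> bool:
        # Safe to evict only when the trie is the sole holder and no cached
        # child block depends on this one as its prefix.
        return node.use_count == 1 and not node.children
```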