engine_config = TurbomindEngineConfig(tp=2, quant_policy=0, cache_max_entry_count=0.2, session_len=4096)  # quant_policy=8
self.pipe = pipeline("InternVL-Chat-V1-5", backend_config=engine_config)

With all other configuration parameters unchanged, setting quant_policy to 8, 0, or 4 makes no difference at all to GPU memory usage or inference speed. Why is that?

Open YangYangTx opened this issue 1 year ago • 1 comments

This is because lmdeploy uses an "aggressive" KV cache memory allocation strategy.

https://lmdeploy.readthedocs.io/en/latest/inference/pipeline.html#usage

See the explanation in the documentation linked above.

Originally posted by @lvhan028 in https://github.com/InternLM/lmdeploy/issues/1626#issuecomment-2122040558
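To illustrate why memory usage can look identical across quant_policy values, here is a minimal sketch in plain Python. It assumes, as the linked documentation describes for recent lmdeploy versions, that cache_max_entry_count is the fraction of free GPU memory (after the weights are loaded) reserved for the KV cache. The free-memory figure and the model shape (layers, KV heads, head dimension) are hypothetical, and the per-token size ignores the scales and zero-points that quantized KV blocks also store. Under these assumptions the reserved bytes do not depend on quant_policy; only the number of tokens that fit in the reservation changes.

```python
# Sketch only: all numbers below are hypothetical, not measured values.
GIB = 1024 ** 3

# Assumed mapping from quant_policy to bits per KV element (0 = fp16, 8 = int8, 4 = int4).
BITS_PER_ELEMENT = {0: 16, 8: 8, 4: 4}


def kv_cache_budget_bytes(free_bytes_after_weights: int, cache_max_entry_count: float) -> float:
    """Bytes reserved up front for the KV cache; independent of quant_policy."""
    return free_bytes_after_weights * cache_max_entry_count


def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int, quant_policy: int) -> float:
    """Approximate KV bytes per token (keys and values), ignoring quantization metadata."""
    bits = BITS_PER_ELEMENT[quant_policy]
    return 2 * num_layers * num_kv_heads * head_dim * bits / 8


free_bytes = 20 * GIB  # hypothetical free memory after loading weights
budget = {p: kv_cache_budget_bytes(free_bytes, 0.2) for p in BITS_PER_ELEMENT}
capacity = {
    p: budget[p] / kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128, quant_policy=p)
    for p in BITS_PER_ELEMENT
}

for p in (0, 8, 4):
    print(f"quant_policy={p}: reserved {budget[p] / GIB:.2f} GiB, ~{capacity[p]:.0f} tokens fit")
```

With this model of the allocator, a tool such as nvidia-smi would report the same usage for all three settings, while the int8 and int4 caches hold roughly two and four times as many tokens as fp16.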

May 22 '24 07:05 YangYangTx


I used the following KV int4/int8 offline inference setup:

engine_config = TurbomindEngineConfig(quant_policy=4)  # quant_policy=8
pipe = pipeline(model_path, backend_config=engine_config)

The model (qwen2-7b) and the test data were kept identical across runs. The measured inference speeds were:

4bit: time is: 367.8703472477694 Output tokens is: 212053.33333333334 IPS: 576.435 tokens/s
8bit: time is: 364.6410761587322 Output tokens is: 211456.66666666666 IPS: 579.904 tokens/s
Original model: time is: 128.10961544762054 Output tokens is: 215506.0 IPS: 1682.200 tokens/s

(The original model was run with: pipe = pipeline(model_path))

Is this result expected?

Jul 31 '24 12:07 JiaXinLI98
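As a quick check of the arithmetic in the figures above, the following sketch recomputes throughput as output tokens divided by elapsed time and derives the slowdown of each quantized run relative to the unquantized one. It assumes the reported "time" values are in seconds; the run labels are mine, and the numbers are copied from the post.

```python
# (elapsed time, output tokens) as reported in the post; time unit assumed to be seconds.
runs = {
    "kv-int4": (367.8703472477694, 212053.33333333334),
    "kv-int8": (364.6410761587322, 211456.66666666666),
    "baseline": (128.10961544762054, 215506.0),
}


def throughput(elapsed_s: float, output_tokens: float) -> float:
    """Output tokens generated per second."""
    return output_tokens / elapsed_s


rates = {name: throughput(t, n) for name, (t, n) in runs.items()}
# Ratio of baseline throughput to each run's throughput (values above 1 mean slower than baseline).
slowdown = {name: rates["baseline"] / rate for name, rate in rates.items()}

for name in runs:
    print(f"{name}: {rates[name]:.3f} tokens/s, {slowdown[name]:.2f}x slower than baseline")
```

The recomputed rates match the reported IPS values, and both quantized runs come out a little under three times slower than the baseline in this measurement.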