lmdeploy [Bug] Why is minicpm-v2_5 three times slower than fp16 after AWQ int4 quantization?

Checklist

[x] 1. I have searched related issues but cannot get the expected help.
[x] 2. The bug has not been fixed in the latest version.
[x] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.

Describe the bug

lmdeploy fp16 inference time: 2541 ms

After int4 quantization, inference time: 8221 ms

Reproduction

Quantization command:

lmdeploy lite auto_awq \
  $HF_MODEL \
  --calib-samples 128 \
  --calib-seqlen 200 \
  --w-bits 4 \
  --w-group-size 128 \
  --batch-size 128 \
  --search-scale False \
  --work-dir $WORK_DIR

Inference code:

def lmdeploy_infer_awq_model(loops=10):
    # `loops` was undefined in the original snippet; the default of 10 is a placeholder.
    import time
    from lmdeploy import pipeline, TurbomindEngineConfig, GenerationConfig
    from lmdeploy.vl import load_image
    gen_config = GenerationConfig(temperature=0.7, top_k=100, top_p=0.8, repetition_penalty=1.05)
    engine_config = TurbomindEngineConfig(model_format='awq')
    model_name = './MiniCPM-Llama3-V-2_5-4bit-a100-40g'
    pipe = pipeline(model_name, backend_config=engine_config)
    image = load_image('demo.png')
    # Prompt text: "Describe the image content in detail"
    prompt = '详细描述图片内容'
    # First call runs outside the timed loop and prints one sample response
    response = pipe((prompt, image), gen_config=gen_config)
    print(response)
    t1 = time.time()
    for _ in range(loops):
        response = pipe((prompt, image), gen_config=gen_config)
    t2 = time.time()
    print(f'Inference {loops} times takes {t2 - t1} seconds')
    print(f"latency:  {(t2 - t1) / loops} seconds")

Environment

cuda11.8, lmdeploy 0.5.2.post

Error traceback

No response

Jul 31 '24 08:07 Howe-Young

Are you sure the number of generated tokens is the same in both runs?

Jul 31 '24 09:07 lvhan028

Set gen_config = GenerationConfig(max_new_tokens=1000, ignore_eos=True). This ensures that exactly 1000 output tokens are generated.

Jul 31 '24 11:07 irexyc

Quoted: "Are you sure the number of generated tokens is the same in both runs?"

Hello, I ran into the same problem. My GPU is an A800. Does the A800 not support quantization acceleration, or is some operator missing?

I used the following KV int4/int8 offline inference setup:

engine_config = TurbomindEngineConfig(quant_policy=4)  # or quant_policy=8
pipe = pipeline(model_path, backend_config=engine_config)

The model (qwen2-7b) and the test data were identical across runs. Measured inference speed:

4bit: time is: 367.8703472477694, Output tokens is: 212053.33333333334, IPS: 576.435 tokens/s
8bit: time is: 364.6410761587322, Output tokens is: 211456.66666666666, IPS: 579.904 tokens/s
Original model: time is: 128.10961544762054, Output tokens is: 215506.0, IPS: 1682.200 tokens/s

(The original model was run with: pipe = pipeline(model_path))

Is this behavior expected?
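As a quick check on the figures in this comment, the sketch below recomputes throughput and compares each run with the unquantized one. The numbers are copied from the comment; treating "time" as seconds is an assumption (it is consistent with the reported tokens/s values), and the dictionary keys and the `throughput` helper are illustrative names only.

```python
# (elapsed time, output tokens) as reported in the comment.
# Assumption: elapsed time is in seconds.
runs = {
    "kv int4": (367.8703472477694, 212053.33333333334),
    "kv int8": (364.6410761587322, 211456.66666666666),
    "original": (128.10961544762054, 215506.0),
}


def throughput(elapsed, tokens):
    """Tokens generated per unit of elapsed time."""
    return tokens / elapsed


base_time, base_tokens = runs["original"]
for name, (elapsed, tokens) in runs.items():
    print(
        f"{name}: {throughput(elapsed, tokens):.1f} tokens/s, "
        f"{tokens / base_tokens:.3f}x tokens, "
        f"{elapsed / base_time:.2f}x time"
    )
```

The output token counts are within about 2% of each other, while the elapsed time is nearly three times longer for the KV-quantized runs, so a difference in output length alone does not account for the gap in these figures.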

Jul 31 '24 12:07 JiaXinLI98

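For a fair comparison, the engine must generate a specified number of tokens in every run. The way to make the engine do that is to set the generation config as @irexyc described above. Did you set it?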
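A minimal sketch of such a comparison, assuming each configuration is wrapped in a callable that runs one request and returns the number of tokens it generated. `measure_tps` and its parameters are hypothetical names, not part of lmdeploy.

```python
import time
from typing import Callable


def measure_tps(generate: Callable[[], int], loops: int = 5, warmup: int = 1) -> float:
    """Return generated tokens per second over `loops` timed calls.

    `generate` runs one request and returns how many tokens it produced.
    """
    # Warm-up calls are excluded so one-off start-up cost is not timed.
    for _ in range(warmup):
        generate()

    total_tokens = 0
    start = time.perf_counter()
    for _ in range(loops):
        total_tokens += generate()
    elapsed = time.perf_counter() - start
    return total_tokens / elapsed


# Hypothetical wiring to lmdeploy (not executed here). It assumes `pipe`,
# `prompt` and `image` already exist and that the response object exposes
# a `generate_token_len` attribute:
#
#   gen_config = GenerationConfig(max_new_tokens=1000, ignore_eos=True)
#   def generate():
#       return pipe((prompt, image), gen_config=gen_config).generate_token_len
#   print(measure_tps(generate))
```

Dividing by the tokens actually produced, rather than by the number of requests, keeps the result comparable even if one configuration stops earlier than another.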

Aug 01 '24 03:08 lvhan028

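Quoted: "For a fair comparison, the engine must generate a specified number of tokens in every run. The way to make the engine do that is to set the generation config as @irexyc described above. Did you set it?"

I retested. With that setting, kv cache int8 and awq run at the same speed as the original model, but in actual prediction runs they still differ a lot. What could be causing this?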

Aug 01 '24 06:08 JiaXinLI98

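My GPU is an A800-80G. I found that running awq w4a16 after the fp16 model has already run is very slow, whereas running awq w4a16 directly is roughly twice as fast as fp16. awq outputs slightly fewer tokens. My measured TPS (tokens per second):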

fp16: 73.60
awq-w4a16: 127.39

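However, I still don't understand why running fp16 first and then awq makes awq slow. (I put them in two separate functions, and the awq model is only called after the fp16 call has finished, so in principle they should not interfere.)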
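One way to rule out interference between the two runs is to execute each benchmark in its own Python process, so that everything held by the first run is released before the second one starts. The sketch below is only an illustration: `run_isolated` is a hypothetical helper, and whether leftover state from the fp16 run is the actual cause is an assumption that this thread does not confirm.

```python
import subprocess
import sys


def run_isolated(snippet: str, timeout: float = 3600.0) -> str:
    """Run `snippet` in a fresh Python interpreter and return its stdout."""
    result = subprocess.run(
        [sys.executable, "-c", snippet],
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,  # raise CalledProcessError if the child exits with an error
    )
    return result.stdout.strip()


if __name__ == "__main__":
    # Placeholder snippets; each would import and call one benchmark function.
    for label, snippet in [("fp16", "print('fp16 run')"), ("awq", "print('awq run')")]:
        print(label, "->", run_isolated(snippet))
```

If awq is fast when isolated this way but slow when it follows fp16 in the same process, the slowdown comes from state shared within the process rather than from the quantized model itself.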

Aug 01 '24 11:08 Howe-Young

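Quoted: "My GPU is an A800-80G. I found that running awq w4a16 after the fp16 model has already run is very slow, whereas running awq w4a16 directly is roughly twice as fast as fp16. awq outputs slightly fewer tokens. My measured TPS (tokens per second):
fp16: 73.60
awq-w4a16: 127.39
However, I still don't understand why running fp16 first and then awq makes awq slow. (I put them in two separate functions, and the awq model is only called after the fp16 call has finished, so in principle they should not interfere.)"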

My GPU is also an A800-80G.

Aug 01 '24 11:08 JiaXinLI98

Has this issue been resolved?

Oct 18 '24 03:10 guozhiyao

Has this problem been solved?

Sep 12 '25 04:09 David1-git