
[Bug] internvl2-2b: after AWQ quantization, inference speed barely improves and accuracy even drops

Open · Howe-Young opened this issue 1 year ago · 4 comments

Checklist

  • [ ] 1. I have searched related issues but cannot get the expected help.
  • [ ] 2. The bug has not been fixed in the latest version.
  • [ ] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.

Describe the bug

After quantizing internvl2-2b with AWQ w4a16, inference speed shows essentially no improvement and accuracy even drops a bit. What could be the reason? Is inference not hitting the dedicated w4a16 kernel path implemented in LMDeploy?
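For context, a minimal sketch (paths are hypothetical placeholders, not from this issue) of how the AWQ output is typically loaded so that TurboMind selects the w4a16 kernels; model_format='awq' is the switch that matters here:

from lmdeploy import pipeline, TurbomindEngineConfig

# hypothetical path: the --work-dir produced by `lmdeploy lite auto_awq`
work_dir = '/path/to/internvl2-2b-awq'

# model_format='awq' tells the TurboMind backend to use the w4a16 kernels
pipe = pipeline(work_dir,
                backend_config=TurbomindEngineConfig(model_format='awq'))
print(pipe('Hello').text)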

Reproduction

Machine: NVIDIA L40, CUDA 12.1. Quantization command:

lmdeploy lite auto_awq \
  $HF_MODEL \
  --calib-samples 128 \
  --calib-seqlen 1024 \
  --w-bits 4 \
  --w-group-size 128 \
  --batch-size 8 \
  --search-scale True \
  --work-dir $WORK_DIR

Environment

lmdeploy 0.5.2.post1

Error traceback

No response

Howe-Young · Aug 14 '24 10:08

Quantization is applied only to the LLM, so speed measurements are best run against the LLM alone: https://github.com/InternLM/lmdeploy/tree/main/benchmark
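As a rough LLM-only check without the benchmark scripts, a minimal sketch along these lines (path and prompt are illustrative) times pure-text decoding, which never enters the vision encoder:

import time

from lmdeploy import GenerationConfig, TurbomindEngineConfig, pipeline

# hypothetical path to the AWQ work dir
pipe = pipeline('/path/to/internvl2-2b-awq',
                backend_config=TurbomindEngineConfig(model_format='awq'))

# force a fixed-length decode so the timing is comparable across runs
gen_cfg = GenerationConfig(max_new_tokens=512, ignore_eos=True)

t0 = time.perf_counter()
resp = pipe('Tell me a long story about quantization.', gen_config=gen_cfg)
dt = time.perf_counter() - t0

# generate_token_len: number of generated tokens reported by lmdeploy
print(f'{resp.generate_token_len / dt:.1f} decode tokens/s')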

irexyc · Aug 15 '24 08:08

> Quantization is applied only to the LLM, so speed measurements are best run against the LLM alone: https://github.com/InternLM/lmdeploy/tree/main/benchmark

Even if the vision module is not optimized, shouldn't optimizing the LLM still improve end-to-end performance?

Howe-Young · Aug 15 '24 15:08

The vision module's latency depends on the number of image patches. If the vision part accounts for a large share of total latency, then even a big improvement in the LLM part will not be very noticeable end to end.
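As a back-of-the-envelope illustration (numbers assumed, not measured): if the vision encoder takes a fraction p of end-to-end latency, an s-times faster LLM caps the overall speedup at 1 / (p + (1 - p) / s):

def overall_speedup(p_vision: float, s_llm: float) -> float:
    # Amdahl's law: only the (1 - p_vision) LLM fraction gets faster
    return 1.0 / (p_vision + (1.0 - p_vision) / s_llm)

# e.g. with vision at 50% of the time, even a 2x faster LLM
# yields only about a 1.33x end-to-end gain
print(f'{overall_speedup(0.5, 2.0):.2f}x')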

There is currently no standard for benchmarking VLMs, and different test methods may lead to different conclusions. You can refer to the script in this PR for testing: https://github.com/InternLM/lmdeploy/pull/1662

irexyc · Aug 15 '24 18:08

May I ask why inference speed also shows no improvement after smooth_quant (w8a8) quantization of llama2-7b?

lmdeploy lite smooth_quant /model/llama2-7b-hf/ --work-dir /model/lmdeploy/llama2-7b-w8/
python benchmark/profile_throughput.py /dataset/ShareGPT_V3_unfiltered_cleaned_split.json /model/llama2-7b-hf/ --backend pytorch
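Note that the profile_throughput.py call above passes the original /model/llama2-7b-hf/ rather than the smooth_quant work dir. A minimal sketch (prompt illustrative) of loading the w8a8 output with the PyTorch backend instead:

from lmdeploy import PytorchEngineConfig, pipeline

# the smooth_quant --work-dir from the command above, i.e. the w8a8 weights
pipe = pipeline('/model/lmdeploy/llama2-7b-w8/',
                backend_config=PytorchEngineConfig())
print(pipe('Hello, world').text)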

zxy1119 · Aug 16 '24 03:08