lmdeploy [Bug] After AWQ quantization of internvl2-2b, inference speed barely improves and accuracy drops

Checklist

[ ] 1. I have searched related issues but cannot get the expected help.
[ ] 2. The bug has not been fixed in the latest version.
[ ] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.

Describe the bug

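After quantizing internvl2-2b with AWQ (w4a16), inference speed barely improves and accuracy also drops. What is the cause? Is the model not hitting the w4a16 path that LMDeploy implements specifically?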

Reproduction

Machine: NVIDIA L40, CUDA 12.1

Quantization command:

lmdeploy lite auto_awq \
  $HF_MODEL \
  --calib-samples 128 \
  --calib-seqlen 1024 \
  --w-bits 4 \
  --w-group-size 128 \
  --batch-size 8 \
  --search-scale True \
  --work-dir $WORK_DIR

Environment

lmdeploy 0.5.2.post1

Error traceback

No response

Aug 14 '24 10:08 Howe-Young

Quantization applies only to the LLM, so speed is best measured on the LLM alone. https://github.com/InternLM/lmdeploy/tree/main/benchmark

Aug 15 '24 08:08 irexyc

Quantization applies only to the LLM, so speed is best measured on the LLM alone. https://github.com/InternLM/lmdeploy/tree/main/benchmark

Even if the vision module is not optimized, shouldn't end-to-end performance still improve once the LLM is optimized?

Aug 15 '24 15:08 Howe-Young

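The time spent in the vision part depends on the number of patches. If the vision part accounts for a large share of the total time, then however much the LLM part improves, the overall gain will not be very noticeable.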

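There is currently no standard for benchmarking VLMs, and different test methods may lead to different conclusions. You can test with the script in this PR: https://github.com/InternLM/lmdeploy/pull/1662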
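To illustrate the point about the vision share, here is a minimal sketch based on Amdahl's law. It is not a measurement of LMDeploy or internvl2-2b: the function name `end_to_end_speedup`, the vision fractions, and the 2x LLM speedup are all hypothetical values chosen for illustration.

```python
def end_to_end_speedup(vision_fraction: float, llm_speedup: float) -> float:
    """Overall speedup when only the LLM part of the pipeline gets faster.

    vision_fraction: share of the original end-to-end time spent in the
                     vision module (0.0 to 1.0); this part is not accelerated.
    llm_speedup:     factor by which the LLM part alone becomes faster.
    """
    if not 0.0 <= vision_fraction <= 1.0:
        raise ValueError("vision_fraction must be between 0 and 1")
    if llm_speedup <= 0:
        raise ValueError("llm_speedup must be positive")
    llm_fraction = 1.0 - vision_fraction
    # Amdahl's law: the unaccelerated share stays, the rest shrinks.
    return 1.0 / (vision_fraction + llm_fraction / llm_speedup)


if __name__ == "__main__":
    # Hypothetical numbers, for illustration only.
    assumed_llm_speedup = 2.0
    for vision_fraction in (0.1, 0.5, 0.8):
        overall = end_to_end_speedup(vision_fraction, assumed_llm_speedup)
        print(f"vision share {vision_fraction:.0%}: overall speedup {overall:.2f}x")
```

The larger the vision share, the closer the overall speedup gets to 1x, which is why timing the LLM part separately gives a clearer picture of what quantization changed.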

Aug 15 '24 18:08 irexyc

I also see no inference speedup for llama2-7b after SmoothQuant w8a8 quantization. What could be the reason? The commands I ran:

lmdeploy lite smooth_quant /model/llama2-7b-hf/ --work-dir /model/lmdeploy/llama2-7b-w8/

python benchmark/profile_throughput.py /dataset/ShareGPT_V3_unfiltered_cleaned_split.json /model/llama2-7b-hf/ --backend pytorch

Aug 16 '24 03:08 zxy1119