Speed comparison between GPTQ w4a16 and AWQ w4a16?
Hi, I am wondering which is faster: the GPTQ w4a16 implementation (exllama) or the AWQ w4a16 implementation (llm-awq)?
The mathematical computation seems similar between the two, so could they share the same CUDA kernel?
Looking forward to your reply, thank you.
Hi @frankxyy, vLLM does not support GPTQ at the moment. We are actively working on the support, so please stay tuned.
Regarding your question, this is my understanding: while performance depends heavily on the kernel implementation, AWQ is meant to be (slightly) faster than GPTQ when both are equally optimized. This is because GPTQ typically relies on group reordering, which makes its kernel logic and memory access pattern more complex. In contrast, AWQ does not use reordering. Except for the reordering, I believe the two methods behave the same at inference time.
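To make the reordering point concrete, here is a rough NumPy sketch of the per-channel dequantization a W4A16 kernel performs. The names (qweight, scales, zeros, g_idx) follow common GPTQ/AWQ checkpoint conventions, and qweight is assumed already unpacked to one 4-bit value per element; this only illustrates the memory-access difference, it is not vLLM's actual kernel code.

```python
import numpy as np

def dequant_no_reorder(qweight, scales, zeros, group_size):
    """Sketch of AWQ-style (or GPTQ without act-order) W4A16 dequant:
    input channel i always belongs to group i // group_size, so reads of
    scales/zeros are contiguous and easy to coalesce."""
    in_features, out_features = qweight.shape
    w = np.empty((in_features, out_features), dtype=np.float16)
    for i in range(in_features):
        g = i // group_size                      # contiguous group lookup
        w[i] = (qweight[i].astype(np.float16) - zeros[g]) * scales[g]
    return w

def dequant_act_order(qweight, scales, zeros, g_idx):
    """Sketch of GPTQ with act-order: channels were reordered during
    quantization, so each channel finds its group through the g_idx
    indirection table, which scatters the scale/zero accesses."""
    in_features, out_features = qweight.shape
    w = np.empty((in_features, out_features), dtype=np.float16)
    for i in range(in_features):
        g = g_idx[i]                             # indirect, possibly non-contiguous
        w[i] = (qweight[i].astype(np.float16) - zeros[g]) * scales[g]
    return w
```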
@WoosukKwon Got it! Thanks a lot for your detailed and clear explanation!
@WoosukKwon Hi, it seems that the act-order option of GPTQ only affects the quantization procedure and has no effect at inference time. Am I right?
I tried GPTQ and AWQ quantization of Mixtral-8x7B-Instruct-v0.1 and got quite different performance.
GPU: A40 48G VRAM
vLLM version: 0.2.6 (the latest version, 0.2.7, runs out of memory for GPTQ)
AWQ
https://huggingface.co/TheBloke/Mixtral-8x7B-Instruct-v0.1-AWQ
python benchmark_throughput.py \
--model Mixtral-8x7B-Instruct-v0.1-AWQ \
--backend vllm \
--input-len 128 \
--output-len 512 \
--quantization awq \
--num-prompts 50 \
--seed 1100 \
--dtype float16
Throughput: 0.51 requests/s, 327.98 tokens/s
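For anyone reproducing the AWQ number outside benchmark_throughput.py, a minimal sketch with vLLM's offline Python API would look roughly like this; the model path is a placeholder for a local copy of the AWQ checkpoint, and other engine settings are left at their defaults.

```python
from vllm import LLM, SamplingParams

# Placeholder path: point at a local copy of
# TheBloke/Mixtral-8x7B-Instruct-v0.1-AWQ.
llm = LLM(
    model="Mixtral-8x7B-Instruct-v0.1-AWQ",
    quantization="awq",
    dtype="float16",
    seed=1100,
)

params = SamplingParams(max_tokens=512, ignore_eos=True)
outputs = llm.generate(["Hello, my name is"] * 50, params)
print(f"Generated {sum(len(o.outputs[0].token_ids) for o in outputs)} tokens")
```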
GPTQ
https://huggingface.co/TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ, using the main revision ("4-bit, with Act Order. No group size, to lower VRAM requirements.")
python benchmark_throughput.py \
--model ~/autodl-tmp/arkohut/Mixtral-8x7B-Instruct-v0.1-GPTQ \
--backend vllm \
--input-len 128 \
--output-len 512 \
--quantization gptq \
--num-prompts 50 \
--seed 1100 \
--dtype float16
Throughput: 0.28 requests/s, 176.88 tokens/s
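That works out to roughly 327.98 / 176.88 ≈ 1.85x higher token throughput for AWQ than for GPTQ on this A40 setup.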
The AWQ kernel may be less optimized. Have you tried SqueezeLLM?
Sorry for the wrong info; in my test AWQ is much faster than GPTQ. I have already updated the message above.
I tried GPTQ, SqueezeLLM, and AWQ quantization of LLaMA-7B and got quite different performance.
GPU: A100 80G VRAM
vLLM version: latest
(Per-method throughput results for AWQ, GPTQ, SqueezeLLM, and fp16 were posted as images and are not reproduced here.)
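A rough end-to-end sketch of that three-way comparison is below; the checkpoint paths are placeholders, the quantization identifiers are assumptions mirroring the --quantization flags used above, and in practice each config is better run in a separate process so GPU memory is fully released between models.

```python
import time
from vllm import LLM, SamplingParams

# Placeholder checkpoint paths; quantization names are assumed, not verified.
configs = [
    ("awq", "llama-7b-awq"),
    ("gptq", "llama-7b-gptq"),
    ("squeezellm", "llama-7b-squeezellm"),
    (None, "llama-7b"),  # fp16 baseline
]

prompts = ["Hello, my name is"] * 50
params = SamplingParams(max_tokens=512, ignore_eos=True)

for quant, model in configs:
    llm = LLM(model=model, quantization=quant, dtype="float16")
    start = time.perf_counter()
    outputs = llm.generate(prompts, params)
    elapsed = time.perf_counter() - start
    tokens = sum(len(o.outputs[0].token_ids) for o in outputs)
    print(f"{quant or 'fp16'}: {tokens / elapsed:.2f} output tokens/s")
    del llm  # best effort; a fresh process per config is more reliable
```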
So maybe the MoE model behaves quite differently?
In @shiqingzhangCSU's bench, AWQ is also faster (though by a smaller margin, which might be understandable given it's a smaller model). I wonder why @shiqingzhangCSU sees worse throughput for shorter context lengths though; that's very strange. In my experience with 70B AWQ, latency and time-to-first-token start to take a nosedive after ~2500 context length.
I am following this PR with a lot of interest: https://github.com/vllm-project/vllm/pull/1508 -- it promises a speedup over fp16.
In my case, the shorter the context length, the higher the throughput. (The QPS figure was wrong info.)