Speed comparison between GPTQ w4a16 and AWQ w4a16?
Hi, I am wondering which is faster: the GPTQ w4a16 implementation (exllama) or the AWQ w4a16 implementation (llm-awq)?
The mathematical computation seems similar between the two, so could they share the same CUDA kernel?
Looking forward to your reply, thank you.
Hi @frankxyy, vLLM does not support GPTQ at the moment. We are actively working on the support, so please stay tuned.
Regarding your question, this is my understanding: while performance depends heavily on the kernel implementation, AWQ is meant to be (slightly) faster than GPTQ when both are equally optimized. This is because GPTQ typically relies on group reordering, which makes its kernel logic and memory access pattern more complex. In contrast, AWQ does not use reordering. Except for the reordering, I believe the two methods behave the same at inference time.
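To make the reordering point concrete, here is a rough NumPy sketch of the per-channel dequantization a W4A16 kernel performs. The names (qweight, scales, zeros, g_idx) follow common GPTQ/AWQ checkpoint conventions, and qweight is assumed already unpacked to one 4-bit value per element; this only illustrates the memory-access difference, it is not vLLM's actual kernel code.

```python
import numpy as np

def dequant_no_reorder(qweight, scales, zeros, group_size):
    """Sketch of AWQ-style (or GPTQ without act-order) W4A16 dequant:
    input channel i always belongs to group i // group_size, so reads of
    scales/zeros are contiguous and easy to coalesce."""
    in_features, out_features = qweight.shape
    w = np.empty((in_features, out_features), dtype=np.float16)
    for i in range(in_features):
        g = i // group_size                      # contiguous group lookup
        w[i] = (qweight[i].astype(np.float16) - zeros[g]) * scales[g]
    return w

def dequant_act_order(qweight, scales, zeros, g_idx):
    """Sketch of GPTQ with act-order: channels were reordered during
    quantization, so each channel finds its group through the g_idx
    indirection table, which scatters the scale/zero accesses."""
    in_features, out_features = qweight.shape
    w = np.empty((in_features, out_features), dtype=np.float16)
    for i in range(in_features):
        g = g_idx[i]                             # indirect, possibly non-contiguous
        w[i] = (qweight[i].astype(np.float16) - zeros[g]) * scales[g]
    return w
```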
@WoosukKwon Got it! Thanks a lot for your detailed and clear explanation!
@WoosukKwon Hi, it seems that the act-order option of GPTQ only affects the quantization procedure and has no effect at inference time. Am I right?
I tried GPTQ and AWQ quantization of Mixtral-8x7B-Instruct-v0.1 and got quite different performance.
GPU: A40 48G VRAM
vLLM version: 0.2.6 (the latest version, 0.2.7, runs out of memory for GPTQ)
AWQ
https://huggingface.co/TheBloke/Mixtral-8x7B-Instruct-v0.1-AWQ
python benchmark_throughput.py \
--model Mixtral-8x7B-Instruct-v0.1-AWQ \
--backend vllm \
--input-len 128 \
--output-len 512 \
--quantization awq \
--num-prompts 50 \
--seed 1100 \
--dtype float16
Throughput: 0.51 requests/s, 327.98 tokens/s
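For anyone reproducing the AWQ number outside benchmark_throughput.py, a minimal sketch with vLLM's offline Python API would look roughly like this; the model path is a placeholder for a local copy of the AWQ checkpoint, and other engine settings are left at their defaults.

```python
from vllm import LLM, SamplingParams

# Placeholder path: point at a local copy of
# TheBloke/Mixtral-8x7B-Instruct-v0.1-AWQ.
llm = LLM(
    model="Mixtral-8x7B-Instruct-v0.1-AWQ",
    quantization="awq",
    dtype="float16",
    seed=1100,
)

params = SamplingParams(max_tokens=512, ignore_eos=True)
outputs = llm.generate(["Hello, my name is"] * 50, params)
print(f"Generated {sum(len(o.outputs[0].token_ids) for o in outputs)} tokens")
```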
GPTQ
https://huggingface.co/TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ, using the main revision ("4-bit, with Act Order. No group size, to lower VRAM requirements.")
python benchmark_throughput.py \
--model ~/autodl-tmp/arkohut/Mixtral-8x7B-Instruct-v0.1-GPTQ \
--backend vllm \
--input-len 128 \
--output-len 512 \
--quantization gptq \
--num-prompts 50 \
--seed 1100 \
--dtype float16
Throughput: 0.28 requests/s, 176.88 tokens/s
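That works out to roughly 327.98 / 176.88 ≈ 1.85x higher token throughput for AWQ than for GPTQ on this A40 setup.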
The AWQ kernel may be less optimized. Have you tried SqueezeLLM?
Sorry for the wrong info; in my test AWQ is much faster than GPTQ. I have already updated the message above.
I tried GPTQ, SqueezeLLM, and AWQ quantization of LLaMA-7B and got quite different performance.
GPU: A100 80G VRAM
vLLM version: latest
(Per-method throughput results for AWQ, GPTQ, SqueezeLLM, and fp16 were posted as images and are not reproduced here.)
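A rough end-to-end sketch of that three-way comparison is below; the checkpoint paths are placeholders, the quantization identifiers are assumptions mirroring the --quantization flags used above, and in practice each config is better run in a separate process so GPU memory is fully released between models.

```python
import time
from vllm import LLM, SamplingParams

# Placeholder checkpoint paths; quantization names are assumed, not verified.
configs = [
    ("awq", "llama-7b-awq"),
    ("gptq", "llama-7b-gptq"),
    ("squeezellm", "llama-7b-squeezellm"),
    (None, "llama-7b"),  # fp16 baseline
]

prompts = ["Hello, my name is"] * 50
params = SamplingParams(max_tokens=512, ignore_eos=True)

for quant, model in configs:
    llm = LLM(model=model, quantization=quant, dtype="float16")
    start = time.perf_counter()
    outputs = llm.generate(prompts, params)
    elapsed = time.perf_counter() - start
    tokens = sum(len(o.outputs[0].token_ids) for o in outputs)
    print(f"{quant or 'fp16'}: {tokens / elapsed:.2f} output tokens/s")
    del llm  # best effort; a fresh process per config is more reliable
```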
So maybe the MoE model behaves quite differently?
In @shiqingzhangCSU's bench, AWQ is also faster (though by a smaller margin, which might be understandable given it's a smaller model). I wonder why @shiqingzhangCSU sees worse throughput for shorter context lengths though; that's very strange. In my experience with 70B AWQ, latency and time-to-first-token start to take a nosedive after ~2500 context length.
I am following this PR with a lot of interest: https://github.com/vllm-project/vllm/pull/1508 -- it promises a speedup over fp16.
In my case, the shorter the context length, the higher the throughput. (The QPS figure was wrong info.)