Speed between gptq w4a16 and awq w4a16?

Open frankxyy opened this issue 1 year ago • 4 comments

Hi, I am wondering about the GPTQ w4a16 (exllama) and AWQ w4a16 (llm-awq) implementations: which one is faster?

The mathematical computation seems similar between the two, so could they share the same CUDA kernel?

Hoping for your reply, thank you

frankxyy avatar Nov 30 '23 07:11 frankxyy

Hi @frankxyy, vLLM does not support GPTQ at the moment. We are actively working on the support, so please stay tuned.

Regarding your question, this is my understanding: while the performance highly depends on the kernel implementation, AWQ is meant to be (slightly) faster than GPTQ when both are equally optimized. This is because GPTQ typically relies on group reordering, which makes its kernel logic and memory access pattern more complex, whereas AWQ does not use reordering. Apart from the reordering, I believe the two methods behave the same at inference time.
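
To make the reordering point concrete, here is a minimal NumPy sketch (illustrative only, not vLLM's actual kernels; the names qweight, scales, zeros, and g_idx are just common conventions) of per-group int4 dequantization with and without an act-order style permutation:

import numpy as np

def dequant_contiguous_groups(qweight, scales, zeros, group_size):
    # AWQ-style layout: row r belongs to group r // group_size,
    # so scales/zeros are read in a simple, sequential pattern.
    groups = np.arange(qweight.shape[0]) // group_size
    return (qweight - zeros[groups]) * scales[groups]

def dequant_act_order(qweight, scales, zeros, g_idx):
    # GPTQ act-order style: every row carries its own group index (g_idx),
    # adding an extra gather with a scattered memory access pattern.
    return (qweight - zeros[g_idx]) * scales[g_idx]

# Toy shapes: 8 rows, 2 output columns, 2 groups of size 4.
rng = np.random.default_rng(0)
qweight = rng.integers(0, 16, size=(8, 2)).astype(np.float32)  # unpacked int4 values
scales = rng.random((2, 2)).astype(np.float32)                 # one scale row per group
zeros = np.full((2, 2), 8.0, dtype=np.float32)                 # one zero-point row per group
g_idx = rng.permutation(np.repeat(np.arange(2), 4))            # permuted group assignment

w_a = dequant_contiguous_groups(qweight, scales, zeros, group_size=4)
w_g = dequant_act_order(qweight, scales, zeros, g_idx)
print(w_a.shape, w_g.shape)  # both (8, 2)

The only mathematical difference in this sketch is the indexing; the extra gather is what complicates the kernel and its memory accesses.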

WoosukKwon avatar Nov 30 '23 07:11 WoosukKwon

@WoosukKwon Got it! Thank you very much for your detailed and clear explanation!

frankxyy avatar Nov 30 '23 08:11 frankxyy

@WoosukKwon Hi, it seems that GPTQ's act-order only affects the quantization process and has no effect on inference time. Am I right?

frankxyy avatar Dec 03 '23 04:12 frankxyy

I tried GPTQ and AWQ quantization of Mixtral-8x7B-Instruct-v0.1 and got quite different performance.

GPU: A40, 48 GB VRAM

vLLM version: 0.2.6 (the latest version, 0.2.7, runs out of memory with GPTQ)

AWQ

https://huggingface.co/TheBloke/Mixtral-8x7B-Instruct-v0.1-AWQ

python benchmark_throughput.py \
        --model Mixtral-8x7B-Instruct-v0.1-AWQ \
        --backend vllm \
        --input-len 128 \
        --output-len 512 \
        --quantization awq \
        --num-prompts 50 \
        --seed 1100 \
        --dtype float16
Throughput: 0.51 requests/s, 327.98 tokens/s

GPTQ

https://huggingface.co/TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ, using the main revision ("4-bit, with Act Order. No group size, to lower VRAM requirements.")

python benchmark_throughput.py \
        --model ~/autodl-tmp/arkohut/Mixtral-8x7B-Instruct-v0.1-GPTQ \
        --backend vllm \
        --input-len 128 \
        --output-len 512 \
        --quantization gptq \
        --num-prompts 50 \
        --seed 1100 \
        --dtype float16
Throughput: 0.28 requests/s, 176.88 tokens/s
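
For reference, roughly the same comparison can be scripted with vLLM's offline Python API. This is only a minimal sketch (the model paths, prompt, and sampling settings below are placeholders, and it assumes a vLLM build with both AWQ and GPTQ support):

from vllm import LLM, SamplingParams

def run(model_path, quant):
    # In practice, run each engine in its own process so GPU memory is freed in between.
    llm = LLM(model=model_path, quantization=quant, dtype="float16")
    params = SamplingParams(max_tokens=512)   # mirrors --output-len 512
    prompts = ["Hello, my name is"] * 50      # mirrors --num-prompts 50
    return llm.generate(prompts, params)

# Swap in the AWQ or GPTQ checkpoint path (placeholders):
outputs = run("Mixtral-8x7B-Instruct-v0.1-AWQ", "awq")
# outputs = run("Mixtral-8x7B-Instruct-v0.1-GPTQ", "gptq")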

arkohut avatar Jan 08 '24 05:01 arkohut

The AWQ kernel may be less optimized. Have you tried SqueezeLLM?
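
If it helps, SqueezeLLM can be benchmarked the same way, assuming a vLLM build with SqueezeLLM support and a SqueezeLLM-format checkpoint (the model path below is a placeholder):

from vllm import LLM, SamplingParams

# "squeezellm" is passed the same way as "awq"/"gptq"; the checkpoint must be in SqueezeLLM format.
llm = LLM(model="path/to/llama-7b-squeezellm", quantization="squeezellm", dtype="float16")
outputs = llm.generate(["Hello, my name is"] * 50, SamplingParams(max_tokens=512))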

shiqingzhangCSU avatar Jan 09 '24 11:01 shiqingzhangCSU

Sorry for the wrong info; in my test AWQ is much faster than GPTQ. I have already updated the message.

arkohut avatar Jan 10 '24 06:01 arkohut

I tried GPTQ, SqueezeLLM, and AWQ quantization of Llama-7B and got quite different performance.

GPU: A100, 80 GB VRAM

vLLM version: latest version

AWQ: (benchmark screenshot)

GPTQ: (benchmark screenshot)

SqueezeLLM: (benchmark screenshot)

FP16: (benchmark screenshot)

shiqingzhangCSU avatar Jan 10 '24 08:01 shiqingzhangCSU

So maybe the MoE model behaves quite differently?

arkohut avatar Jan 10 '24 17:01 arkohut

In @shiqingzhangCSU's benchmark AWQ is also faster (though by a smaller margin, which might be understandable given it's a smaller model). I wonder why @shiqingzhangCSU sees worse throughput at shorter context lengths though; that's very strange. In my experience with 70B AWQ, latency and time-to-first-token start to take a nosedive after ~2500 tokens of context.

I am following this PR with a lot of interest: https://github.com/vllm-project/vllm/pull/1508 -- it promises a speedup over fp16.

Palmik avatar Jan 15 '24 08:01 Palmik

In my case, the shorter the context length, the higher the throughput. (The QPS number in the screenshot is wrong.)

shiqingzhangCSU avatar Jan 15 '24 09:01 shiqingzhangCSU