
[Feature]: GPTQ/AWQ quantization is not fully optimized yet. The speed can be slower than non-quantized models.

Open ShubhamVerma16 opened this issue 10 months ago • 9 comments

🚀 The feature, motivation and pitch

While running the vLLM server with a quantized model and specifying the quantization type, the warning below is shown:

WARNING 04-25 12:26:07 config.py:169] gptq quantization is not fully optimized yet. The speed can be slower than non-quantized models.

Is this a feature in progress, or is there a workaround to handle it? Let me know if any more details are required from my end.
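For reference, a minimal sketch of a setup that reproduces the warning with the offline LLM API (the model id is just a placeholder; the same warning appears when the server is launched with an explicit quantization type):

```python
from vllm import LLM, SamplingParams

# Explicitly forcing the plain GPTQ kernels is what triggers the
# "gptq quantization is not fully optimized yet" warning at load time.
# Placeholder model id; any GPTQ-quantized checkpoint behaves the same.
llm = LLM(
    model="TheBloke/Llama-2-7B-Chat-GPTQ",
    quantization="gptq",
)

outputs = llm.generate(
    ["What does the GPTQ warning mean?"],
    SamplingParams(temperature=0.0, max_tokens=64),
)
print(outputs[0].outputs[0].text)
```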

Alternatives

No response

Additional context

No response

ShubhamVerma16 avatar Apr 25 '24 07:04 ShubhamVerma16

You can use the Marlin kernels for int4 inference

We have a PR to automatically support GPTQ models with Marlin. Should be merged imminently

https://github.com/vllm-project/vllm/pull/3922
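Until that PR lands, one way to get the Marlin kernels is to load a checkpoint that is already serialized in the Marlin format. A sketch, assuming a hypothetical Marlin-format repo id, a GPU recent enough for the Marlin kernels, and a vLLM build that exposes the "marlin" quantization method:

```python
from vllm import LLM

# A checkpoint already exported in the Marlin format can be loaded directly
# with the "marlin" quantization method; once the linked PR is merged, plain
# GPTQ checkpoints get converted to the Marlin kernels automatically.
llm = LLM(
    model="neuralmagic/llama-2-7b-chat-marlin",  # hypothetical Marlin-format repo id
    quantization="marlin",
)

print(llm.generate(["Hello, Marlin!"])[0].outputs[0].text)
```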

robertgshaw2-redhat avatar Apr 25 '24 19:04 robertgshaw2-redhat

Would this have support for 8-bit quants as well, or just 4-bit?

davidgxue avatar Apr 25 '24 19:04 davidgxue

Right now, 4-bit only. But @alexm-nm is working on an 8-bit version of Marlin at the moment, and it should be done relatively soon.

Marlin supports act_order=True and grouping as well
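For anyone checking whether their checkpoint actually uses these settings, a rough sketch that reads them from the Hugging Face config (the repo id is a placeholder, and the exact location of the keys can vary between checkpoints):

```python
import json
from huggingface_hub import hf_hub_download

# Placeholder repo id; substitute your own GPTQ checkpoint.
repo_id = "TheBloke/Llama-2-7B-Chat-GPTQ"

# Most GPTQ checkpoints embed their quantization settings in config.json.
config_path = hf_hub_download(repo_id, "config.json")
with open(config_path) as f:
    quant_cfg = json.load(f).get("quantization_config", {})

# desc_act corresponds to act_order; a group_size of -1 means no grouping.
print("bits:      ", quant_cfg.get("bits"))
print("group_size:", quant_cfg.get("group_size"))
print("desc_act:  ", quant_cfg.get("desc_act"))
```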

robertgshaw2-redhat avatar Apr 25 '24 19:04 robertgshaw2-redhat

@davidgxue We have initial correctness on 8-bit Marlin; we will do some perf checks and more testing, and will put up a PR in a couple of days.

alexm-redhat avatar Apr 25 '24 20:04 alexm-redhat

Btw, the new 8-bit Marlin will support all group sizes and act_order.

alexm-redhat avatar Apr 25 '24 20:04 alexm-redhat

Awesome!! Thank you guys for the hard work!

davidgxue avatar Apr 26 '24 00:04 davidgxue

@davidgxue 8-bit support is added here: https://github.com/vllm-project/vllm/pull/4533

alexm-redhat avatar May 01 '24 17:05 alexm-redhat

Thank you!!

davidgxue avatar May 01 '24 17:05 davidgxue

When can we expect AWQ models to be optimized for inference?

vidhyat98 avatar May 02 '24 18:05 vidhyat98

+1 :) (thank you for your work btw)

jugodfroy avatar Jun 03 '24 15:06 jugodfroy

@vidhyat98 AWQ support has been added to Marlin.

alexm-redhat avatar Jul 22 '24 00:07 alexm-redhat

Resolved! :)

mgoin avatar Jul 25 '24 19:07 mgoin

So it looks like we should pass quantization='awq_marlin' for AWQ-quantized models?

alexdauenhauer avatar Jul 25 '24 21:07 alexdauenhauer

> So it looks like we should pass quantization='awq_marlin' for AWQ-quantized models?

@alexdauenhauer You don't need to pass any quantization argument; in fact, it is best if you don't! vLLM will automatically choose the best kernel it can use for your quantized model.

mgoin avatar Jul 25 '24 21:07 mgoin
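In practice that means a GPTQ or AWQ checkpoint can be loaded with no quantization argument at all. A minimal sketch, assuming a placeholder AWQ model id:

```python
from vllm import LLM, SamplingParams

# No quantization argument: vLLM reads the checkpoint's quantization config
# and picks the fastest compatible kernel it can (e.g. awq_marlin here, or
# gptq_marlin for GPTQ models), falling back to the reference kernels when
# Marlin is not supported on the GPU.
llm = LLM(model="TheBloke/Mistral-7B-Instruct-v0.2-AWQ")  # placeholder AWQ model

out = llm.generate(
    ["Summarize AWQ in one sentence."],
    SamplingParams(max_tokens=48),
)
print(out[0].outputs[0].text)
```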

@mgoin Great to know, thanks!

alexdauenhauer avatar Jul 25 '24 22:07 alexdauenhauer