
`8-bit quantization` support

Open beratcmn opened this issue 1 year ago • 6 comments

As far as I know, vLLM and Ray don't support 8-bit quantization as of now. I think it's the most viable quantization technique out there, and it should be implemented for faster inference and reduced memory usage.

beratcmn avatar Jun 22 '23 23:06 beratcmn

This much-needed feature would enable a 13B model to fit on a single GPU with 24 GB of VRAM.

PenutChen avatar Jun 27 '23 01:06 PenutChen
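For context on the VRAM claim above, here is a back-of-the-envelope sketch of weight memory only (it ignores KV cache and activation overhead, which also consume VRAM at inference time):

```python
# Rough weight-memory estimate for a 13B-parameter model.
params = 13e9

fp16_gib = params * 2 / 1024**3  # ~24.2 GiB: weights alone nearly fill a 24 GB card
int8_gib = params * 1 / 1024**3  # ~12.1 GiB: leaves headroom for KV cache and activations

print(f"fp16 weights: {fp16_gib:.1f} GiB, int8 weights: {int8_gib:.1f} GiB")
```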

Please consider 4-bit support as well. The new bitsandbytes library supports both 8- and 4-bit quantization.

gururise avatar Jul 07 '23 15:07 gururise
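For reference, the `load_in_8bit`/`load_in_4bit` path discussed in this thread is the Hugging Face Transformers + bitsandbytes API (on-the-fly quantization at load time), not a vLLM feature. A minimal sketch, where the checkpoint name is only an example:

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 8-bit on-the-fly quantization at load time via bitsandbytes (LLM.int8()).
bnb_config = BitsAndBytesConfig(load_in_8bit=True)
# For 4-bit instead: BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4")

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-13b-hf",     # example checkpoint, not specific to this issue
    quantization_config=bnb_config,  # weights are quantized as they are loaded
    device_map="auto",
)
```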

Just that 8-bit is faster than 4-bit, and vLLM is about speed. It would be really nice to at least get 8-bit.

ehartford avatar Sep 16 '23 02:09 ehartford

Just that 8-bit is faster than 4-bit, and vLLM is about speed. It would be really nice to at least get 8-bit.

Why is 8-bit faster than 4-bit?

FocusLiwen avatar Sep 22 '23 04:09 FocusLiwen

8-bit is a good trade-off between speed and accuracy in practice, so 8-bit support would be a valuable addition to vLLM.

wenmengzhou avatar Oct 16 '23 06:10 wenmengzhou

Would love to see this happening soon.

hikmet-demir avatar Jan 04 '24 22:01 hikmet-demir

Must-have feature. It would enable running Mixtral decently on a single GPU 💯

AntoninLeroy avatar Jan 15 '24 13:01 AntoninLeroy

Are you wanting `load_in_8bit` from HF, or would you consider the AWQ/GPTQ support sufficient?

hmellor avatar Mar 20 '24 12:03 hmellor
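For readers landing on this issue: the AWQ/GPTQ path referred to here works with pre-quantized checkpoints through vLLM's `quantization` argument. A minimal sketch, assuming an AWQ checkpoint from the Hub (the model name is an example, not an endorsement):

```python
from vllm import LLM, SamplingParams

# Load a pre-quantized AWQ checkpoint with vLLM's offline LLM API.
llm = LLM(model="TheBloke/Llama-2-13B-chat-AWQ", quantization="awq")

outputs = llm.generate(
    ["What does 8-bit quantization buy you?"],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```

The same pattern applies to GPTQ checkpoints with `quantization="gptq"`.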

Are you wanting `load_in_8bit` from HF, or would you consider the AWQ/GPTQ support sufficient?

@hmellor Since AWQ is becoming more popular and GPTQ is supported in vLLM, I think it's sufficient for production use. Introducing an on-the-fly quantization method, like bitsandbytes or quanto, would be more user-friendly for research purposes.

PenutChen avatar Mar 21 '24 00:03 PenutChen

Are you wanting `load_in_8bit` from HF, or would you consider the AWQ/GPTQ support sufficient?

@hmellor cloud compute costs add up when quantizing models to AWQ or GPTQ, so having an "on the go" quantization method would be incredible.

beratcmn avatar Mar 21 '24 21:03 beratcmn

I know this won't cover all situations, but you could use models that have already been quantised and uploaded to Hugging Face (e.g. the almost 4,000 quantised checkpoints uploaded by TheBloke: https://huggingface.co/TheBloke).

hmellor avatar Mar 21 '24 22:03 hmellor

@hmellor do models quantized with BnB and uploaded to the Hub work with vLLM?

beratcmn avatar Mar 22 '24 19:03 beratcmn

The currently supported quantization schemes are GPTQ, AWQ, and SqueezeLLM.

hmellor avatar Mar 22 '24 19:03 hmellor

Since 8-bit quantisation is already supported and all that's left unresolved for this issue is bitsandbytes, I'm going to close this issue in favour of #4033

hmellor avatar Apr 18 '24 13:04 hmellor