
`8-bit quantization` support

Open beratcmn opened this issue 1 year ago • 6 comments

As far as I know, vLLM and Ray don't support 8-bit quantization as of now. I think it's the most viable quantization technique out there, and it should be implemented for faster inference and reduced memory usage.

beratcmn avatar Jun 22 '23 23:06 beratcmn

This much-needed feature would enable a 13B model to fit on a single GPU with 24 GB of VRAM.

PenutChen avatar Jun 27 '23 01:06 PenutChen
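For context on the VRAM claim above, here is a back-of-the-envelope sketch of weight memory only (it ignores KV cache and activation overhead, which also consume VRAM at inference time):

```python
# Rough weight-memory estimate for a 13B-parameter model.
params = 13e9

fp16_gib = params * 2 / 1024**3  # ~24.2 GiB: weights alone nearly fill a 24 GB card
int8_gib = params * 1 / 1024**3  # ~12.1 GiB: leaves headroom for KV cache and activations

print(f"fp16 weights: {fp16_gib:.1f} GiB, int8 weights: {int8_gib:.1f} GiB")
```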

Please consider 4-bit support as well. The new bitsandbytes library supports both 8- and 4-bit quantization.

gururise avatar Jul 07 '23 15:07 gururise
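For reference, the `load_in_8bit`/`load_in_4bit` path discussed in this thread is the Hugging Face Transformers + bitsandbytes API (on-the-fly quantization at load time), not a vLLM feature. A minimal sketch, where the checkpoint name is only an example:

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 8-bit on-the-fly quantization at load time via bitsandbytes (LLM.int8()).
bnb_config = BitsAndBytesConfig(load_in_8bit=True)
# For 4-bit instead: BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4")

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-13b-hf",     # example checkpoint, not specific to this issue
    quantization_config=bnb_config,  # weights are quantized as they are loaded
    device_map="auto",
)
```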

Just that 8-bit is faster than 4-bit, and vLLM is about speed. It would be really nice to at least get 8-bit.

ehartford avatar Sep 16 '23 02:09 ehartford

Just that 8-bit is faster than 4-bit, and vLLM is about speed. It would be really nice to at least get 8-bit.

Why is 8-bit faster than 4-bit?

FocusLiwen avatar Sep 22 '23 04:09 FocusLiwen

8-bit is a good trade-off between speed and accuracy in practice, so 8-bit support would be a valuable addition to vLLM.

wenmengzhou avatar Oct 16 '23 06:10 wenmengzhou

Would love to see this happening soon.

hikmet-demir avatar Jan 04 '24 22:01 hikmet-demir

Must-have feature. It would enable running Mixtral decently on a single GPU 💯

AntoninLeroy avatar Jan 15 '24 13:01 AntoninLeroy

Are you wanting `load_in_8bit` from HF, or would you consider the AWQ/GPTQ support sufficient?

hmellor avatar Mar 20 '24 12:03 hmellor
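For readers landing on this issue: the AWQ/GPTQ path referred to here works with pre-quantized checkpoints through vLLM's `quantization` argument. A minimal sketch, assuming an AWQ checkpoint from the Hub (the model name is an example, not an endorsement):

```python
from vllm import LLM, SamplingParams

# Load a pre-quantized AWQ checkpoint with vLLM's offline LLM API.
llm = LLM(model="TheBloke/Llama-2-13B-chat-AWQ", quantization="awq")

outputs = llm.generate(
    ["What does 8-bit quantization buy you?"],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```

The same pattern applies to GPTQ checkpoints with `quantization="gptq"`.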

Are you wanting `load_in_8bit` from HF, or would you consider the AWQ/GPTQ support sufficient?

@hmellor Since AWQ is becoming more popular and GPTQ is supported in vLLM, I think it's sufficient for production use. Introducing an on-the-fly quantization method, like bitsandbytes or quanto, would be more user-friendly for research purposes.

PenutChen avatar Mar 21 '24 00:03 PenutChen

Are you wanting `load_in_8bit` from HF, or would you consider the AWQ/GPTQ support sufficient?

@hmellor cloud compute costs add up when quantizing models to AWQ or GPTQ, so having an "on the go" quantization method would be incredible.

beratcmn avatar Mar 21 '24 21:03 beratcmn

I know this won't cover all situations, but you could use models that have already been quantised and uploaded to Hugging Face (e.g. the almost 4,000 quantised checkpoints uploaded by TheBloke: https://huggingface.co/TheBloke).

hmellor avatar Mar 21 '24 22:03 hmellor

@hmellor do models quantized with BnB and uploaded to the Hub work with vLLM?

beratcmn avatar Mar 22 '24 19:03 beratcmn

The currently supported quantization schemes are GPTQ, AWQ, and SqueezeLLM.

hmellor avatar Mar 22 '24 19:03 hmellor

Since 8-bit quantisation is already supported and all that's left unresolved for this issue is bitsandbytes, I'm going to close this issue in favour of #4033

hmellor avatar Apr 18 '24 13:04 hmellor