`8-bit quantization` support
As far as I know, vLLM and Ray don't support 8-bit quantization as of now. I think it's the most viable quantization technique out there and should be implemented for faster inference and reduced memory usage.
This much-needed feature would enable a 13B model to fit on a single 24 GB GPU: at fp16, 13B parameters take roughly 26 GB for the weights alone, while at int8 they take roughly 13 GB, leaving headroom for activations and the KV cache.
Please consider 4-bit support as well. The new bitsandbytes library supports both 8-bit and 4-bit quantization.
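For reference, here's a minimal sketch of what that on-the-fly path looks like today through Transformers with bitsandbytes (this is the HF path, not a vLLM API; the checkpoint name is only an example):

```python
# Sketch: loading a 13B model in 8-bit on the fly with bitsandbytes
# via Hugging Face Transformers. The model name is illustrative.
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-2-13b-hf"  # example 13B checkpoint
quant_config = BitsAndBytesConfig(load_in_8bit=True)  # or load_in_4bit=True

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,  # weights quantized at load time
    device_map="auto",                 # spread across available GPUs
)
```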
It's just that 8-bit is faster than 4-bit, and vLLM is about speed. It would be really nice to at least get 8-bit.
Why is 8-bit faster than 4-bit?
8-bit is a good trade-off between speed and accuracy in practice, so 8-bit support is strongly needed in vLLM.
Would love to see this happening soon.
Must-have feature. It will enable using Mixtral decently on a single GPU 💯
Are you wanting `load_in_8bit` from HF, or would you consider the AWQ/GPTQ support sufficient?
> Are you wanting `load_in_8bit` from HF, or would you consider the AWQ/GPTQ support sufficient?
@hmellor Since AWQ is becoming more popular and GPTQ is supported in vLLM, I think it's sufficient for production use. Introducing an on-the-fly quantization method, like bitsandbytes or quanto, would be more user-friendly for research purposes.
> Are you wanting `load_in_8bit` from HF, or would you consider the AWQ/GPTQ support sufficient?
@hmellor Cloud compute costs add up when quantizing models to AWQ and GPTQ, so having an "on the go" quantization method would be incredible.
I know this won't cover all situations, but you could use models that have already been quantised and uploaded to Hugging Face (e.g. the almost 4,000 quantised checkpoints uploaded by TheBloke: https://huggingface.co/TheBloke).
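For example, a pre-quantised AWQ checkpoint from the Hub can be served with vLLM directly; a minimal sketch, with an illustrative checkpoint name:

```python
# Sketch: serving a pre-quantised AWQ checkpoint with vLLM.
# Any AWQ-quantised model from the Hub should work the same way.
from vllm import LLM, SamplingParams

llm = LLM(model="TheBloke/Llama-2-13B-AWQ", quantization="awq")
sampling = SamplingParams(temperature=0.8, max_tokens=64)
outputs = llm.generate(["What does 8-bit quantization buy you?"], sampling)
print(outputs[0].outputs[0].text)
```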
@hmellor Do models quantized with BnB and uploaded to the Hub work with vLLM?
The currently supported quantization schemes are GPTQ, AWQ, and SqueezeLLM.
Since 8-bit quantisation is already supported and all that's left unresolved for this issue is bitsandbytes, I'm going to close this issue in favour of #4033.