text-generation-inference
Support for 4bit quantization
Feature request
It seems we now have support for loading models using 4-bit quantization starting from bitsandbytes>=0.39.0. Link: FP4 Quantization
Motivation
Running really large language models on smaller GPUs.
Your contribution
The plan should be to upgrade the bitsandbytes package and provide an ENV variable to control which quantization method is used when running the server.
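For reference, a minimal sketch of what load-time 4-bit loading looks like with bitsandbytes>=0.39.0 through the transformers integration. The model name and the `QUANTIZE` environment variable are placeholders, not the actual text-generation-inference configuration:

```python
import os

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Hypothetical env variable to pick the quantization method (not the real TGI flag).
quantize = os.environ.get("QUANTIZE", "bitsandbytes-fp4")

# 4-bit quantization config (bitsandbytes >= 0.39.0, transformers >= 4.30).
# quant_type can be "fp4" or "nf4"; compute_dtype is the dtype used for matmuls.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="fp4",
    bnb_4bit_compute_dtype=torch.float16,
)

model_id = "bigscience/bloom-560m"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",
)
```

The key point is that quantization happens while the weights are loaded, so no separate conversion step or quantized checkpoint is needed.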
We're implementing GPTQ (https://github.com/huggingface/text-generation-inference/pull/438), which to the best of my knowledge has better latency than bitsandbytes.
bitsandbytes 4-bit quantization can be applied at load time, without the need to explicitly convert the model to another format. I am not sure if GPTQ does the same.
No, GPTQ requires calibration data, but the final latency is the key thing we're after. And bitsandbytes 8-bit is really slow; I'm not sure about the 4-bit, but I'd imagine it's the same.
https://twitter.com/Tim_Dettmers/status/1676352492190433280?s=20
4-bit will be 6.8x faster than before, so maybe we can discuss again whether it is worth replacing the 8-bit with 4-bit once the new version is released.
https://twitter.com/Tim_Dettmers/status/1677826353457143808
Release is tomorrow, and it's faster than 16-bit. @Narsil, open to discussing again?
We'll definitely bench it.
We did add PagedAttention because it provided a lot of benefit.
The layers file has linear8bit, but I can't find where it is used. Using search, I only find the load_in_8bit param of the from_pretrained function. Would it be more work than changing the from_pretrained param to 4-bit?
https://github.com/huggingface/text-generation-inference/blob/main/server/text_generation_server/utils/layers.py#L133-L164
All the code is there indeed.
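For illustration, here is a rough sketch of what a 4-bit analogue of that 8-bit wrapper could look like, using bnb.nn.Linear4bit. This is an assumption about how it might be done, not the actual text-generation-inference code path, and the function name and defaults are made up:

```python
import bitsandbytes as bnb
import torch
from torch import nn


def replace_with_linear4bit(linear: nn.Linear, quant_type: str = "fp4") -> nn.Module:
    """Wrap an existing fp16 nn.Linear in a bitsandbytes 4-bit layer.

    Sketch only: mirrors the Linear8bitLt wrapping in layers.py in spirit,
    but is not the real implementation.
    """
    layer = bnb.nn.Linear4bit(
        linear.in_features,
        linear.out_features,
        bias=linear.bias is not None,
        compute_dtype=torch.float16,
        quant_type=quant_type,  # "fp4" or "nf4"
    )
    # Copying the weight into Params4bit means it gets quantized when the
    # parameter is moved to the GPU.
    layer.weight = bnb.nn.Params4bit(
        linear.weight.data, requires_grad=False, quant_type=quant_type
    )
    if linear.bias is not None:
        layer.bias = nn.Parameter(linear.bias.data)
    return layer.cuda()
```

So the change would plausibly be contained to the linear-layer wrapping, plus plumbing a new value for the quantize option through the server.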
Is the reason bitsandbytes 8-bit is slower than even default fp16 the flash attention kernels?
No. bitsandbytes is slow because it does more computation afaik.