
Support for 4bit quantization

Open rahuldshetty opened this issue 1 year ago • 11 comments

Feature request

It seems bitsandbytes>=0.39.0 now supports loading models with 4-bit quantization. Link: FP4 Quantization

Motivation

Running really large language models on smaller GPUs.

Your contribution

The plan would be to upgrade the bitsandbytes package and provide an ENV variable to control which quantization method is used when running the server.
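A minimal sketch of the idea, assuming transformers' `BitsAndBytesConfig` (transformers>=4.30) on top of bitsandbytes>=0.39. The `QUANTIZE` variable name and this loading path are illustrative only, not the actual TGI launcher interface:

```python
import os

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Hypothetical env var: "none", "bitsandbytes" (8-bit) or "bitsandbytes-nf4" (4-bit).
quantize = os.environ.get("QUANTIZE", "none")

kwargs = {"torch_dtype": torch.float16, "device_map": "auto"}
if quantize == "bitsandbytes":
    kwargs["load_in_8bit"] = True
elif quantize == "bitsandbytes-nf4":
    kwargs["quantization_config"] = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",            # FP4 is also available via "fp4"
        bnb_4bit_compute_dtype=torch.float16,
    )

# Quantization happens on the fly while the checkpoint is loaded.
model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-560m", **kwargs)
```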

rahuldshetty avatar Jun 13 '23 15:06 rahuldshetty

We're implementing GPTQ (https://github.com/huggingface/text-generation-inference/pull/438), which to the best of my knowledge has better latency than bitsandbytes.

Narsil avatar Jun 14 '23 07:06 Narsil

bitsandbytes 4-bit quantization can be applied at load time, without needing to explicitly convert the model to another format. I am not sure whether GPTQ does the same.

captainst avatar Jun 18 '23 14:06 captainst

Nope, GPTQ requires calibration data, but the final latency is the key thing we're after. And bitsandbytes 8-bit is really slow; I'm not sure about the 4-bit, but I'd imagine it's the same.
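To illustrate the difference, here is a rough sketch of a GPTQ conversion using the auto-gptq package (names follow its API as I understand it). Unlike bitsandbytes, it needs a one-off calibration pass over sample data and produces a converted checkpoint:

```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

model_id = "facebook/opt-125m"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Calibration samples; real conversions use a few hundred representative texts.
examples = [
    tokenizer("text-generation-inference serves large language models.", return_tensors="pt")
]

quantize_config = BaseQuantizeConfig(bits=4, group_size=128, desc_act=False)
model = AutoGPTQForCausalLM.from_pretrained(model_id, quantize_config)

model.quantize(examples)                    # one-off calibration pass over the data
model.save_quantized("opt-125m-gptq-4bit")  # converted weights are saved for serving
```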

Narsil avatar Jun 19 '23 08:06 Narsil

https://twitter.com/Tim_Dettmers/status/1676352492190433280?s=20

4-bit will be 6.8x faster than before, so maybe we can discuss again whether it is worth replacing the 8-bit path with 4-bit once the new version is released.

flozi00 avatar Jul 05 '23 04:07 flozi00

https://twitter.com/Tim_Dettmers/status/1677826353457143808

The release is tomorrow, and it's faster than 16-bit. @Narsil, open to discussing this again?

flozi00 avatar Jul 09 '23 20:07 flozi00

We'll definitely bench it.

We did add PagedAttention because it did provide a lot of benefit.

Narsil avatar Jul 10 '23 07:07 Narsil

The layers file has a Linear8bitLt wrapper, but I can't find where it is used. Searching, I only find the `load_in_8bit` param of the `from_pretrained` function. Would it be much more work than changing the `from_pretrained` param to 4-bit?

flozi00 avatar Jul 10 '23 08:07 flozi00

https://github.com/huggingface/text-generation-inference/blob/main/server/text_generation_server/utils/layers.py#L133-L164

All the code is there indeed.
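For a 4-bit analogue, here is a rough sketch of what a wrapper could look like, assuming the bitsandbytes>=0.39 `Linear4bit`/`Params4bit` API; this is illustrative, not the actual layers.py implementation:

```python
from typing import Optional

import torch
from bitsandbytes.nn import Linear4bit, Params4bit


def wrap_linear_4bit(
    weight: torch.Tensor, bias: Optional[torch.Tensor], quant_type: str = "nf4"
) -> Linear4bit:
    """Hypothetical helper: wrap existing fp16 weights in a 4-bit linear layer."""
    out_features, in_features = weight.shape
    linear = Linear4bit(
        in_features,
        out_features,
        bias=bias is not None,
        compute_dtype=torch.float16,
        quant_type=quant_type,
    )
    linear.weight = Params4bit(weight.data, requires_grad=False, quant_type=quant_type)
    if bias is not None:
        linear.bias = torch.nn.Parameter(bias)
    # bitsandbytes performs the actual 4-bit quantization when the module is
    # moved to the GPU.
    return linear.cuda()
```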

Narsil avatar Jul 10 '23 08:07 Narsil

Is the reason bitsandbytes 8-bit is slower than even default fp16 related to the flash attention kernels?

ipoletaev avatar Jul 26 '23 04:07 ipoletaev

No. bitsandbytes is slow because it does more computation afaik.
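Roughly: on top of the matmul, the 8-bit path adds outlier decomposition plus quantize/dequantize steps. A toy micro-benchmark sketch (assumes a CUDA GPU with bitsandbytes installed; numbers are illustrative only):

```python
import time

import torch
import bitsandbytes as bnb

d = 4096
x = torch.randn(8, d, dtype=torch.float16, device="cuda")

# Plain fp16 linear layer as the baseline.
fp16 = torch.nn.Linear(d, d, bias=False, dtype=torch.float16, device="cuda")

# 8-bit layer built from the same weights; quantization happens on .cuda().
int8 = bnb.nn.Linear8bitLt(d, d, bias=False, has_fp16_weights=False, threshold=6.0)
int8.weight = bnb.nn.Int8Params(fp16.weight.data.clone().cpu(), requires_grad=False)
int8 = int8.cuda()


def bench(layer, iters=100):
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        layer(x)
    torch.cuda.synchronize()
    return (time.time() - start) / iters


print(f"fp16 : {bench(fp16) * 1e3:.3f} ms/iter")
print(f"int8 : {bench(int8) * 1e3:.3f} ms/iter")
```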

Narsil avatar Jul 26 '23 08:07 Narsil

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions[bot] avatar Jul 26 '24 01:07 github-actions[bot]