
Support for 4bit quantization

Open rahuldshetty opened this issue 1 year ago • 11 comments

Feature request

It seems bitsandbytes>=0.39.0 now supports loading models with 4-bit quantization. Link: FP4 Quantization

Motivation

Running really large language models on smaller GPUs.

Your contribution

The plan would be to upgrade the bitsandbytes package and provide an ENV variable to control which quantization method is used when running the server.
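A minimal sketch of the idea, assuming transformers' `BitsAndBytesConfig` (transformers>=4.30) on top of bitsandbytes>=0.39. The `QUANTIZE` variable name and this loading path are illustrative only, not the actual TGI launcher interface:

```python
import os

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Hypothetical env var: "none", "bitsandbytes" (8-bit) or "bitsandbytes-nf4" (4-bit).
quantize = os.environ.get("QUANTIZE", "none")

kwargs = {"torch_dtype": torch.float16, "device_map": "auto"}
if quantize == "bitsandbytes":
    kwargs["load_in_8bit"] = True
elif quantize == "bitsandbytes-nf4":
    kwargs["quantization_config"] = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",            # FP4 is also available via "fp4"
        bnb_4bit_compute_dtype=torch.float16,
    )

# Quantization happens on the fly while the checkpoint is loaded.
model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-560m", **kwargs)
```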

rahuldshetty avatar Jun 13 '23 15:06 rahuldshetty

We're implementing GPTQ (https://github.com/huggingface/text-generation-inference/pull/438), which to the best of my knowledge has better latency than bitsandbytes.

Narsil avatar Jun 14 '23 07:06 Narsil

bitsandbytes 4-bit quantization can be applied at load time, without needing to explicitly convert the model to another format. I am not sure whether GPTQ does the same.

captainst avatar Jun 18 '23 14:06 captainst

Nope, GPTQ requires calibration data, but the final latency is the key thing we're after. And bitsandbytes 8-bit is really slow; I'm not sure about the 4-bit, but I'd imagine it's the same.
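To illustrate the difference, here is a rough sketch of a GPTQ conversion using the auto-gptq package (names follow its API as I understand it). Unlike bitsandbytes, it needs a one-off calibration pass over sample data and produces a converted checkpoint:

```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

model_id = "facebook/opt-125m"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Calibration samples; real conversions use a few hundred representative texts.
examples = [
    tokenizer("text-generation-inference serves large language models.", return_tensors="pt")
]

quantize_config = BaseQuantizeConfig(bits=4, group_size=128, desc_act=False)
model = AutoGPTQForCausalLM.from_pretrained(model_id, quantize_config)

model.quantize(examples)                    # one-off calibration pass over the data
model.save_quantized("opt-125m-gptq-4bit")  # converted weights are saved for serving
```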

Narsil avatar Jun 19 '23 08:06 Narsil

https://twitter.com/Tim_Dettmers/status/1676352492190433280?s=20

4-bit will be 6.8x faster than before, so maybe we can discuss again whether it is worth replacing the 8-bit path with 4-bit once the new version is released.

flozi00 avatar Jul 05 '23 04:07 flozi00

https://twitter.com/Tim_Dettmers/status/1677826353457143808

The release is tomorrow, and it's faster than 16-bit. @Narsil, open to discussing this again?

flozi00 avatar Jul 09 '23 20:07 flozi00

We'll definitely bench it.

We did add PagedAttention because it did provide a lot of benefit.

Narsil avatar Jul 10 '23 07:07 Narsil

The layers file has a Linear8bitLt wrapper, but I can't find where it is used. Searching, I only find the `load_in_8bit` param of the `from_pretrained` function. Would it be much more work than changing the `from_pretrained` param to 4-bit?

flozi00 avatar Jul 10 '23 08:07 flozi00

https://github.com/huggingface/text-generation-inference/blob/main/server/text_generation_server/utils/layers.py#L133-L164

All the code is there indeed.
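For a 4-bit analogue, here is a rough sketch of what a wrapper could look like, assuming the bitsandbytes>=0.39 `Linear4bit`/`Params4bit` API; this is illustrative, not the actual layers.py implementation:

```python
from typing import Optional

import torch
from bitsandbytes.nn import Linear4bit, Params4bit


def wrap_linear_4bit(
    weight: torch.Tensor, bias: Optional[torch.Tensor], quant_type: str = "nf4"
) -> Linear4bit:
    """Hypothetical helper: wrap existing fp16 weights in a 4-bit linear layer."""
    out_features, in_features = weight.shape
    linear = Linear4bit(
        in_features,
        out_features,
        bias=bias is not None,
        compute_dtype=torch.float16,
        quant_type=quant_type,
    )
    linear.weight = Params4bit(weight.data, requires_grad=False, quant_type=quant_type)
    if bias is not None:
        linear.bias = torch.nn.Parameter(bias)
    # bitsandbytes performs the actual 4-bit quantization when the module is
    # moved to the GPU.
    return linear.cuda()
```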

Narsil avatar Jul 10 '23 08:07 Narsil

Is the reason bitsandbytes 8-bit is slower than even default fp16 related to the flash attention kernels?

ipoletaev avatar Jul 26 '23 04:07 ipoletaev

No. bitsandbytes is slow because it does more computation afaik.
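Roughly: on top of the matmul, the 8-bit path adds outlier decomposition plus quantize/dequantize steps. A toy micro-benchmark sketch (assumes a CUDA GPU with bitsandbytes installed; numbers are illustrative only):

```python
import time

import torch
import bitsandbytes as bnb

d = 4096
x = torch.randn(8, d, dtype=torch.float16, device="cuda")

# Plain fp16 linear layer as the baseline.
fp16 = torch.nn.Linear(d, d, bias=False, dtype=torch.float16, device="cuda")

# 8-bit layer built from the same weights; quantization happens on .cuda().
int8 = bnb.nn.Linear8bitLt(d, d, bias=False, has_fp16_weights=False, threshold=6.0)
int8.weight = bnb.nn.Int8Params(fp16.weight.data.clone().cpu(), requires_grad=False)
int8 = int8.cuda()


def bench(layer, iters=100):
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        layer(x)
    torch.cuda.synchronize()
    return (time.time() - start) / iters


print(f"fp16 : {bench(fp16) * 1e3:.3f} ms/iter")
print(f"int8 : {bench(int8) * 1e3:.3f} ms/iter")
```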

Narsil avatar Jul 26 '23 08:07 Narsil

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions[bot] avatar Jul 26 '24 01:07 github-actions[bot]