
QLoRA Support

Open sam-h-bean opened this issue 1 year ago • 7 comments

Feature request

Add 4-bit quantization support once bitsandbytes releases it.

Motivation

Run larger models easily and with good performance.

Your contribution

I could make a PR if this is a reasonably easy first task or two.
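For reference, once bitsandbytes ships its 4-bit kernels, the user-facing path on the transformers side should look roughly like the sketch below (illustrative only; the model name and settings are examples, not what TGI would use internally):

```python
# Rough sketch of 4-bit (NF4) loading through transformers + bitsandbytes.
# Model name and settings are illustrative only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "huggyllama/llama-7b"  # example model

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # quantize weights to 4-bit on load
    bnb_4bit_quant_type="nf4",             # NF4 data type from the QLoRA paper
    bnb_4bit_compute_dtype=torch.float16,  # matmuls still run in fp16
    bnb_4bit_use_double_quant=True,        # also quantize the quantization constants
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",
)

inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```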

sam-h-bean avatar May 30 '23 18:05 sam-h-bean

Hello,

Wondering if you have any insight into the inference performance of bitsandbytes 4-bit compared to 8-bit. Will it be any better? In my experience, 8-bit is around 8x slower than fp16. Yet to try LLaMA GPTQ (waiting for it to be available in this server).

gsaivinay avatar May 30 '23 18:05 gsaivinay

Hello,

> Wondering if you have any insight into the inference performance of bitsandbytes 4-bit compared to 8-bit. Will it be any better? In my experience, 8-bit is around 8x slower than fp16. Yet to try LLaMA GPTQ (waiting for it to be available in this server).

Is it planned to support GPTQ models with this server?

schauppi avatar May 31 '23 06:05 schauppi

> In my experience, 8-bit is around 8x slower than fp16

Yes, bitsandbytes introduces a significant CPU bottleneck, and its kernels are slower than the native ones. That is expected from this type of online quantization strategy.

> the inference performance of bitsandbytes 4-bit compared to 8-bit

We are working with the author of bnb but I don't have numbers ready to share at this moment.
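In the meantime, anyone who wants to put rough numbers on the gap on their own hardware can do a quick comparison with plain transformers, something like the sketch below (the model name and token count are arbitrary; this measures end-to-end generate latency, not kernel time):

```python
# Rough fp16 vs 8-bit latency comparison with plain transformers.
# Model name and token count are arbitrary examples.
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "huggyllama/llama-7b"  # example model
tokenizer = AutoTokenizer.from_pretrained(model_id)
prompt = tokenizer("The quick brown fox", return_tensors="pt").to("cuda")

def seconds_per_token(model, n_tokens=64):
    # Time a fixed-length greedy generation and report per-token latency.
    torch.cuda.synchronize()
    start = time.perf_counter()
    model.generate(**prompt, max_new_tokens=n_tokens, min_new_tokens=n_tokens, do_sample=False)
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / n_tokens

fp16 = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)
print("fp16 s/token:", seconds_per_token(fp16))
del fp16
torch.cuda.empty_cache()

int8 = AutoModelForCausalLM.from_pretrained(
    model_id, load_in_8bit=True, device_map="auto"
)
print("int8 s/token:", seconds_per_token(int8))
```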

> GPTQ (waiting for it to be available in this server)

This will be available in the future. We need to iterate on the design a bit more, but it is already powering some of our key Hugging Face Inference API models.

OlivierDehaene avatar May 31 '23 07:05 OlivierDehaene

Thanks @OlivierDehaene. Is there any support for LoRA?
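Or is the recommended path for now to merge the LoRA weights into the base model with peft and serve the merged checkpoint like any other model? Something like the sketch below (the adapter repo name is just a placeholder):

```python
# Sketch: merge a LoRA adapter into its base model with peft so the merged
# weights can be served like a normal checkpoint. Adapter name is a placeholder.
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_id = "huggyllama/llama-7b"            # example base model
adapter_id = "your-org/your-lora-adapter"  # placeholder adapter repo

base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.float16)
model = PeftModel.from_pretrained(base, adapter_id)
merged = model.merge_and_unload()          # fold the LoRA deltas into the base weights

merged.save_pretrained("merged-model")     # point the server at this directory
AutoTokenizer.from_pretrained(base_id).save_pretrained("merged-model")
```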

tienthanhdhcn avatar May 31 '23 23:05 tienthanhdhcn

Yep, 4-bit inference with bnb is super slow.

GPTQ is pretty fast though. On my hardware it's actually faster than running inference in fp16.

There's a high-level library called AutoGPTQ (https://github.com/PanQiWei/AutoGPTQ) that makes adding GPTQ support a couple of lines of work (the original gptq-for-llama library is tougher to integrate and tends to have random breaking changes).
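For example, loading an already-quantized GPTQ checkpoint with it is roughly this (just a sketch; the repo name below is a placeholder):

```python
# Rough sketch of loading a pre-quantized GPTQ checkpoint with AutoGPTQ.
# The model repo name is a placeholder.
from auto_gptq import AutoGPTQForCausalLM
from transformers import AutoTokenizer

model_id = "some-user/llama-7b-gptq"  # placeholder GPTQ repo

tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)
model = AutoGPTQForCausalLM.from_quantized(
    model_id,
    device="cuda:0",
    use_safetensors=True,  # load .safetensors weights if the repo provides them
)

inputs = tokenizer("Hello, my name is", return_tensors="pt").to("cuda:0")
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```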

TL;DR: Would love to see GPTQ support added. It's the only way I can load larger models.

LoopControl avatar Jun 05 '23 01:06 LoopControl

> Would love to see GPTQ support added

There is a PR open for adding GPTQ support for LLaMA (#267); not sure if it will be amended to support all the other models as well. Eagerly waiting for this.

gsaivinay avatar Jun 06 '23 10:06 gsaivinay

Not in this PR; this PR is the dirty work. There's a lot of legwork, but yes, all models will be supported as much out of the box as possible.

Narsil avatar Jun 06 '23 10:06 Narsil

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions[bot] avatar Jul 31 '24 01:07 github-actions[bot]