text-generation-inference
QLoRA Support
Feature request
Add 4-bit quantization support once bitsandbytes releases it (a rough sketch of how this could look is included below).
Motivation
Run larger models easily and performantly
Your contribution
I could make a PR if this is a reasonably easy first task or two.
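For context on what the request covers, here is a minimal sketch of 4-bit loading through the BitsAndBytesConfig API that transformers later exposed for bitsandbytes; the model id is only a placeholder, not something named in this thread.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Placeholder model id; any causal LM on the Hub works the same way.
model_id = "huggyllama/llama-7b"

# NF4 4-bit quantization config, as exposed once bitsandbytes shipped 4-bit support.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # weights are quantized to 4-bit as they are loaded
)

inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```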
Hello,
Wondering if you have any insight into what the inference performance of bitsandbytes 4bit would be compared to 8bit. Will it be any better? In my experience, 8bit is around 8x slower compared to fp16. Yet to try llama GPTQ (waiting for it to be available in this server).
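Not from the thread, but a minimal sketch of how one could put numbers on the fp16 vs 8-bit gap described above using plain transformers; the model id, prompt, and token counts are placeholders, and the timing is a rough per-token estimate that ignores prompt processing.

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "huggyllama/llama-7b"  # placeholder; use whatever model you are comparing
prompt = "The quick brown fox"
tokenizer = AutoTokenizer.from_pretrained(model_id)

def timed_generate(model, n_tokens=64):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    torch.cuda.synchronize()
    start = time.time()
    model.generate(**inputs, max_new_tokens=n_tokens, do_sample=False)
    torch.cuda.synchronize()
    return (time.time() - start) / n_tokens  # rough seconds per generated token

# fp16 baseline
fp16_model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)
print(f"fp16: {timed_generate(fp16_model) * 1e3:.1f} ms/token")
del fp16_model
torch.cuda.empty_cache()

# bitsandbytes 8-bit
int8_model = AutoModelForCausalLM.from_pretrained(
    model_id, load_in_8bit=True, device_map="auto"
)
print(f"int8: {timed_generate(int8_model) * 1e3:.1f} ms/token")
```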
Is it planned to support GPTQ models with this server?
In my experience, 8bit is around 8x slower compared to fp16
Yes, bitsandbytes introduces a significant CPU bottleneck, and its kernels are slower than the native ones. That is expected from this type of online quantization strategy (a layer-level sketch follows this reply).
what the inference performance of bitsandbytes 4bit would be compared to 8bit
We are working with the author of bnb, but I don't have numbers ready to share at the moment.
GPTQ (waiting for it to be available in this server)
This will be available in the future. We need to iterate on the design a bit more, but it is already powering some of our key Hugging Face Inference API models.
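As a rough illustration of the per-layer overhead mentioned above, here is a hypothetical micro-benchmark (not from the thread) comparing a plain fp16 nn.Linear against bitsandbytes' Linear8bitLt; the layer shapes, batch size, and threshold are illustrative only.

```python
import time
import torch
import bitsandbytes as bnb

# Illustrative shapes only.
d_in, d_out, batch = 4096, 4096, 8

fp16_layer = torch.nn.Linear(d_in, d_out, bias=False).half()
int8_layer = bnb.nn.Linear8bitLt(
    d_in, d_out, bias=False, has_fp16_weights=False, threshold=6.0
)
int8_layer.load_state_dict(fp16_layer.state_dict())

fp16_layer = fp16_layer.cuda()
int8_layer = int8_layer.cuda()  # weights are quantized to int8 on transfer

x = torch.randn(batch, d_in, dtype=torch.float16, device="cuda")

def bench(layer, n_iters=100):
    # Average forward-pass time over n_iters runs.
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(n_iters):
        layer(x)
    torch.cuda.synchronize()
    return (time.time() - start) / n_iters

print(f"fp16 linear: {bench(fp16_layer) * 1e3:.3f} ms")
print(f"int8 linear: {bench(int8_layer) * 1e3:.3f} ms")
```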
Thanks @OlivierDehaene, is there any support for LoRA?
Yep, 4bit inference with bnb is super slow.
GPTQ is pretty fast though. On my hardware it's actually faster than fp16 inference.
There's a high-level library called AutoGPTQ (https://github.com/PanQiWei/AutoGPTQ) that makes adding GPTQ support just a couple of lines (see the sketch after this comment); the original gptq-for-llama library is tougher to integrate and tends to have random breaking changes.
TLDR: Would love to see GPTQ support added. It's the only way I can load larger models.
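For reference, loading an already-quantized GPTQ checkpoint with AutoGPTQ looks roughly like the sketch below; the checkpoint path is a placeholder, and the exact arguments may differ between AutoGPTQ versions.

```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

# Placeholder path/repo id for a model that has already been quantized with GPTQ.
quantized_model_dir = "path/to/gptq-quantized-model"

tokenizer = AutoTokenizer.from_pretrained(quantized_model_dir)
model = AutoGPTQForCausalLM.from_quantized(
    quantized_model_dir,
    device="cuda:0",
    use_safetensors=True,
)

inputs = tokenizer("Hello, my name is", return_tensors="pt").to("cuda:0")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```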
Would love to see GPTQ support added
There is a PR open for adding GPTQ support for llama (#267); not sure if it will be amended to support all the other models as well. Eagerly waiting for this.
Not in this PR; this PR is the dirty work. There's a lot of legwork, but yes, all models will be supported as much out of the box as possible.
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.