[Feature]: bitsandbytes support
🚀 The feature, motivation and pitch
Bitsandbytes 4-bit quantization support. I know many people want this. It has been discussed before and marked as unplanned, but I took a look at how TGI implemented it: https://github.com/huggingface/text-generation-inference/blob/main/server/text_generation_server/utils/layers.py#L285 (and TGI is based on vLLM, of course).
Alternatives
I know that GPTQ quantization is better than bitsandbytes 4-bit, but bitsandbytes is great for merged QLoRA PEFT models, while it is almost impossible to GPTQ/AWQ-quantize a bitsandbytes 4-bit model (and I am not even talking about the NF4 vs. FP4 perplexity question), since that is not officially supported. Others do sometimes manage to quantize a merged bitsandbytes QLoRA model to GPTQ or AWQ, but I, for example, have not.
Additional context
As I mentioned above, https://github.com/huggingface/text-generation-inference/blob/main/server/text_generation_server/utils/layers.py#L285 looks like a very simple implementation of the Linear4bit class for bitsandbytes. I could open a PR for vLLM myself; I just wondered why it has not been added already. Maybe I am missing something?
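For reference, here is a rough sketch of what such a layer looks like, loosely adapted from the TGI code linked above and using bitsandbytes' Params4bit / matmul_4bit API. This is only an illustration of the approach, not vLLM code:

```python
# Rough sketch of a bnb 4-bit linear layer, adapted from the TGI layer linked
# above. Illustrative only; not the vLLM implementation.
from typing import Optional

import torch
import bitsandbytes as bnb
from bitsandbytes.nn import Params4bit


class Linear4bit(torch.nn.Module):
    def __init__(self, weight: torch.Tensor, bias: Optional[torch.Tensor], quant_type: str = "nf4"):
        super().__init__()
        # Params4bit quantizes the weight block-wise when it is moved to the GPU
        # and keeps the quantization state (absmax, blocksize, code) alongside it.
        self.weight = Params4bit(
            weight.data, requires_grad=False, compress_statistics=True, quant_type=quant_type
        )
        self.weight.cuda(weight.device)  # moving to GPU triggers the quantization
        self.bias = bias

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        bias = None if self.bias is None else self.bias.to(x.dtype)
        # matmul_4bit dequantizes the 4-bit blocks on the fly and runs the GEMM.
        return bnb.matmul_4bit(
            x, self.weight.t(), bias=bias, quant_state=self.weight.quant_state
        )
```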
BNB 4-bit is a very useful feature. Many models don't have GPTQ or AWQ quantized versions, and quantizing a large model with post-training methods takes real work.
Everyone knows that post-training quantization gives better performance, but many people like me don't care about a small quality loss when trying out a demo.
After the release of Llama 3, I can only run the 8B version with vLLM, and I have to switch to Ollama to run the 70B version.
want +1
+1
want +1
+1
+1
+1
It would be very useful for QLoRA fine-tuned models. Is there a roadmap for this addition?
+1
+1
+1
+1
Please stop commenting +1; just react to the original post with the thumbs-up emoji. Such comments add no value and notify everyone subscribed to this issue.
Refer to: https://github.com/vllm-project/vllm/pull/4776
want +1
related to https://github.com/vllm-project/vllm/issues/3339
What's required to implement this? FP4 and NF4 support?
It seems like bnb uses a format with 2 exponent bits and 1 mantissa bit for FP4. https://github.com/TimDettmers/bitsandbytes/blob/25abf8d95f8a33f38e2ce6f637768b442379ccd9/bitsandbytes/functional.py#L1049-L1059
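For intuition, here is a small sketch enumerating the values of a generic 4-bit float with 1 sign, 2 exponent and 1 mantissa bit. The exponent bias of 1 and the subnormal handling are my assumptions; the codebook that bnb builds in the linked function is also normalized against the block absmax, so the concrete values can differ:

```python
# Enumerate all 16 codes of a toy 4-bit float (1 sign, 2 exponent, 1 mantissa bit).
# Assumes exponent bias = 1 and subnormals for e == 0; this only illustrates how
# few values FP4 can represent, not the exact bitsandbytes codebook.
def fp4_value(bits: int, bias: int = 1) -> float:
    sign = -1.0 if (bits >> 3) & 1 else 1.0
    e = (bits >> 1) & 0b11   # 2 exponent bits
    m = bits & 0b1           # 1 mantissa bit
    if e == 0:               # subnormal: no implicit leading 1
        return sign * (m / 2) * 2 ** (1 - bias)
    return sign * (1 + m / 2) * 2 ** (e - bias)

values = sorted({fp4_value(b) for b in range(16)})
print(values)  # 15 distinct values (+0/-0 collapse): 0.0, ±0.5, ±1, ±1.5, ±2, ±3, ±4, ±6
```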
+1
Hi, those who need this feature should check out what @chenqianfzh is working on here: https://github.com/vllm-project/vllm/pull/4776
Hi team, when can we expect this feature?
+1, any update on this? It seems @chenqianfzh's https://github.com/vllm-project/vllm/pull/4776 is not working with Llama 3.
bitsandbytes is now supported: https://docs.vllm.ai/en/latest/quantization/supported_hardware.html
It's not working for Llama 3. In https://github.com/bd-iaas-us/vllm/blob/e16bcb69495540b21a3bd9423cdd5df8a78405ea/tests/quantization/test_bitsandbytes.py, replace the model with Llama 3 8B and the tests fail. @hmellor @chenqianfzh
@hmellor, how do you load in 8-bit? This version only seems to be able to load in 4-bit via quantization="bitsandbytes", load_format="bitsandbytes".
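For anyone arriving here, a minimal sketch of the 4-bit path mentioned in the question above (the model id is just a placeholder; whether an 8-bit path exists is exactly the open question):

```python
# Minimal example of loading a model with in-flight bnb 4-bit quantization in
# vLLM, using the quantization/load_format arguments mentioned above.
from vllm import LLM, SamplingParams

llm = LLM(
    model="huggyllama/llama-7b",      # placeholder model id
    quantization="bitsandbytes",
    load_format="bitsandbytes",
)

outputs = llm.generate(
    ["Hello, my name is"],
    SamplingParams(temperature=0.8, max_tokens=32),
)
print(outputs[0].outputs[0].text)
```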