
[Feature]: bitsandbytes support

Open orellavie1212 opened this issue 1 year ago • 17 comments

🚀 The feature, motivation and pitch

Bitsandbytes 4-bit quantization support. I know many want this; it has been discussed before and marked as unplanned, but I looked at how TGI implemented it: https://github.com/huggingface/text-generation-inference/blob/main/server/text_generation_server/utils/layers.py#L285 And TGI is, of course, based on vLLM.

Alternatives

I know that GPTQ is a better quantization than b&b 4-bit, but b&b is great for QLoRA-merged PEFT models, while it is almost impossible to GPTQ/AWQ-quantize a b&b 4-bit model (and I am not even getting into the nf4 vs. fp4 perplexity question), since that is not officially supported (others do sometimes succeed in quantizing a merged b&b QLoRA model to GPTQ or AWQ, but I, for example, don't).

Additional context

As I mentioned above, https://github.com/huggingface/text-generation-inference/blob/main/server/text_generation_server/utils/layers.py#L285 looks like a very simple implementation of the Linear4bit class for b&b. I could open a PR for vLLM myself; I just wondered why it has not been added yet. Am I missing something? A rough sketch of the idea is below.
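
For illustration only, here is a minimal sketch of a bnb 4-bit linear layer in the spirit of TGI's Linear4bit. The class name and details are mine (not vLLM internals), it assumes the incoming weight already lives on a CUDA device, and the actual integration with vLLM's weight loading would of course look different:

```python
import torch
import torch.nn as nn
import bitsandbytes as bnb
from bitsandbytes.nn import Params4bit


class BnbLinear4bit(nn.Module):
    """Sketch of a 4-bit linear layer wrapping bitsandbytes, TGI-style."""

    def __init__(self, weight: torch.Tensor, bias, quant_type: str = "nf4"):
        super().__init__()
        # Pack the fp16/bf16 weight into 4-bit storage; the actual block-wise
        # quantization happens when the parameter is moved to the GPU below.
        self.weight = Params4bit(
            weight.data,
            requires_grad=False,
            compress_statistics=True,
            quant_type=quant_type,
        )
        self.weight.cuda(weight.device)  # assumes weight.device is a CUDA device
        self.bias = bias

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # matmul_4bit dequantizes block-wise on the fly inside the kernel.
        return bnb.matmul_4bit(
            x, self.weight.t(), bias=self.bias, quant_state=self.weight.quant_state
        )
```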

orellavie1212 avatar Apr 12 '24 14:04 orellavie1212

BNB 4-bit is a very useful feature. Many models don't have GPTQ or AWQ quantized versions, and quantizing a large model with post-training methods takes real work.

Everyone knows that post-training quantization gives better quality, but many of us don't mind the small quality loss when trying out a demo product.

EvilPsyCHo avatar Apr 19 '24 17:04 EvilPsyCHo

After the release of Llama 3, I can only run the 8B version with vLLM, and I have to switch to Ollama to run the 70B version.

EvilPsyCHo avatar Apr 19 '24 17:04 EvilPsyCHo

want +1

oushu1zhangxiangxuan1 avatar Apr 23 '24 03:04 oushu1zhangxiangxuan1

+1

kevaldekivadiya2415 avatar Apr 26 '24 06:04 kevaldekivadiya2415

want +1

Lu0Key avatar Apr 27 '24 11:04 Lu0Key

+1

Would be great to run CohereForAI/c4ai-command-r-plus-4bit.

timbmg avatar Apr 27 '24 11:04 timbmg

+1

cheney369 avatar Apr 30 '24 01:04 cheney369

+1

warlockedward avatar May 01 '24 14:05 warlockedward

+1

aaron-imani avatar May 01 '24 21:05 aaron-imani

It would be very useful for QLoRA fine-tuned models. Is there a roadmap for this addition?

javierquin avatar May 02 '24 18:05 javierquin

+1

dhruvil237 avatar May 03 '24 09:05 dhruvil237

+1

dariemp avatar May 06 '24 15:05 dariemp

+1

qashzar avatar May 06 '24 21:05 qashzar

+1

salt00n9 avatar May 08 '24 09:05 salt00n9

Please stop commenting +1; just react to the original post with the thumbs-up emoji. Such comments add no value and notify everyone subscribed to this issue.

qdm12 avatar May 10 '24 11:05 qdm12

Refer to: https://github.com/vllm-project/vllm/pull/4776

jeejeelee avatar May 13 '24 02:05 jeejeelee

want +1

Vegetable-Chicken-Coder avatar May 13 '24 07:05 Vegetable-Chicken-Coder

related to https://github.com/vllm-project/vllm/issues/3339

duchengyao avatar May 20 '24 03:05 duchengyao

What's required to implement this? FP4 and NF4 support?

It seems like bnb uses a format with 2 exponent bits and 1 mantissa bit for FP4: https://github.com/TimDettmers/bitsandbytes/blob/25abf8d95f8a33f38e2ce6f637768b442379ccd9/bitsandbytes/functional.py#L1049-L1059
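
Not from the linked code, but a quick way to compare the two 4-bit formats is a round-trip through bnb's own quantize_4bit/dequantize_4bit. This is an illustrative sketch (requires a CUDA GPU and bitsandbytes installed):

```python
import torch
import bitsandbytes.functional as F

# Compare the round-trip error of bnb's fp4 vs nf4 block-wise 4-bit quantization.
w = torch.randn(4096, 4096, dtype=torch.float16, device="cuda")
for quant_type in ("fp4", "nf4"):
    packed, state = F.quantize_4bit(w, quant_type=quant_type)
    w_hat = F.dequantize_4bit(packed, state)  # state carries the quant_type
    print(f"{quant_type}: mean abs error = {(w - w_hat).abs().mean().item():.5f}")
```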

epignatelli avatar May 20 '24 07:05 epignatelli

+1

flaviusburca avatar May 26 '24 18:05 flaviusburca

Hi, those who need this feature should check out what @chenqianfzh is working on here: https://github.com/vllm-project/vllm/pull/4776

jeejeelee avatar May 27 '24 02:05 jeejeelee

Hi team, when can we expect this feature?

VpkPrasanna avatar Jun 07 '24 13:06 VpkPrasanna

+1. Any update on this? It seems @chenqianfzh's https://github.com/vllm-project/vllm/pull/4776 is not working with Llama 3.

devlup avatar Jul 01 '24 17:07 devlup

bitsandbytes is now supported https://docs.vllm.ai/en/latest/quantization/supported_hardware.html
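
For reference, a minimal loading sketch (the model name here is just an example, and bitsandbytes must be installed):

```python
from vllm import LLM, SamplingParams

# In-flight bnb 4-bit quantization of an unquantized checkpoint (example model).
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    quantization="bitsandbytes",
    load_format="bitsandbytes",
)
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```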

hmellor avatar Jul 04 '24 13:07 hmellor

It's not working for Llama 3. In https://github.com/bd-iaas-us/vllm/blob/e16bcb69495540b21a3bd9423cdd5df8a78405ea/tests/quantization/test_bitsandbytes.py, replace the model with Llama 3 8B and the tests fail. @hmellor @chenqianfzh

devlup avatar Jul 08 '24 15:07 devlup

@hmellor, how do you load in 8-bit? This version seems to only be able to load in 4-bit via quantization="bitsandbytes", load_format="bitsandbytes"?

junzhang-zj avatar Aug 17 '24 10:08 junzhang-zj