
unsloth with vllm in 8/4 bits

Open quancore opened this issue 11 months ago • 20 comments

I have trained a QLoRA model with Unsloth and I want to serve it with vLLM, but I could not find a way to serve the model in 8 or 4 bits?

quancore avatar Mar 16 '24 22:03 quancore
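For context on the question above: vLLM serves merged checkpoints rather than LoRA adapters sitting on top of a 4-bit base, so the usual first step is merging the QLoRA adapters back into a 16-bit model. A minimal sketch using Unsloth's `save_pretrained_merged` (the checkpoint path and output directory are placeholders, not from this thread):

```python
# Sketch: merge QLoRA adapters into a 16-bit checkpoint with Unsloth,
# so the result can later be quantized or served by vLLM.
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="path/to/qlora-checkpoint",  # placeholder: your trained adapter
    load_in_4bit=True,
)

# Writes merged fp16 weights + tokenizer to "merged-16bit"
model.save_pretrained_merged("merged-16bit", tokenizer, save_method="merged_16bit")
```

The merged directory can then be passed to vLLM's `--model` argument or to a quantization tool.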

@quancore I'm not sure if vLLM allows serving in 4 or 8 bits! 16-bit yes, but unsure on 4 or 8.

danielhanchen avatar Mar 17 '24 02:03 danielhanchen

@danielhanchen I think it is: https://github.com/vllm-project/vllm/issues/1155

quancore avatar Mar 17 '24 13:03 quancore

> @danielhanchen I think it is: vllm-project/vllm#1155

Looks like they only support AWQ quantization not via bitsandbytes.

patleeman avatar Mar 19 '24 12:03 patleeman
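For readers landing here: serving an AWQ-quantized checkpoint with vLLM looks roughly like the sketch below. The model id is an illustrative AWQ checkpoint from the Hub, not one from this thread, and `quantization="awq"` is vLLM's engine argument for AWQ weights:

```python
# Sketch: load and run an AWQ-quantized model with vLLM's Python API.
# Requires a GPU; the model id is a placeholder AWQ checkpoint.
from vllm import LLM, SamplingParams

llm = LLM(
    model="TheBloke/Mistral-7B-Instruct-v0.2-AWQ",  # example AWQ repo
    quantization="awq",
)
outputs = llm.generate(["Hello, world"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```

The same flag is available on the OpenAI-compatible server as `--quantization awq`.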

@patleeman Oh yes, AWQ is great - I'm assuming you want to quantize it to AWQ?

danielhanchen avatar Mar 20 '24 04:03 danielhanchen

@patleeman @danielhanchen well yes - maybe we should support AWQ so we can use QLoRA models with vLLM?

quancore avatar Mar 20 '24 17:03 quancore

Hello there. I am also interested in using an 8/4-bit model trained with Unsloth in vLLM. Currently it works fine in 16 bits but requires too much VRAM. Is there a way to quantize a model trained with Unsloth using AWQ or GPTQ?

marcelodiaz558 avatar Apr 06 '24 03:04 marcelodiaz558
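Until there is a built-in export path, one workaround (a sketch, not an official recipe from this thread) is to quantize the merged 16-bit checkpoint with the AutoAWQ library. The paths are placeholders, and the `quant_config` values shown are AutoAWQ's commonly used defaults:

```python
# Sketch: quantize a merged Unsloth checkpoint to 4-bit AWQ with AutoAWQ.
# Requires a GPU and a calibration pass; paths are placeholders.
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "merged-16bit"        # merged LoRA checkpoint (placeholder)
quant_path = "model-awq"           # output directory (placeholder)
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

model.quantize(tokenizer, quant_config=quant_config)  # runs calibration
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```

The resulting directory can then be served with vLLM using `quantization="awq"`.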

Whoops this missed me - yep having an option to convert it to AWQ is interesting

danielhanchen avatar May 17 '24 17:05 danielhanchen

> Whoops this missed me - yep having an option to convert it to AWQ is interesting

That would be amazing - is this a feature you are planning on adding in the near future?

Louis2B2G avatar Jun 05 '24 12:06 Louis2B2G

Yep for a future release!

danielhanchen avatar Jun 06 '24 16:06 danielhanchen

I'm down to volunteer to work on this, if you're accepting community contributions. (I have to do this for my day job anyway, so it might be nice to contribute to the library.)

amir-in-a-cynch avatar Jun 15 '24 23:06 amir-in-a-cynch

@amir-in-a-cynch do you plan to do it?

Serega6678 avatar Jun 24 '24 10:06 Serega6678

> @amir-in-a-cynch do you plan to do it?

I'll take a stab at it tomorrow and Wednesday. Not sure if it'll end up being a clean integration into the library's API (since it adds a dependency), but at worst we should be able to put together an example notebook for the docs showing how to do it.

amir-in-a-cynch avatar Jun 24 '24 11:06 amir-in-a-cynch

@amir-in-a-cynch great, keep me posted - I don't mind giving you a helping hand if you get stuck at some point

Serega6678 avatar Jun 24 '24 13:06 Serega6678

I think vLLM's path to 8 bits is through AWQ - you can also enable float8 support (if your GPU supports it)

danielhanchen avatar Jul 01 '24 00:07 danielhanchen
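As a sketch of the float8 option mentioned above: vLLM exposes a `kv_cache_dtype` engine argument that stores the KV cache in fp8 on GPUs that support it, which cuts cache memory roughly in half. The model path is a placeholder:

```python
# Sketch: enable fp8 KV-cache storage in vLLM (GPU support required).
# This quantizes the KV cache, not the model weights.
from vllm import LLM

llm = LLM(
    model="merged-16bit",      # placeholder: your merged checkpoint
    kv_cache_dtype="fp8",      # fp8 KV cache on supported GPUs
)
```

The equivalent server flag is `--kv-cache-dtype fp8` on the OpenAI-compatible entrypoint.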