FastChat

GPTQ 4bit support

steppige opened this issue · 3 comments

Hi! Congratulations on your project... I think it's the only way to run FastChat with GPTQ 4-bit models.

Could you update to the latest version?

Thank you, bye!

steppige · Apr 18 '23

I've tried several different ways of merging the GPTQ code with FastChat, but I keep failing when running a 4-bit quantized model on multiple GPUs. I go back and forth between memory access errors and errors about tensors being unexpectedly found on multiple devices (e.g., cuda:0 and cuda:1).
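For reference, here is a minimal sketch of the kind of explicit multi-GPU sharding that usually avoids the "tensors on multiple devices" error, assuming a transformers + accelerate setup. The checkpoint path and per-GPU memory caps are placeholders, and a real GPTQ checkpoint would additionally need a GPTQ-aware loader:

```python
import torch
from transformers import AutoModelForCausalLM

# Shard the model across two GPUs using accelerate's device_map support.
# Requires `pip install accelerate`; path and memory caps are placeholders.
model = AutoModelForCausalLM.from_pretrained(
    "decapoda-research/llama-7b-hf",      # placeholder LLaMA checkpoint
    torch_dtype=torch.float16,
    device_map="auto",                    # let accelerate place the layers
    max_memory={0: "10GiB", 1: "10GiB"},  # cap usage per GPU
)
```

Letting accelerate compute the device map keeps each layer's weights and its activations on a single device, which is what the "found on cuda:0 and cuda:1" errors are typically complaining about.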

The Linux one-click installer for oobabooga works with quantized models across multiple GPUs, so it's definitely possible. Does anyone know what's missing?

FYI, I generally get the device error when using LlamaTokenizer.from_pretrained, and the CUDA memory access fault when using AutoTokenizer. Ooba seems to use LlamaTokenizer for quantized models.
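For context, a minimal sketch of the two loading paths being compared, assuming a transformers version recent enough to ship LlamaTokenizer; the checkpoint path is a placeholder:

```python
from transformers import AutoTokenizer, LlamaTokenizer

path = "decapoda-research/llama-7b-hf"  # placeholder checkpoint

# Explicit class: loads the slow SentencePiece LLaMA tokenizer directly.
tok_slow = LlamaTokenizer.from_pretrained(path)

# AutoTokenizer resolves the class from tokenizer_config.json and may
# pick the fast (Rust) variant; early converted LLaMA checkpoints often
# carried a mismatched class name there, so the two paths could diverge.
tok_auto = AutoTokenizer.from_pretrained(path)
```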

cidtrips · Apr 21 '23

Is this still being worked on? This would really help improve support for the models we can run with FastChat.

digisomni · May 4 '23

I'll take a look and try this PR later this week.

zhisbug · May 8 '23

Closed by #1209.

merrymercy · Jun 9 '23