GPTQ 4bit support
Hi! Congratulations on your project... I think it's the only way to run FastChat with GPTQ's 4-bit models.
Could you update to the latest version?
Thank you bye!
I've tried several different ways of merging the GPTQ code with FastChat, but I keep breaking down at running a 4-bit quantized model on multiple GPUs. I go back and forth between memory access errors and errors about tensors being unexpectedly found on multiple devices (e.g. cuda:0 and cuda:1).
The Linux one-click install for oobabooga works with quantized models across multiple GPUs, so it's definitely possible. Anyone know what's missing?
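For anyone hitting the same wall, here is a minimal sketch of the standard multi-GPU dispatch path in transformers/accelerate (`device_map` plus `max_memory`). The checkpoint path is a placeholder, and this is the plain fp16 loader rather than the GPTQ-for-LLaMa one ooba ships, but whatever quantized loader is used has to end up with an equivalent per-layer device map or you get exactly those cross-device errors.

```python
# Sketch only: standard fp16 multi-GPU dispatch via accelerate's device_map.
# A GPTQ loader would need to produce an equivalent per-layer placement.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "path/to/llama-hf",                   # placeholder checkpoint directory
    torch_dtype=torch.float16,
    device_map="auto",                    # let accelerate split layers across cuda:0/cuda:1
    max_memory={0: "10GiB", 1: "10GiB"},  # cap each GPU so both actually receive layers
)
```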
FYI, I generally get the device error when using LlamaTokenizer.from_pretrained, and the CUDA memory access fault when using AutoTokenizer. Ooba seems to use LlamaTokenizer when loading quantized models.
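Purely to illustrate the two loading paths mentioned above (the model path is a placeholder): both calls are stock transformers APIs, but for LLaMA checkpoints AutoTokenizer resolves to the fast tokenizer, so the two are not guaranteed to behave identically.

```python
from transformers import AutoTokenizer, LlamaTokenizer

model_path = "path/to/llama-hf"  # placeholder checkpoint directory

# Path ooba appears to take for quantized LLaMA models:
tok_slow = LlamaTokenizer.from_pretrained(model_path)

# Generic path; for LLaMA this usually returns LlamaTokenizerFast instead:
tok_auto = AutoTokenizer.from_pretrained(model_path)
```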
Is this still being worked on? This would really help improve support for the models we can run with FastChat.
I'll take a look and try this PR later this week.
closed by #1209