GPTQ 4bit support
Hi! Congratulations on your project... I think it's the only way to run FastChat with GPTQ's 4-bit models.
Could you update to the latest version?
Thank you bye!
I've tried several different ways of merging the GPTQ code with FastChat, but I keep breaking down at running a 4-bit quantized model on multiple GPUs. I go back and forth between memory access errors and errors about tensors being unexpectedly found on multiple devices (e.g. cuda:0 and cuda:1).
The Linux one-click install for oobabooga works with quantized models across multiple GPUs, so it's definitely possible. Anyone know what's missing?
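For anyone hitting the same wall, here is a minimal sketch of the standard multi-GPU dispatch path in transformers/accelerate (`device_map` plus `max_memory`). The checkpoint path is a placeholder, and this is the plain fp16 loader rather than the GPTQ-for-LLaMa one ooba ships, but whatever quantized loader is used has to end up with an equivalent per-layer device map or you get exactly those cross-device errors.

```python
# Sketch only: standard fp16 multi-GPU dispatch via accelerate's device_map.
# A GPTQ loader would need to produce an equivalent per-layer placement.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "path/to/llama-hf",                   # placeholder checkpoint directory
    torch_dtype=torch.float16,
    device_map="auto",                    # let accelerate split layers across cuda:0/cuda:1
    max_memory={0: "10GiB", 1: "10GiB"},  # cap each GPU so both actually receive layers
)
```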
FYI, I generally get the device error when using LlamaTokenizer.from_pretrained, and the CUDA memory access fault when using AutoTokenizer. Ooba seems to use LlamaTokenizer when loading quantized models.
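Purely to illustrate the two loading paths mentioned above (the model path is a placeholder): both calls are stock transformers APIs, but for LLaMA checkpoints AutoTokenizer resolves to the fast tokenizer, so the two are not guaranteed to behave identically.

```python
from transformers import AutoTokenizer, LlamaTokenizer

model_path = "path/to/llama-hf"  # placeholder checkpoint directory

# Path ooba appears to take for quantized LLaMA models:
tok_slow = LlamaTokenizer.from_pretrained(model_path)

# Generic path; for LLaMA this usually returns LlamaTokenizerFast instead:
tok_auto = AutoTokenizer.from_pretrained(model_path)
```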
Is this still being worked on? This would really help improve support for the models we can run with FastChat.
I'll take a look and try this PR later this week.
closed by #1209