
fastchat-t5 quantization support?

Open bash99 opened this issue 2 years ago • 4 comments

Is there any way to run it in 4 GB or less of VRAM?

GGML? Or GPTQ?

bash99 avatar May 07 '23 16:05 bash99

- GGML: not yet - https://github.com/ggerganov/llama.cpp/issues/247
- GPTQ: not really - you can quantize, but the quality is poor - https://github.com/qwopqwop200/GPTQ-for-LLaMa/issues/157

bradfox2 avatar May 08 '23 04:05 bradfox2

You can try the default quantization method in FastChat (https://github.com/lm-sys/FastChat#no-enough-memory), but I haven't tested it with this model, so you may need to fix some bugs.

merrymercy avatar May 08 '23 08:05 merrymercy
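The README option referenced above is the `--load-8bit` flag on FastChat's serving CLI. For readers who want the same effect directly in Python, here is a minimal sketch using Hugging Face transformers with bitsandbytes 8-bit loading rather than FastChat's own compression code; the model id is the public lmsys checkpoint, and the exact kwargs are assumptions tied to the transformers/bitsandbytes versions of the time:

```python
# Sketch: load fastchat-t5 with int8 weights to cut VRAM roughly in half.
# Requires: pip install transformers accelerate bitsandbytes
# This uses bitsandbytes 8-bit loading, not FastChat's built-in compression.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id = "lmsys/fastchat-t5-3b-v1.0"  # public lmsys checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(
    model_id,
    load_in_8bit=True,   # int8 weights via bitsandbytes
    device_map="auto",   # spill layers to CPU if the GPU is too small
)

inputs = tokenizer("What is quantization?", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

At int8, the weights of a 3B-parameter model are around 3 GB, which is close to the 4 GB budget asked about above, before activation and KV-cache overhead.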

@bradfox2 Regarding GPTQ, is the performance degradation specific to T5 or common to all LLMs?

zhisbug avatar May 08 '23 09:05 zhisbug

@zhisbug AFAIK just T5

bradfox2 avatar May 08 '23 15:05 bradfox2

@merrymercy I tried the one in FastChat. It produced inf/nan elements in the final output; I'll need to dig into it more.

DachengLi1 avatar May 22 '23 23:05 DachengLi1
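For anyone debugging the same inf/nan symptom, one quick way to locate where non-finite values first appear is to register forward hooks on every submodule and test outputs with `torch.isfinite`. A minimal sketch, assuming a standard PyTorch model object (the `model` variable is a placeholder):

```python
# Sketch: report the first modules whose outputs contain inf/nan.
import torch

def add_nan_hooks(model):
    def make_hook(name):
        def hook(module, inputs, output):
            tensors = output if isinstance(output, tuple) else (output,)
            for t in tensors:
                if torch.is_tensor(t) and not torch.isfinite(t).all():
                    print(f"non-finite values in output of: {name}")
        return hook
    for name, module in model.named_modules():
        module.register_forward_hook(make_hook(name))

# Usage: add_nan_hooks(model), then run a single forward/generate pass
# and note the first module name printed.
```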

The following converted and quantized model, which runs on CPU only, should be helpful: https://huggingface.co/limcheekin/fastchat-t5-3b-ct2

limcheekin avatar Jun 24 '23 12:06 limcheekin
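For anyone trying that checkpoint, here is a minimal usage sketch with the CTranslate2 Python API, assuming the converted files have been downloaded to a local `fastchat-t5-3b-ct2` directory (the path and compute type are assumptions; CT2 drives T5 models through its `Translator` class):

```python
# Sketch: run a CTranslate2-converted fastchat-t5 on CPU with int8 weights.
# Requires: pip install ctranslate2 transformers sentencepiece
import ctranslate2
import transformers

translator = ctranslate2.Translator(
    "fastchat-t5-3b-ct2",   # local path to the converted model (assumption)
    device="cpu",
    compute_type="int8",
)
tokenizer = transformers.AutoTokenizer.from_pretrained("lmsys/fastchat-t5-3b-v1.0")

prompt = "What is quantization?"
# CT2 consumes token strings, not ids.
tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(prompt))
results = translator.translate_batch([tokens], max_decoding_length=64)
output_ids = tokenizer.convert_tokens_to_ids(results[0].hypotheses[0])
print(tokenizer.decode(output_ids, skip_special_tokens=True))
```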

CT2's quantization method is not GPTQ or another 'degradation-free' method, and it incurs more severe quality penalties.

bradfox2 avatar Jun 29 '23 00:06 bradfox2

> CT2's quantization method is not GPTQ or another 'degradation-free' method, and it incurs more severe quality penalties.

I'd appreciate it if you could publish evaluation metrics for a CT2 vs. GPTQ comparison. Kindly share what other 'degradation-free' methods are available here; it will benefit everyone following the thread.

Thanks.

limcheekin avatar Jun 29 '23 01:06 limcheekin
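No such metrics were published in this thread. As a starting point, a very rough comparison can be made by running the same prompts through the full-precision model and each quantized backend, then scoring output similarity and latency. A hypothetical harness, where `generate_full` and `generate_quant` are placeholder callables mapping a prompt string to generated text:

```python
# Sketch: crude quality/latency check for a quantized backend vs. full precision.
# generate_full / generate_quant are placeholders (prompt: str -> str).
import difflib
import time

prompts = [
    "What is quantization?",
    "Summarize: FastChat is an open platform for training and serving LLMs.",
]

def compare(generate_full, generate_quant):
    for p in prompts:
        ref = generate_full(p)
        t0 = time.perf_counter()
        hyp = generate_quant(p)
        latency = time.perf_counter() - t0
        # Character-level similarity is a weak proxy; a held-out-task
        # metric (e.g. perplexity or ROUGE) would be more meaningful.
        sim = difflib.SequenceMatcher(None, ref, hyp).ratio()
        print(f"{p[:40]!r}: similarity={sim:.2f}, latency={latency:.2f}s")
```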