
fastchat-t5 quantization support?

Open bash99 opened this issue 2 years ago • 4 comments

Is there any way to run it in 4 GB or less of VRAM?

GGML? Or GPTQ?

bash99 avatar May 07 '23 16:05 bash99

- GGML: not yet - https://github.com/ggerganov/llama.cpp/issues/247
- GPTQ: not really - you can quantize, but the quality is poor - https://github.com/qwopqwop200/GPTQ-for-LLaMa/issues/157

bradfox2 avatar May 08 '23 04:05 bradfox2

You can try the default quantization method in FastChat (https://github.com/lm-sys/FastChat#no-enough-memory), but I haven't tested it with this model, so you may need to fix some bugs.

merrymercy avatar May 08 '23 08:05 merrymercy
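The README option referenced above is the `--load-8bit` flag on FastChat's serving CLI. For readers who want the same effect directly in Python, here is a minimal sketch using Hugging Face transformers with bitsandbytes 8-bit loading rather than FastChat's own compression code; the model id is the public lmsys checkpoint, and the exact kwargs are assumptions tied to the transformers/bitsandbytes versions of the time:

```python
# Sketch: load fastchat-t5 with int8 weights to cut VRAM roughly in half.
# Requires: pip install transformers accelerate bitsandbytes
# This uses bitsandbytes 8-bit loading, not FastChat's built-in compression.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id = "lmsys/fastchat-t5-3b-v1.0"  # public lmsys checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(
    model_id,
    load_in_8bit=True,   # int8 weights via bitsandbytes
    device_map="auto",   # spill layers to CPU if the GPU is too small
)

inputs = tokenizer("What is quantization?", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

At int8, the weights of a 3B-parameter model are around 3 GB, which is close to the 4 GB budget asked about above, before activation and KV-cache overhead.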

@bradfox2 Regarding GPTQ, is the performance degradation specific to T5 or common to all LLMs?

zhisbug avatar May 08 '23 09:05 zhisbug

@zhisbug AFAIK just T5

bradfox2 avatar May 08 '23 15:05 bradfox2

@merrymercy I tried the one in FastChat. It produced inf/nan elements in the final output; I'll need to dig into it more.

DachengLi1 avatar May 22 '23 23:05 DachengLi1
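For anyone debugging the same inf/nan symptom, one quick way to locate where non-finite values first appear is to register forward hooks on every submodule and test outputs with `torch.isfinite`. A minimal sketch, assuming a standard PyTorch model object (the `model` variable is a placeholder):

```python
# Sketch: report the first modules whose outputs contain inf/nan.
import torch

def add_nan_hooks(model):
    def make_hook(name):
        def hook(module, inputs, output):
            tensors = output if isinstance(output, tuple) else (output,)
            for t in tensors:
                if torch.is_tensor(t) and not torch.isfinite(t).all():
                    print(f"non-finite values in output of: {name}")
        return hook
    for name, module in model.named_modules():
        module.register_forward_hook(make_hook(name))

# Usage: add_nan_hooks(model), then run a single forward/generate pass
# and note the first module name printed.
```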

The following converted and quantized model, which runs on CPU only, should be helpful: https://huggingface.co/limcheekin/fastchat-t5-3b-ct2

limcheekin avatar Jun 24 '23 12:06 limcheekin
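For anyone trying that checkpoint, here is a minimal usage sketch with the CTranslate2 Python API, assuming the converted files have been downloaded to a local `fastchat-t5-3b-ct2` directory (the path and compute type are assumptions; CT2 drives T5 models through its `Translator` class):

```python
# Sketch: run a CTranslate2-converted fastchat-t5 on CPU with int8 weights.
# Requires: pip install ctranslate2 transformers sentencepiece
import ctranslate2
import transformers

translator = ctranslate2.Translator(
    "fastchat-t5-3b-ct2",   # local path to the converted model (assumption)
    device="cpu",
    compute_type="int8",
)
tokenizer = transformers.AutoTokenizer.from_pretrained("lmsys/fastchat-t5-3b-v1.0")

prompt = "What is quantization?"
# CT2 consumes token strings, not ids.
tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(prompt))
results = translator.translate_batch([tokens], max_decoding_length=64)
output_ids = tokenizer.convert_tokens_to_ids(results[0].hypotheses[0])
print(tokenizer.decode(output_ids, skip_special_tokens=True))
```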

CT2's quantization method is not GPTQ or another 'degradation-free' method, and it incurs more severe quality penalties.

bradfox2 avatar Jun 29 '23 00:06 bradfox2

> CT2's quantization method is not GPTQ or another 'degradation-free' method, and it incurs more severe quality penalties.

I'd appreciate it if you could publish evaluation metrics for a CT2 vs. GPTQ comparison. Kindly share what other 'degradation-free' methods are available here; it will benefit everyone following the thread.

Thanks.

limcheekin avatar Jun 29 '23 01:06 limcheekin
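No such metrics were published in this thread. As a starting point, a very rough comparison can be made by running the same prompts through the full-precision model and each quantized backend, then scoring output similarity and latency. A hypothetical harness, where `generate_full` and `generate_quant` are placeholder callables mapping a prompt string to generated text:

```python
# Sketch: crude quality/latency check for a quantized backend vs. full precision.
# generate_full / generate_quant are placeholders (prompt: str -> str).
import difflib
import time

prompts = [
    "What is quantization?",
    "Summarize: FastChat is an open platform for training and serving LLMs.",
]

def compare(generate_full, generate_quant):
    for p in prompts:
        ref = generate_full(p)
        t0 = time.perf_counter()
        hyp = generate_quant(p)
        latency = time.perf_counter() - t0
        # Character-level similarity is a weak proxy; a held-out-task
        # metric (e.g. perplexity or ROUGE) would be more meaningful.
        sim = difflib.SequenceMatcher(None, ref, hyp).ratio()
        print(f"{p[:40]!r}: similarity={sim:.2f}, latency={latency:.2f}s")
```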