hqq
integrated into gpt-fast
Is it possible to easily integrate hqq's quantization and forward pass into the gpt-fast repo? gpt-fast has int8 and int4 quantization; I want to replace them with hqq and use hqq for low-bit inference while keeping the rest of the structure unchanged. What is the easiest way to do this with the least code change? Thanks for any valuable advice!
It's already integrated into torchao: https://github.com/pytorch/ao/releases/tag/v0.5.0
So you can just use `quantize_(model, int4_weight_only(group_size, use_hqq=True))`, for example.
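A minimal sketch of what that looks like end to end, assuming torchao >= 0.5.0 is installed and the model is a standard PyTorch module (e.g. one loaded by gpt-fast); the model loading is a placeholder:

```python
import torch
from torchao.quantization import quantize_, int4_weight_only

# Placeholder: load your gpt-fast model as usual.
model = load_model(...)  # hypothetical loader, replace with your own

# Apply int4 weight-only quantization with the HQQ algorithm.
# group_size controls quantization granularity (e.g. 64 or 128).
quantize_(model, int4_weight_only(group_size=128, use_hqq=True))
```

Since `quantize_` swaps the weights in place, the rest of gpt-fast's inference path (compilation, generation loop, etc.) should work unchanged.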