
Do you have any plans to support GPTQ 4-bit models?

forestsource opened this issue 2 years ago · 2 comments

It seems that the GPTQ 4-bit model format is already supported by this project: https://github.com/qwopqwop200/GPTQ-for-LLaMa

forestsource avatar Mar 17 '23 06:03 forestsource

I'd also like to know how to do this. It seems the primary bottleneck is how fast the layers can be fed to the GPU: my copy load sits at 80% while GPU load is at 10%. Is there a way to improve this somehow? I assume that if we can quantize the layers down to a quarter of their size, transfer would be almost 4x faster.
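To illustrate the size argument above, here is a minimal sketch of round-to-nearest 4-bit quantization with a per-tensor scale. This is not the GPTQ algorithm itself (GPTQ uses second-order error correction), and the helper names `quantize_4bit`/`dequantize_4bit` are hypothetical; it only demonstrates the ~4x memory saving over fp16 that would cut the copy traffic proportionally.

```python
# Hypothetical sketch, NOT the GPTQ algorithm: naive round-to-nearest
# 4-bit quantization, packing two 4-bit values per byte.
import numpy as np

def quantize_4bit(w):
    """Quantize a float array to 4-bit ints with a single per-tensor scale."""
    scale = np.abs(w).max() / 7.0                      # map values into [-7, 7]
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    qn = (q & 0x0F).astype(np.uint8)                   # two's-complement nibbles
    packed = qn[0::2] | (qn[1::2] << 4)                # two values per byte
    return packed, scale

def dequantize_4bit(packed, scale, n):
    """Unpack nibbles, sign-extend, and rescale back to float32."""
    lo = (packed & 0x0F).astype(np.int8)
    hi = (packed >> 4).astype(np.int8)
    lo = np.where(lo > 7, lo - 16, lo)                 # sign-extend 4-bit values
    hi = np.where(hi > 7, hi - 16, hi)
    q = np.empty(n, dtype=np.int8)
    q[0::2], q[1::2] = lo, hi
    return q.astype(np.float32) * scale

w = np.random.randn(4096).astype(np.float16)
packed, scale = quantize_4bit(w.astype(np.float32))
w_hat = dequantize_4bit(packed, scale, w.size)
print(w.nbytes / packed.nbytes)                        # 4.0: quarter of fp16 size
```

In practice GPTQ also uses per-group scales and calibration data to keep accuracy, but the storage layout (packed 4-bit weights plus scales) is what yields the bandwidth win discussed here.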

Danielv123 avatar Mar 17 '23 08:03 Danielv123

> It seems that the GPTQ 4-bit model format is already supported by this project: https://github.com/qwopqwop200/GPTQ-for-LLaMa

That repo is meant for the bare weights.

breadbrowser avatar Mar 17 '23 16:03 breadbrowser