llama-chat
Do you have any plans to support GPTQ 4-bit models?
It seems that GPTQ 4-bit models are already supported in this project: https://github.com/qwopqwop200/GPTQ-for-LLaMa
I'd also like to know how to do this. The primary bottleneck seems to be how fast the layers can be fed to the GPU: my copy load sits at 80% while GPU load is only at 10%. Is there a way to improve this somehow? I assume that if we can quantize the layers down to a quarter of their size, it would be almost 4x faster.
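For what it's worth, here is a rough back-of-envelope sketch of why 4-bit weights should help when the copy engine is the bottleneck. The model size and PCIe bandwidth below are assumed figures for illustration, not measurements from this project:

```python
# Rough check of the "almost 4x faster" intuition: if generation is bound by
# copying layer weights to the GPU, transfer time scales with bytes per
# parameter. PARAMS and PCIE_GBPS are illustrative assumptions.

PARAMS = 7e9        # assumed 7B-parameter LLaMA model
PCIE_GBPS = 16e9    # assumed ~16 GB/s effective PCIe bandwidth

for label, bytes_per_param in [("fp16", 2.0), ("int8", 1.0), ("int4 (GPTQ)", 0.5)]:
    total_bytes = PARAMS * bytes_per_param
    seconds = total_bytes / PCIE_GBPS
    print(f"{label:12s} ~{total_bytes / 1e9:5.1f} GB of weights, "
          f"~{seconds:4.2f} s just to stream them over the bus")
```

Under those assumptions, fp16 weights take roughly four times as long to stream as 4-bit weights, which is where the ~4x figure comes from; actual speedup also depends on compute and dequantization overhead.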
> It seems that GPTQ 4-bit models are already supported in this project: https://github.com/qwopqwop200/GPTQ-for-LLaMa
That project is meant for the bare weights.