GPTQ-for-LLaMa
What would be required to quantize a 65B model to 2-bit?
Presumably more than 130 GB of RAM, since that is roughly the size of the fp16 weights alone? How much would it slow things down to run with a swap file? Anything else? Since GPTQ gets its best results on larger models, this seems worth looking into: it would be incredible to get almost the full performance of the 65B model in only 16 GB of VRAM.
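For the raw weight storage the arithmetic is simple. A rough sketch (this is only the weight size; the quantization process itself needs extra working memory for per-layer Hessian statistics and calibration activations):

```python
# Rough weight-storage estimates for LLaMA-65B at various precisions.
# Only counts the raw weights, not GPTQ's working memory.

PARAMS = 65e9  # approximate parameter count of LLaMA-65B

def weight_gb(bits_per_weight: float) -> float:
    """Raw weight storage in GB (decimal) at the given precision."""
    return PARAMS * bits_per_weight / 8 / 1e9

for bits in (16, 4, 2):
    print(f"{bits:>2}-bit weights: {weight_gb(bits):6.1f} GB")
# 16-bit: 130.0 GB, 4-bit: 32.5 GB, 2-bit: 16.2 GB
```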
I failed to quantize a model to 2-bit. I tried 33B, but the output always came out broken. Is it possible that 33B is too small and only 65B works at 2-bit?
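For anyone who wants to reproduce the attempt, something along these lines should start a 2-bit run. This is a sketch based on the README's 4-bit usage with `--wbits` swapped to 2; the model path and output filename are placeholders, and flags may differ between repo revisions:

```python
# Hypothetical driver for a 2-bit GPTQ run with this repo's llama.py.
# Mirrors the README's 4-bit example; paths/names are placeholders.
import subprocess

subprocess.run(
    [
        "python", "llama.py",
        "/path/to/llama-33b-hf",   # placeholder: HF-format model dir
        "c4",                      # calibration dataset
        "--wbits", "2",            # 2-bit weights
        "--groupsize", "128",      # per-group quantization scales
        "--save", "llama33b-2bit-128g.pt",  # placeholder output name
    ],
    check=True,
)
```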
Don't bother: the results from 2-bit 65B are really bad. If you're still interested anyway, quantizing it took overnight with 24 GB of VRAM + 64 GB of RAM + unlimited swap on a quad NVMe RAID.