GPTQ-for-LLaMa
What would be required to quantize a 65B model to 2-bit?
Presumably more than 130 GB of RAM, since that is roughly the size of the fp16 weights alone? How much would it slow things down to run with a swap file? Anything else? Since GPTQ gets its best results on larger models, this seems worth looking into: it would be incredible to get almost the full performance of the 65B model in only 16 GB of VRAM.
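For the raw weight storage the arithmetic is simple. A rough sketch (this is only the weight size; the quantization process itself needs extra working memory for per-layer Hessian statistics and calibration activations):

```python
# Rough weight-storage estimates for LLaMA-65B at various precisions.
# Only counts the raw weights, not GPTQ's working memory.

PARAMS = 65e9  # approximate parameter count of LLaMA-65B

def weight_gb(bits_per_weight: float) -> float:
    """Raw weight storage in GB (decimal) at the given precision."""
    return PARAMS * bits_per_weight / 8 / 1e9

for bits in (16, 4, 2):
    print(f"{bits:>2}-bit weights: {weight_gb(bits):6.1f} GB")
# 16-bit: 130.0 GB, 4-bit: 32.5 GB, 2-bit: 16.2 GB
```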
I failed to quantize a model to 2-bit. I tried 33B, but the output always came out broken. Is it possible that 33B is too small and only 65B works at 2-bit?
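For anyone who wants to reproduce the attempt, something along these lines should start a 2-bit run. This is a sketch based on the README's 4-bit usage with `--wbits` swapped to 2; the model path and output filename are placeholders, and flags may differ between repo revisions:

```python
# Hypothetical driver for a 2-bit GPTQ run with this repo's llama.py.
# Mirrors the README's 4-bit example; paths/names are placeholders.
import subprocess

subprocess.run(
    [
        "python", "llama.py",
        "/path/to/llama-33b-hf",   # placeholder: HF-format model dir
        "c4",                      # calibration dataset
        "--wbits", "2",            # 2-bit weights
        "--groupsize", "128",      # per-group quantization scales
        "--save", "llama33b-2bit-128g.pt",  # placeholder output name
    ],
    check=True,
)
```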
Don't bother: the results from 2-bit 65B are really bad. If you're still interested anyway, quantizing it took overnight with 24 GB of VRAM + 64 GB of RAM + unlimited swap on a quad NVMe RAID.