GPTQ-for-LLaMa
Running on CPU
Is it possible to run quantization on the CPU? Or quantize layer by layer without loading the whole model into VRAM?
I want to quantize a large model, but it doesn't fit in VRAM.
You can choose between two options:
Run it on the CPU using llama.cpp, or
Use offloading:
python llama_inference_offload.py /output/path --wbits 4 --groupsize 128 --load llama7b-4bit-128g.pt --text "this is llama" --pre_layer 16
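For what it's worth, here is a minimal PyTorch sketch of the idea behind a --pre_layer split. This is not the repo's actual code; ToyBlock and offloaded_forward are made up for illustration. The first N blocks stay on the GPU and the remaining blocks run on the CPU, with the hidden state moved between devices as it flows through the model.

import torch
import torch.nn as nn

class ToyBlock(nn.Module):
    """Stand-in for a transformer decoder block."""
    def __init__(self, dim):
        super().__init__()
        self.ff = nn.Linear(dim, dim)

    def forward(self, x):
        return torch.relu(self.ff(x))

def offloaded_forward(blocks, x, pre_layer):
    # First `pre_layer` blocks run on the GPU, the rest on the CPU;
    # the hidden state is moved to whichever device the block lives on.
    gpu = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    for i, block in enumerate(blocks):
        device = gpu if i < pre_layer else torch.device("cpu")
        block.to(device)  # a real implementation would set up placement once, not per call
        x = block(x.to(device))
    return x

if __name__ == "__main__":
    blocks = nn.ModuleList(ToyBlock(64) for _ in range(8))
    out = offloaded_forward(blocks, torch.randn(1, 64), pre_layer=4)
    print(out.shape)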
I don't want to run it, I want to quantize the model, i.e. convert it to 4-bit.
Currently I don't support CPU quantization, and the current method is already layer-by-layer quantization. To quantize LLaMA-65B you must have at least 24GB of VRAM.
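To make the trade-off concrete, here is a rough sketch of what layer-by-layer quantization looks like. This is not the repo's code; quantize_block below is a dummy stand-in for the actual GPTQ step. Only one block at a time is moved to the GPU, quantized, and moved back, so peak VRAM is roughly one block plus its calibration statistics rather than the whole model.

import torch
import torch.nn as nn

def quantize_block(block):
    # Placeholder "quantization": round weights to a 4-bit signed grid in-place.
    # Real GPTQ instead does a Hessian-based, column-by-column update using
    # calibration activations, which is where most of the extra VRAM goes.
    with torch.no_grad():
        for p in block.parameters():
            scale = p.abs().max() / 7 + 1e-8  # 4-bit signed range is roughly [-8, 7]
            p.copy_(torch.clamp((p / scale).round(), -8, 7) * scale)
    return block

def quantize_layer_by_layer(blocks):
    gpu = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    for block in blocks:
        block.to(gpu)          # only this block occupies VRAM
        quantize_block(block)
        block.to("cpu")        # move it back before touching the next block
    if gpu.type == "cuda":
        torch.cuda.empty_cache()

if __name__ == "__main__":
    blocks = nn.ModuleList(nn.Linear(64, 64) for _ in range(8))
    quantize_layer_by_layer(blocks)
    print("done")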
I hope there will soon be a fix for this, to avoid CUDA OOM errors during quantization, as I'm having issues too. Shouldn't the auto-devices and --disk options work for this kind of quantization as well? Sorry if it's a really dumb question 🥶 just thinking how awesome it would be if this got solved, even if each quantization run takes an eternity :)
At least for now, there is no plan to support quantization with offloading to CPU or disk.