
Running on CPU


Is it possible to run quantization on the CPU? Or quantize layer-by-layer without loading the whole model into VRAM?

I want to quantize a large model, but it does not fit in VRAM.

mayaeary avatar Mar 21 '23 18:03 mayaeary

You have two options: run it on the CPU using llama.cpp, or use offloading:

python llama_inference_offload.py /output/path --wbits 4 --groupsize 128 --load llama7b-4bit-128g.pt --text "this is llama" --pre_layer 16
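For reference, --pre_layer works by keeping only the first N transformer blocks on the GPU and running the rest on the CPU. A minimal sketch of that idea in plain PyTorch follows; it is not the repo's llama_inference_offload.py code, and the Hugging Face LLaMA module names and the device split are illustrative assumptions:

import torch

def split_model(model, pre_layer: int):
    # Keep the first `pre_layer` decoder blocks on the GPU,
    # leave the remaining blocks and the LM head on the CPU.
    model.model.embed_tokens.to("cuda")
    for i, layer in enumerate(model.model.layers):
        layer.to("cuda" if i < pre_layer else "cpu")
    model.model.norm.to("cpu")
    model.lm_head.to("cpu")
    return model

During generation the hidden states are copied between devices at the split point, which is why --pre_layer inference is slower than an all-GPU run but lets larger models fit.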

qwopqwop200 avatar Mar 22 '23 00:03 qwopqwop200

I don't want to run it; I want to quantize the model, i.e. convert it to 4-bit.

mayaeary avatar Mar 22 '23 07:03 mayaeary

Currently I don't support CPU quantization, and the current method is already layer-by-layer quantization. To quantize LLaMA-65B you must have at least 24GB of VRAM.
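For context, "layer by layer" means each transformer block is moved to the GPU, quantized against calibration activations, and moved back to the CPU, so only one block plus its quantization statistics has to sit in VRAM at a time. A rough sketch of that loop, with a hypothetical quantize_block helper standing in for the actual GPTQ step in this repo:

import torch

@torch.no_grad()
def quantize_layer_by_layer(layers, calib_inputs, quantize_block):
    # `quantize_block(layer, inputs)` is a stand-in for the per-layer GPTQ step;
    # each block is assumed to map hidden states to hidden states (real HF LLaMA
    # blocks also take an attention mask and return a tuple).
    hidden = calib_inputs.to("cuda")
    for layer in layers:
        layer.to("cuda")               # only this block's weights occupy VRAM
        quantize_block(layer, hidden)  # GPTQ step: weights + Hessian stats on GPU
        hidden = layer(hidden)         # activations feeding the next block
        layer.to("cpu")                # free VRAM before the next block
    return layers

Even with this scheme, a single 65B block, its Hessian statistics, and the calibration activations are what drive the ~24GB VRAM figure quoted above.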

qwopqwop200 avatar Mar 22 '23 08:03 qwopqwop200

I hope there will soon be a fix for this, to avoid CUDA OOM errors during quantization, as I'm having issues too. Shouldn't the --auto-devices and --disk options work for this kind of quantization as well? Sorry if it looks like a really dumb question 🥶 just thinking how awesome it would be if this gets solved, even if each quantization run takes an eternity :)

Highlyhotgames avatar Mar 25 '23 11:03 Highlyhotgames

At least for now, there is no plan to support quantization with offloading to CPU or disk.

qwopqwop200 avatar Apr 02 '23 03:04 qwopqwop200