GPTQ-for-LLaMa
GPTQ+flexgen, is it possible?
Trying to get LLaMa 30B 4-bit quantized to run with 12 GB of VRAM and I'm hitting OOM, since the model is a bit more than 16 GB. Is it possible to use offloading to load a percentage of the model onto the CPU using GPTQ?
You can offload it using the following command, but it is very slow.
python llama_inference_offload.py /output/path --wbits 4 --load llama7b-4bit.pt --text "this is llama" --pre_layer 16
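For illustration, here is a minimal sketch of what layer-wise CPU offloading looks like, assuming --pre_layer N keeps the first N decoder layers on the GPU and leaves the rest on the CPU. This is a simplified stand-in, not the actual llama_inference_offload.py code, and it assumes the layer list and device placement were already set up at load time:

```python
import torch

@torch.no_grad()
def split_forward(layers, hidden_states, pre_layer):
    """Run a stack of decoder layers split between GPU and CPU.

    Assumes layers[:pre_layer] were moved to the GPU and the remaining
    layers were left on the CPU when the model was loaded. Each layer is
    treated as a plain tensor-in/tensor-out module for simplicity.
    """
    for i, layer in enumerate(layers):
        # Move the activations to whichever device holds this layer's weights.
        device = "cuda" if i < pre_layer else "cpu"
        hidden_states = hidden_states.to(device)
        hidden_states = layer(hidden_states)
    return hidden_states
```

Every token has to pass through the CPU-resident layers, which is why generation slows down sharply as --pre_layer gets smaller.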
Yes, I just saw your comment on https://github.com/oobabooga/text-generation-webui/issues/177. Do you think there is any current technique to make it faster? Can this be combined with FlexGen?
@ye7iaserag Yes, please! Can you change the title of this to "GPTQ+flexgen"? It is something to wonder about. In the future maybe we could run 1.4T-parameter models in 4-bit with FlexGen, or could we already run the OPT-175B model using FlexGen with only 4 GB of VRAM and 50 GB of RAM in GPTQ 4-bit mode?
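As a rough sanity check on those numbers, the weight footprint alone at 4 bits per parameter (ignoring activations, the KV cache, and quantization metadata such as scales and zero points) works out to roughly:

```python
def weights_gib(n_params, bits=4):
    # Weight storage only: n_params * bits / 8 bytes, converted to GiB.
    return n_params * bits / 8 / 1024**3

print(f"LLaMA-30B @ 4-bit ≈ {weights_gib(30e9):.1f} GiB")   # ~14 GiB
print(f"OPT-175B  @ 4-bit ≈ {weights_gib(175e9):.1f} GiB")  # ~81.5 GiB
```

By that estimate, 4 GB of VRAM plus 50 GB of RAM would not hold the OPT-175B weights even at 4-bit, so FlexGen-style disk offloading would also be needed.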
Sure, I can, but I do not think that's possible... I think it has to be one of the two.
It may be possible, but there is no plan to support it at the moment.