GPTQ-for-LLaMa
GPTQ+flexgen, is it possible?
Trying to get LLaMa 30B 4-bit quantized to run with 12 GB of VRAM and I'm hitting OOM, since the model is a bit more than 16 GB. Is it possible to use offloading to load a percentage of the model onto the CPU using GPTQ?
You can offload it using the following command, but it is very slow.
python llama_inference_offload.py /output/path --wbits 4 --load llama7b-4bit.pt --text "this is llama" --pre_layer 16
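For illustration, here is a minimal sketch of what layer-wise CPU offloading looks like, assuming --pre_layer N keeps the first N decoder layers on the GPU and leaves the rest on the CPU. This is a simplified stand-in, not the actual llama_inference_offload.py code, and it assumes the layer list and device placement were already set up at load time:

```python
import torch

@torch.no_grad()
def split_forward(layers, hidden_states, pre_layer):
    """Run a stack of decoder layers split between GPU and CPU.

    Assumes layers[:pre_layer] were moved to the GPU and the remaining
    layers were left on the CPU when the model was loaded. Each layer is
    treated as a plain tensor-in/tensor-out module for simplicity.
    """
    for i, layer in enumerate(layers):
        # Move the activations to whichever device holds this layer's weights.
        device = "cuda" if i < pre_layer else "cpu"
        hidden_states = hidden_states.to(device)
        hidden_states = layer(hidden_states)
    return hidden_states
```

Every token has to pass through the CPU-resident layers, which is why generation slows down sharply as --pre_layer gets smaller.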
Yes, I just saw your comment on https://github.com/oobabooga/text-generation-webui/issues/177. Do you think there is any current technique to make it faster? Can this be combined with FlexGen?
@ye7iaserag Yes, please! Can you change the title of this to "GPTQ+flexgen"? It is something to wonder about. In the future maybe we could run 1.4T-parameter models in 4-bit with FlexGen, or could we already run the OPT-175B model using FlexGen with only 4 GB of VRAM and 50 GB of RAM in GPTQ 4-bit mode?
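As a rough sanity check on those numbers, the weight footprint alone at 4 bits per parameter (ignoring activations, the KV cache, and quantization metadata such as scales and zero points) works out to roughly:

```python
def weights_gib(n_params, bits=4):
    # Weight storage only: n_params * bits / 8 bytes, converted to GiB.
    return n_params * bits / 8 / 1024**3

print(f"LLaMA-30B @ 4-bit ≈ {weights_gib(30e9):.1f} GiB")   # ~14 GiB
print(f"OPT-175B  @ 4-bit ≈ {weights_gib(175e9):.1f} GiB")  # ~81.5 GiB
```

By that estimate, 4 GB of VRAM plus 50 GB of RAM would not hold the OPT-175B weights even at 4-bit, so FlexGen-style disk offloading would also be needed.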
Sure, I can, but I do not think that's possible... I think it has to be one of the two.
It may be possible, but there is no plan to support it at the moment.