
server.py seems to ignore --gpu-memory when --load-in-4bit is specified as an option

hpnyaggerman opened this issue 2 years ago • 12 comments

          > Can you try something like `--load-in-4bit --gpu-memory 6` to see if it works? `--auto-devices` has no effect for these 4-bit models.

https://github.com/oobabooga/text-generation-webui/blob/026d60bd3424b5426c5ef80632aa6b71fe12d4c5/modules/models.py#L90

Experiencing the same problem as @Titaniumtown here: --load-in-4bit seems to ignore --gpu-memory for some reason.

Running what you suggested on an RTX 3060 12GB with 32GB RAM and plenty of swap causes it to load the model to RAM (prints Done. in terminal), then it fills up the VRAM until it OOMs.

30b-4bit, main branch, commit de7dd8b6aa3aa00ba629c9ba6ce1bc32bd213d2f

Originally posted by @David-337 in https://github.com/oobabooga/text-generation-webui/issues/177#issuecomment-1464889717

I am also experiencing this issue. I am on main branch commit 316e07f06a67751d047c2072d8296d05bfb6a1c9.
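
For reference, a minimal sketch (my own reading of the intended behaviour, not the webui's actual code at the linked models.py line) of how a --gpu-memory value is supposed to end up as the max_memory dict that accelerate's infer_auto_device_map consumes:

    # Sketch only: how --gpu-memory 6 would typically become accelerate's max_memory.
    # Device index 0 is the first CUDA GPU; "cpu" caps spill-over into system RAM.
    def build_max_memory(gpu_memory_gib, cpu_memory="30GiB"):
        return {0: f"{gpu_memory_gib}GiB", "cpu": cpu_memory}

    max_memory = build_max_memory(6)  # -> {0: '6GiB', 'cpu': '30GiB'}

If that dict is being built correctly, the question becomes whether it is actually respected further down the line.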

hpnyaggerman avatar Mar 11 '23 23:03 hpnyaggerman

I believe I'm experiencing the same issue. Nvidia RTX 3060 Ti 8GB, 32GB RAM, attempting to load the 13B parameter 4-bit model. Loading the model succeeds (prints Done. in the terminal), and then I get an OOM upon attempting to generate anything.

horenbergerb avatar Mar 12 '23 04:03 horenbergerb

I can't even run 30B-4bit LLaMA with a 3090 Ti (24GB VRAM) + 32GB RAM, though I can run 13B natively. It seems to ignore --disk and --cpu, I think: it just loads into VRAM and errors out. 7B-4bit and 13B-4bit work great.

iChristGit avatar Mar 12 '23 12:03 iChristGit

More details on my situation: As before, Nvidia RTX 3060 Ti 8GB, 32GB RAM, attempting to load the 13B parameter 4-bit model.

I start the server as follows: `python server.py --load-in-4bit --model llama-13b-hf --auto-devices --gpu-memory 6` and get the CUDA OOM error upon pressing the "Generate" button.

I added a print statement here:

            device_map = accelerate.infer_auto_device_map(model, max_memory=max_memory, no_split_module_classes=["LLaMADecoderLayer"])
            print(device_map)
            model = accelerate.dispatch_model(model, device_map=device_map)

and got this:

{'model.embed_tokens': 0, 'model.layers.0': 0, 'model.layers.1': 0, 'model.layers.2': 0, 'model.layers.3': 0, 'model.layers.4': 0, 'model.layers.5': 0, 'model.layers.6': 0, 'model.layers.7': 0, 'model.layers.8': 0, 'model.layers.9': 0, 'model.layers.10': 0, 'model.layers.11': 0, 'model.layers.12': 0, 'model.layers.13': 0, 'model.layers.14': 0, 'model.layers.15': 0, 'model.layers.16': 0, 'model.layers.17': 0, 'model.layers.18': 0, 'model.layers.19': 0, 'model.layers.20': 0, 'model.layers.21': 0, 'model.layers.22': 0, 'model.layers.23': 0, 'model.layers.24': 0, 'model.layers.25': 0, 'model.layers.26': 0, 'model.layers.27': 0, 'model.layers.28': 0, 'model.layers.29': 0, 'model.layers.30': 0, 'model.layers.31': 0, 'model.layers.32': 0, 'model.layers.33': 0, 'model.layers.34': 0, 'model.layers.35': 0, 'model.layers.36': 'cpu', 'model.layers.37': 'cpu', 'model.layers.38': 'cpu', 'model.layers.39': 'cpu', 'model.norm': 'cpu', 'lm_head': 'cpu'}

It looks like a reasonable device map is being generated, so does that mean the issue might be coming from the accelerate package that this gets passed to?
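
One way to narrow that down (a sketch, assuming model is the object returned by dispatch_model above) would be to check which devices the weights actually ended up on, rather than trusting the printed map:

    # Sketch: count parameter tensors per device after dispatch_model().
    # If layers mapped to 'cpu' still report cuda:0 here, the map is not being honoured.
    from collections import Counter

    device_counts = Counter(str(p.device) for p in model.parameters())
    for device, count in device_counts.items():
        print(f"{device}: {count} parameter tensors")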

horenbergerb avatar Mar 12 '23 17:03 horenbergerb

It's possible, @horenbergerb. The memory map is there; I don't know why it is not being enforced.

oobabooga avatar Mar 12 '23 17:03 oobabooga

I checked with nvidia-smi, and the VRAM is almost entirely occupied (~7444MiB / 8192MiB) when the model loads, regardless of whether I specify --gpu-memory 2 or no restriction at all.
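
The same thing can be confirmed from inside Python right after loading (a quick sketch using only stock torch calls):

    # Sketch: what torch itself has allocated vs. reserved on GPU 0, in GiB.
    import torch

    allocated = torch.cuda.memory_allocated(0) / 1024**3
    reserved = torch.cuda.memory_reserved(0) / 1024**3
    print(f"allocated: {allocated:.2f} GiB, reserved: {reserved:.2f} GiB")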

horenbergerb avatar Mar 12 '23 18:03 horenbergerb

Best evidence I've found so far regarding the accelerate repo: https://github.com/huggingface/accelerate/issues/1157, which seems to indicate that there is no support for 4-bit inference yet. I only vaguely understand this terminology, so it could be unrelated.

horenbergerb avatar Mar 12 '23 18:03 horenbergerb

I have a 3070 with 8GB VRAM and 32GB RAM, and I am able to load the LLaMA 13B 4-bit model. In chat mode I can run a few messages back and forth, and then it stops working, seemingly due to running out of VRAM.

madmads11 avatar Mar 12 '23 23:03 madmads11

Try decreasing the "Maximum prompt size in tokens" parameter.
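
The intuition, as a rough sketch (the tokenizer call and the 1024 cap here are placeholders, not the webui's actual defaults): a shorter prompt means fewer positions in the KV cache and smaller attention buffers at generation time, so less VRAM is needed on top of the weights.

    # Sketch: cap the prompt so the sequence fed to generate() stays bounded.
    def truncate_prompt(tokenizer, prompt, max_prompt_tokens=1024):
        input_ids = tokenizer(prompt, return_tensors="pt").input_ids
        # Keep only the most recent tokens; older chat history is dropped first.
        return input_ids[:, -max_prompt_tokens:]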

oobabooga avatar Mar 13 '23 00:03 oobabooga

This also seems to be happening for me with the LLaMA 13B 4-bit model. I have two 10GB cards and I'm getting OUT OF MEMORY; nvidia-smi shows only one card being used.
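
For anyone debugging the multi-GPU case: at the accelerate level, both cards only get used if max_memory lists both devices. A sketch with made-up limits (leaving some headroom per card):

    # Sketch: a two-GPU memory budget for infer_auto_device_map().
    max_memory = {0: "9GiB", 1: "9GiB", "cpu": "30GiB"}
    # device_map = accelerate.infer_auto_device_map(
    #     model, max_memory=max_memory, no_split_module_classes=["LLaMADecoderLayer"])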

vukani-dev avatar Mar 17 '23 00:03 vukani-dev

This is also happening to me on 12GB of VRAM with the LLaMA 30B 4-bit version, even if I specify very low VRAM usage like --gpu-memory 2. The console output also says:

Loading llama-30b-hf...
Loading model ...
Done.

and then it goes OOM. How can it go OOM after the model has already been loaded?
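
For what it's worth, a back-of-the-envelope sketch (assuming an fp16 KV cache and roughly LLaMA-30B-sized dimensions, both assumptions on my part) of why generation can OOM even though loading printed Done.: the weights fit, but the KV cache allocated while generating does not.

    # Sketch: fp16 KV-cache size for ~LLaMA-30B (60 layers, hidden size 6656).
    layers, hidden, fp16_bytes = 60, 6656, 2
    per_token = 2 * layers * hidden * fp16_bytes          # K and V for every layer
    print(per_token * 2048 / 1024**3, "GiB for a 2048-token context")  # ~3.0 GiB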

ye7iaserag avatar Mar 17 '23 05:03 ye7iaserag

Any plans on fixing this?

hpnyaggerman avatar Mar 18 '23 17:03 hpnyaggerman

I've also noticed that --cpu will still OOM on VRAM, which is illogical.

ye7iaserag avatar Mar 18 '23 23:03 ye7iaserag

This issue has been closed due to inactivity for 30 days. If you believe it is still relevant, please leave a comment below.

github-actions[bot] avatar Apr 19 '23 16:04 github-actions[bot]