text-generation-webui
server.py seems to ignore --gpu-memory when --load-in-4bit is specified as an option
> Can you try something like `--load-in-4bit --gpu-memory 6` to see if it works? `--auto-devices` has no effect for these 4-bit models.
https://github.com/oobabooga/text-generation-webui/blob/026d60bd3424b5426c5ef80632aa6b71fe12d4c5/modules/models.py#L90
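That line sits in the device-map construction. As a hedged sketch (reconstructed from the surrounding code at that link, not a verbatim copy of models.py at that commit), the flow from --gpu-memory to accelerate looks roughly like this:

# Hedged sketch, not a verbatim copy of models.py at that commit: --gpu-memory
# becomes an accelerate max_memory dict, which drives infer_auto_device_map.
import accelerate

def build_device_map(model, gpu_memory_gib, cpu_memory_gib=99):
    # Integer keys are CUDA device indices; the "cpu" entry caps system RAM.
    max_memory = {0: f"{gpu_memory_gib}GiB", "cpu": f"{cpu_memory_gib}GiB"}
    return accelerate.infer_auto_device_map(
        model,
        max_memory=max_memory,
        no_split_module_classes=["LLaMADecoderLayer"],
    )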
Experiencing the same problem as @Titaniumtown here
--load-in-4bit seems to ignore --gpu-memory for some reason.
Running what you suggested on an RTX 3060 12GB with 32GB RAM and plenty of swap causes it to load the model to RAM (prints Done. in terminal), then it fills up the VRAM until it OOMs.
30b-4bit, main branch, commit de7dd8b6aa3aa00ba629c9ba6ce1bc32bd213d2f
Originally posted by @David-337 in https://github.com/oobabooga/text-generation-webui/issues/177#issuecomment-1464889717
I am also experiencing this issue. I am on main branch commit 316e07f06a67751d047c2072d8296d05bfb6a1c9.
I believe I'm experiencing the same issue. Nvidia RTX 3060 Ti 8GB, 32GB RAM, attempting to load the 13B parameter 4-bit model. Loading model succeeds (prints Done. in terminal) and then I get OOM upon attempting to generate anything.
I can't even run 30B-4bit LLaMA on a 3090 Ti with 24GB VRAM + 32GB RAM, even though I can run 13B natively. It seems to ignore --disk and --cpu, just loading into VRAM and erroring out. 7B-4bit and 13B-4bit work great.
More details on my situation: As before, Nvidia RTX 3060 Ti 8GB, 32GB RAM, attempting to load the 13B parameter 4-bit model.
I start the server as follows:
python server.py --load-in-4bit --model llama-13b-hf --auto-devices --gpu-memory 6
and get the CUDA OOM error upon pressing the "Generate" button.
I added a print statement here:
device_map = accelerate.infer_auto_device_map(model, max_memory=max_memory, no_split_module_classes=["LLaMADecoderLayer"])
print(device_map)  # debug print: show which device each module was assigned to
model = accelerate.dispatch_model(model, device_map=device_map)
and got this:
{'model.embed_tokens': 0, 'model.layers.0': 0, 'model.layers.1': 0, 'model.layers.2': 0, 'model.layers.3': 0, 'model.layers.4': 0, 'model.layers.5': 0, 'model.layers.6': 0, 'model.layers.7': 0, 'model.layers.8': 0, 'model.layers.9': 0, 'model.layers.10': 0, 'model.layers.11': 0, 'model.layers.12': 0, 'model.layers.13': 0, 'model.layers.14': 0, 'model.layers.15': 0, 'model.layers.16': 0, 'model.layers.17': 0, 'model.layers.18': 0, 'model.layers.19': 0, 'model.layers.20': 0, 'model.layers.21': 0, 'model.layers.22': 0, 'model.layers.23': 0, 'model.layers.24': 0, 'model.layers.25': 0, 'model.layers.26': 0, 'model.layers.27': 0, 'model.layers.28': 0, 'model.layers.29': 0, 'model.layers.30': 0, 'model.layers.31': 0, 'model.layers.32': 0, 'model.layers.33': 0, 'model.layers.34': 0, 'model.layers.35': 0, 'model.layers.36': 'cpu', 'model.layers.37': 'cpu', 'model.layers.38': 'cpu', 'model.layers.39': 'cpu', 'model.norm': 'cpu', 'lm_head': 'cpu'}
It looks like a reasonable device map is being generated, so does that mean the issue might be coming from the accelerate package where this gets passed to?
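One way to check whether the map is actually enforced is to look at where the parameters end up after dispatch_model. This is a hypothetical local diagnostic, not code from the repo:

# Hypothetical local diagnostic, not part of models.py: count parameter tensors
# per device after accelerate.dispatch_model to see if CPU offload really happened.
from collections import Counter

def summarize_devices(model):
    counts = Counter(str(p.device) for p in model.parameters())
    for device, n in sorted(counts.items()):
        print(f"{device}: {n} parameter tensors")

summarize_devices(model)  # run right after dispatch_model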
It's possible, @horenbergerb. The memory map is there; I don't know why it's not being enforced.
I checked with nvidia-smi, and the VRAM is being totally occupied (~7444MiB / 8192MiB) when the model loads regardless of whether I specify --gpu-memory 2 or no restrictions.
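Note that nvidia-smi also counts the CUDA context and PyTorch's caching allocator. As a cross-check from inside the process (a small sketch using standard torch calls):

# Sketch: torch's own view of GPU 0 memory, to compare against nvidia-smi,
# which additionally includes the CUDA context and cached-but-free blocks.
import torch

print(f"allocated: {torch.cuda.memory_allocated(0) / 1024**3:.2f} GiB")
print(f"reserved:  {torch.cuda.memory_reserved(0) / 1024**3:.2f} GiB")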
Best evidence I've found so far regarding the accelerate repo: https://github.com/huggingface/accelerate/issues/1157, which seems to indicate that there is not yet support for 4-bit inference. I only vaguely understand the meaning of this terminology, so it could be unrelated.
I have a 3070 with 8GB VRAM and 32GB RAM, and I am able to load the LLaMA 13B 4-bit model. In chat mode I can run a few messages back and forth, and then it stops working, seemingly due to running out of VRAM.
Try decreasing the "Maximum prompt size in tokens" parameter.
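The reason this helps: 4-bit quantization shrinks the weights, but the KV cache and activations created during generation are still fp16. A back-of-the-envelope estimate for LLaMA-13B (40 layers, hidden size 5120, assuming an fp16 cache):

# Rough KV-cache estimate for LLaMA-13B; the 4-bit quantization applies to the
# weights only, not to this cache, so it grows with the prompt/context length.
layers, hidden, bytes_per_value = 40, 5120, 2
per_token = 2 * layers * hidden * bytes_per_value  # K and V for every layer
for tokens in (512, 1024, 2048):
    print(f"{tokens} tokens -> {per_token * tokens / 1024**3:.2f} GiB of cache")

At 2048 tokens that comes to roughly 1.6 GiB on top of the weights, which on an 8GB card can be the difference between fitting and OOM.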
This also seems to be happening for me with the LLaMA 13B 4-bit model. I have two 10GB cards and I'm getting OUT OF MEMORY errors. nvidia-smi shows only one card being used.
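For what it's worth, the underlying accelerate call can take a per-GPU budget, so splitting across both cards should be possible in principle. This is a sketch of the max_memory shape accelerate expects, not a claim about how the webui's --gpu-memory flag handles multiple values:

# Sketch of an explicit two-GPU budget for accelerate (whether the webui CLI
# accepts one --gpu-memory value per card is an assumption; the accelerate API is not).
import accelerate

max_memory = {0: "9GiB", 1: "9GiB", "cpu": "30GiB"}  # leave headroom on each 10GB card
device_map = accelerate.infer_auto_device_map(
    model,  # assumes `model` is the loaded, not-yet-dispatched LLaMA module
    max_memory=max_memory,
    no_split_module_classes=["LLaMADecoderLayer"],
)
model = accelerate.dispatch_model(model, device_map=device_map)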
This is also happening for me on 12GB of VRAM with the LLaMA 30B 4-bit version, even if I specify very low VRAM usage like --gpu-memory 2.
The output on console also says
Loading llama-30b-hf...
Loading model ...
Done.
then it goes OOM after that. How can it go OOM after the model has been loaded?
Any plans on fixing?
I've also noticed that --cpu will still OOM on VRAM, which is illogical.
This issue has been closed due to inactivity for 30 days. If you believe it is still relevant, please leave a comment below.