text-generation-webui
4bit LLaMA-30B: Out of memory
Dear all, I'm running LLaMA-30B in 4-bit on my RTX 4090 (24 GB) + Ryzen 7700X with 64 GB of RAM.
After generating some tokens when I ask it to produce code, I get out-of-memory errors.
Using --gpu-memory has no effect.
Server command line:
python server.py --auto-devices --gpu-memory 20 --load-in-4bit --cai-chat --listen --extensions gallery llama_prompts --model llama-30b-4bit
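For context on what --gpu-memory is expected to do: with the Transformers/Accelerate stack, a per-device cap is normally expressed as a max_memory map at load time. A minimal sketch, assuming a plain from_pretrained load path (the webui's actual 4-bit loader may wire this up differently, and the paths below are placeholders):

# Hypothetical sketch of how a "--gpu-memory 20"-style cap is normally expressed
# with Transformers/Accelerate; the webui's GPTQ 4-bit load path may differ.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "models/llama-30b",                       # placeholder model path
    device_map="auto",                        # let Accelerate place layers across devices
    max_memory={0: "20GiB", "cpu": "64GiB"},  # cap GPU 0 at ~20 GiB, spill the rest to RAM
)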
Using a character and preset found on Reddit:
{
"char_name": "LLaMA-Precise",
"char_persona": "LLaMA-Precise is a helpful AI chatbot that always provides useful and detailed answers to User's requests and questions. LLaMA-Precise tries to be as informative and friendly as possible.",
"char_greeting": "Hello! I am LLaMA-Precise, your informative assistant. How may I help you today?",
"world_scenario": "",
"example_dialogue": "{{user}}: Hi. Can you help me with something?\n{{char}}: Hello, this is LLaMA-Precise. How can I help?\n{{user}}: Have you heard of the latest nuclear fusion experiment from South Korea? I heard their experiment got hotter than the sun.\n{{char}}: Yes, I have heard about the experiment. Scientists in South Korea have managed to sustain a nuclear fusion reaction running at temperatures in excess of 100 million°C for 30 seconds for the first time and have finally been able to achieve a net energy gain when carrying out a nuclear fusion experiment. That's nearly seven times hotter than the core of the Sun, which has a temperature of 15 million degrees kelvins! That's exciting!\n{{user}}: Wow! That's super interesting to know. Change of topic, I plan to change to the iPhone 14 this year.\n{{char}}: I see. What makes you want to change to iPhone 14?\n{{user}}: My phone right now is too old, so I want to upgrade.\n{{char}}: That's always a good reason to upgrade. You should be able to save money by trading in your old phone for credit. I hope you enjoy your new phone when you upgrade."
}
temperature=0.7
repetition_penalty=1.1764705882352942
top_k=40
top_p=0.1
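For reference, these preset values map directly onto sampling arguments of the generate() call visible in the traceback below; a minimal sketch, where input_ids and the 200-token limit are placeholders:

# Hypothetical illustration of how the preset maps onto generate() kwargs.
# shared.model and input_ids come from the webui's own modules (see the traceback below).
output_ids = shared.model.generate(
    input_ids,                              # tokenized prompt (placeholder tensor)
    do_sample=True,
    temperature=0.7,
    top_k=40,
    top_p=0.1,
    repetition_penalty=1.1764705882352942,
    max_new_tokens=200,                     # placeholder generation length
)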
I get this error:
Loading the extension "gallery"... Ok.
Loading the extension "llama_prompts"... Ok.
Loading llama-30b-4bit...
Loading model ...
Done.
Loaded the model in 6.55 seconds.
Running on local URL: http://0.0.0.0:7860
To create a public link, set `share=True` in `launch()`.
Output generated in 20.14 seconds (6.80 tokens/s, 137 tokens)
Output generated in 15.50 seconds (4.39 tokens/s, 68 tokens)
Output generated in 15.71 seconds (4.33 tokens/s, 68 tokens)
Exception in thread Thread-6 (gentask):
Traceback (most recent call last):
File "/home/alex/oobabooga/installer_files/env/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
self.run()
File "/home/alex/oobabooga/installer_files/env/lib/python3.10/threading.py", line 953, in run
self._target(*self._args, **self._kwargs)
File "/home/alex/oobabooga/text-generation-webui/modules/callbacks.py", line 64, in gentask
ret = self.mfunc(callback=_callback, **self.kwargs)
File "/home/alex/oobabooga/text-generation-webui/modules/text_generation.py", line 191, in generate_with_callback
shared.model.generate(**kwargs)
File "/home/alex/oobabooga/installer_files/env/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/home/alex/oobabooga/installer_files/env/lib/python3.10/site-packages/transformers/generation/utils.py", line 1452, in generate
return self.sample(
File "/home/alex/oobabooga/installer_files/env/lib/python3.10/site-packages/transformers/generation/utils.py", line 2468, in sample
outputs = self(
File "/home/alex/oobabooga/installer_files/env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/alex/oobabooga/installer_files/env/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/home/alex/oobabooga/installer_files/env/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 772, in forward
outputs = self.model(
File "/home/alex/oobabooga/installer_files/env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/alex/oobabooga/installer_files/env/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/home/alex/oobabooga/installer_files/env/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 621, in forward
layer_outputs = decoder_layer(
File "/home/alex/oobabooga/installer_files/env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/alex/oobabooga/installer_files/env/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/home/alex/oobabooga/installer_files/env/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 318, in forward
hidden_states, self_attn_weights, present_key_value = self.self_attn(
File "/home/alex/oobabooga/installer_files/env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/alex/oobabooga/installer_files/env/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/home/alex/oobabooga/installer_files/env/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 233, in forward
key_states = torch.cat([past_key_value[0], key_states], dim=2)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 26.00 MiB (GPU 0; 23.64 GiB total capacity; 21.75 GiB already allocated; 25.50 MiB free; 22.43 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
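The allocator hint at the end of that message is just an environment variable; a minimal sketch of applying it, assuming it runs before torch initializes CUDA (the 128 MiB value is an arbitrary example, not a recommendation):

# Hypothetical: apply the max_split_size_mb hint from the OOM message.
# Must run before torch touches CUDA, e.g. at the very top of server.py,
# or be exported in the shell before launching.
import os
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:128")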
Any help is appreciated. Thank you!
Are you on Windows or Linux?
Linux: Ubuntu 22.10 with a local Miniconda environment.
This issue seems to be related to what you're experiencing. https://github.com/oobabooga/text-generation-webui/issues/256
It seems that --gpu-memory is bugged. I've also been having issues with --auto-devices. There might also be a memory leak somewhere.
I should feasibly be able to run the 13B model on my 1060 6 GB with --auto-devices enabled, but I haven't had any luck with it.
I'm now working around it by lowering "Maximum prompt size in tokens" to 1024; right now I'm using 512.
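For anyone who wants the same cap outside the UI, truncating the prompt to a fixed token budget is straightforward; a minimal sketch, with the tokenizer path and the 512-token budget as placeholders:

# Hypothetical sketch of capping the prompt at a fixed token budget (here 512),
# which is roughly what lowering "Maximum prompt size in tokens" does.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("models/llama-30b")  # placeholder path
prompt = "..."                                                 # full chat history / prompt text
ids = tokenizer(prompt, return_tensors="pt").input_ids
ids = ids[:, -512:]                                            # keep only the last 512 tokens of context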
@alexl83
Have you watched the behavior live as it is processing, either through the NVIDIA X Server settings panel or nvidia-smi (you have to spam it, but it still works)? If you did, you would have more information about the behavior as it happens. Just a suggestion, but it could lead to more specific answers.
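A small polling loop saves the nvidia-smi spamming; a minimal sketch using PyTorch's own allocator counters (it has to run inside the webui process, e.g. from an extension or a debug print, since a separate script only sees its own allocations; a single GPU at index 0 is assumed):

# Hypothetical helper: print allocated/reserved VRAM once per second.
import time
import torch

total = torch.cuda.get_device_properties(0).total_memory
while True:
    alloc = torch.cuda.memory_allocated(0)
    reserved = torch.cuda.memory_reserved(0)
    print(f"allocated {alloc / 2**30:.2f} GiB | "
          f"reserved {reserved / 2**30:.2f} GiB | "
          f"total {total / 2**30:.2f} GiB")
    time.sleep(1)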
@ImpossibleExchange - the model consistently takes around 17.7 GB of VRAM, regardless of any command-line option. On top of that, on first launch it's loaded into RAM (around 33 GB) and then moved to VRAM; killing the model and reloading it seems to skip the RAM step.
@alexl83
Okay, I just spun up the 4-bit model and ran some text through it. For reference, I am running on 2x Ampere cards for 48 GB of total VRAM.
What I found out by sitting and spamming nvidia-smi is that I was getting around 22-23 GB of total used VRAM while the text was being generated. It would drop back down after it finished, but it was hitting numbers right around where you were getting your "out of memory" error.
So I would assume that is perhaps "normal" behavior/usage for the time being. I also had the generation length set to 200 tokens, not higher. This leads me to assume that with a higher token threshold for generation, you could be going higher still.
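For a rough sense of why VRAM climbs as text is generated: every cached token adds a key and a value vector per layer. A back-of-the-envelope sketch, assuming the published LLaMA-30B shape (60 layers, hidden size 6656) and fp16 cache entries:

# Hypothetical back-of-the-envelope for KV-cache growth on LLaMA-30B.
n_layers, hidden, fp16_bytes = 60, 6656, 2
per_token = 2 * n_layers * hidden * fp16_bytes   # one key + one value vector per layer
print(per_token / 2**20)                         # ~1.5 MiB of cache per token
print(2048 * per_token / 2**30)                  # ~3.0 GiB at a full 2048-token context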
I don't really know how much "help" this is, but I can confirm the VRAM usage seems to be normal. I wasn't getting the out-of-memory message only because I have 2x cards.
Trying a smaller model is likely the best suggestion I can give, sadly, as I didn't see anything different on my box (Xubuntu OS).
Peace and all the best.
Thanks @ImpossibleExchange, I appreciate your support investigating this :) Let's see, things are moving fast!
@alexl83
Also, I just ran the 30B on a single card and, yeah, got an out-of-memory error. So I guess that is that.
Sorry about the slow response; I was battling to get Clear Linux working so I could try it out.
Are you able to generate at all, or are you crashing?
I was getting similar usage at the beginning, but VRAM usage would spike during the generation of outputs. This was my experience on both Manjaro and Xubuntu.
@alexl83
Did you try the --auto-devices start argument? If the other flags aren't helping you, this one at least got rid of the OOM errors for me.
I'm seeing the same issue with llama-30b-4bit-128g, and it seems to be worse than with the older 4-bit .pt models, so perhaps there was some recent change (from the batch that added .safetensors support) that causes increased VRAM use?
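A rough back-of-the-envelope for why a 128g checkpoint could sit noticeably higher in VRAM: group-wise quantization stores extra scale/zero parameters for every group of 128 weights, and on roughly 32.5B parameters that is always-resident data eating into the headroom left for the KV cache. A sketch, with the parameter count and fp16 scales as assumptions (the exact packing of scales and zeros varies):

# Hypothetical, very rough estimate of the extra always-resident data a 128g model carries.
params = 32.5e9                # approximate LLaMA-30B parameter count (assumption)
groups = params / 128          # one quantization group per 128 weights
extra_bytes = groups * 2       # one fp16 scale per group; zero-point packing ignored
print(extra_bytes / 2**30)     # roughly 0.5 GiB more than an ungrouped quantization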
Okay, so this comes down to group size. If you don't want to constantly run out of VRAM with llama-30b on 24 GB, make sure that you use a model quantized with groupsize=-1 (no grouping) rather than groupsize=128. E.g. one of these: https://github.com/oobabooga/text-generation-webui/pull/530#issuecomment-1483891617
This issue has been closed due to inactivity for 6 weeks. If you believe it is still relevant, please leave a comment below. You can tag a developer in your comment.