text-generation-webui
4bit LLaMA-30B: Out of memory
Dear all, I'm running LLaMA-30B in 4-bit on my RTX 4090 (24 GB) + Ryzen 7700X with 64 GB of RAM.
After generating some tokens when I ask it to produce code, I get out-of-memory errors.
Using --gpu-memory has no effect.
Server command line:
python server.py --auto-devices --gpu-memory 20 --load-in-4bit --cai-chat --listen --extensions gallery llama_prompts --model llama-30b-4bit
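For context on what --gpu-memory is expected to do: with the Transformers/Accelerate stack, a per-device cap is normally expressed as a max_memory map at load time. A minimal sketch, assuming a plain from_pretrained load path (the webui's actual 4-bit loader may wire this up differently, and the paths below are placeholders):

# Hypothetical sketch of how a "--gpu-memory 20"-style cap is normally expressed
# with Transformers/Accelerate; the webui's GPTQ 4-bit load path may differ.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "models/llama-30b",                       # placeholder model path
    device_map="auto",                        # let Accelerate place layers across devices
    max_memory={0: "20GiB", "cpu": "64GiB"},  # cap GPU 0 at ~20 GiB, spill the rest to RAM
)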
Using a character and preset found on Reddit:
{
"char_name": "LLaMA-Precise",
"char_persona": "LLaMA-Precise is a helpful AI chatbot that always provides useful and detailed answers to User's requests and questions. LLaMA-Precise tries to be as informative and friendly as possible.",
"char_greeting": "Hello! I am LLaMA-Precise, your informative assistant. How may I help you today?",
"world_scenario": "",
"example_dialogue": "{{user}}: Hi. Can you help me with something?\n{{char}}: Hello, this is LLaMA-Precise. How can I help?\n{{user}}: Have you heard of the latest nuclear fusion experiment from South Korea? I heard their experiment got hotter than the sun.\n{{char}}: Yes, I have heard about the experiment. Scientists in South Korea have managed to sustain a nuclear fusion reaction running at temperatures in excess of 100 million°C for 30 seconds for the first time and have finally been able to achieve a net energy gain when carrying out a nuclear fusion experiment. That's nearly seven times hotter than the core of the Sun, which has a temperature of 15 million degrees kelvins! That's exciting!\n{{user}}: Wow! That's super interesting to know. Change of topic, I plan to change to the iPhone 14 this year.\n{{char}}: I see. What makes you want to change to iPhone 14?\n{{user}}: My phone right now is too old, so I want to upgrade.\n{{char}}: That's always a good reason to upgrade. You should be able to save money by trading in your old phone for credit. I hope you enjoy your new phone when you upgrade."
}
temperature=0.7
repetition_penalty=1.1764705882352942
top_k=40
top_p=0.1
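For reference, these preset values map directly onto sampling arguments of the generate() call visible in the traceback below; a minimal sketch, where input_ids and the 200-token limit are placeholders:

# Hypothetical illustration of how the preset maps onto generate() kwargs.
# shared.model and input_ids come from the webui's own modules (see the traceback below).
output_ids = shared.model.generate(
    input_ids,                              # tokenized prompt (placeholder tensor)
    do_sample=True,
    temperature=0.7,
    top_k=40,
    top_p=0.1,
    repetition_penalty=1.1764705882352942,
    max_new_tokens=200,                     # placeholder generation length
)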
I get this error:
Loading the extension "gallery"... Ok.
Loading the extension "llama_prompts"... Ok.
Loading llama-30b-4bit...
Loading model ...
Done.
Loaded the model in 6.55 seconds.
Running on local URL: http://0.0.0.0:7860
To create a public link, set `share=True` in `launch()`.
Output generated in 20.14 seconds (6.80 tokens/s, 137 tokens)
Output generated in 15.50 seconds (4.39 tokens/s, 68 tokens)
Output generated in 15.71 seconds (4.33 tokens/s, 68 tokens)
Exception in thread Thread-6 (gentask):
Traceback (most recent call last):
File "/home/alex/oobabooga/installer_files/env/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
self.run()
File "/home/alex/oobabooga/installer_files/env/lib/python3.10/threading.py", line 953, in run
self._target(*self._args, **self._kwargs)
File "/home/alex/oobabooga/text-generation-webui/modules/callbacks.py", line 64, in gentask
ret = self.mfunc(callback=_callback, **self.kwargs)
File "/home/alex/oobabooga/text-generation-webui/modules/text_generation.py", line 191, in generate_with_callback
shared.model.generate(**kwargs)
File "/home/alex/oobabooga/installer_files/env/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/home/alex/oobabooga/installer_files/env/lib/python3.10/site-packages/transformers/generation/utils.py", line 1452, in generate
return self.sample(
File "/home/alex/oobabooga/installer_files/env/lib/python3.10/site-packages/transformers/generation/utils.py", line 2468, in sample
outputs = self(
File "/home/alex/oobabooga/installer_files/env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/alex/oobabooga/installer_files/env/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/home/alex/oobabooga/installer_files/env/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 772, in forward
outputs = self.model(
File "/home/alex/oobabooga/installer_files/env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/alex/oobabooga/installer_files/env/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/home/alex/oobabooga/installer_files/env/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 621, in forward
layer_outputs = decoder_layer(
File "/home/alex/oobabooga/installer_files/env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/alex/oobabooga/installer_files/env/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/home/alex/oobabooga/installer_files/env/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 318, in forward
hidden_states, self_attn_weights, present_key_value = self.self_attn(
File "/home/alex/oobabooga/installer_files/env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/alex/oobabooga/installer_files/env/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/home/alex/oobabooga/installer_files/env/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 233, in forward
key_states = torch.cat([past_key_value[0], key_states], dim=2)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 26.00 MiB (GPU 0; 23.64 GiB total capacity; 21.75 GiB already allocated; 25.50 MiB free; 22.43 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
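The allocator hint at the end of that message is just an environment variable; a minimal sketch of applying it, assuming it runs before torch initializes CUDA (the 128 MiB value is an arbitrary example, not a recommendation):

# Hypothetical: apply the max_split_size_mb hint from the OOM message.
# Must run before torch touches CUDA, e.g. at the very top of server.py,
# or be exported in the shell before launching.
import os
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:128")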
Any help is appreciated. Thank you!
Are you on Windows or Linux?
Linux: Ubuntu 22.10 with a local Miniconda environment.
This issue seems to be related to what you're experiencing. https://github.com/oobabooga/text-generation-webui/issues/256
It seems that --gpu-memory is bugged. I've also been having issues with --auto-devices. There might also be a memory leak somewhere.
I should feasibly be able to run the 13B model on my 1060 6 GB with --auto-devices enabled, but I haven't had any luck with it.
I'm now working around it by lowering "Maximum prompt size in tokens" to 1024; right now I'm using 512.
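For anyone who wants the same cap outside the UI, truncating the prompt to a fixed token budget is straightforward; a minimal sketch, with the tokenizer path and the 512-token budget as placeholders:

# Hypothetical sketch of capping the prompt at a fixed token budget (here 512),
# which is roughly what lowering "Maximum prompt size in tokens" does.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("models/llama-30b")  # placeholder path
prompt = "..."                                                 # full chat history / prompt text
ids = tokenizer(prompt, return_tensors="pt").input_ids
ids = ids[:, -512:]                                            # keep only the last 512 tokens of context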
@alexl83
Have you watched the behavior live as it is processing, either through the NVIDIA X Server settings panel or nvidia-smi (you have to spam it, but it still works)? If you did, you would have more information about the behavior as it happens. Just a suggestion, but it could lead to more specific answers.
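A small polling loop saves the nvidia-smi spamming; a minimal sketch using PyTorch's own allocator counters (it has to run inside the webui process, e.g. from an extension or a debug print, since a separate script only sees its own allocations; a single GPU at index 0 is assumed):

# Hypothetical helper: print allocated/reserved VRAM once per second.
import time
import torch

total = torch.cuda.get_device_properties(0).total_memory
while True:
    alloc = torch.cuda.memory_allocated(0)
    reserved = torch.cuda.memory_reserved(0)
    print(f"allocated {alloc / 2**30:.2f} GiB | "
          f"reserved {reserved / 2**30:.2f} GiB | "
          f"total {total / 2**30:.2f} GiB")
    time.sleep(1)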
@ImpossibleExchange - the model consistently takes around 17.7 GB of VRAM, regardless of any command-line option. On top of that, on first launch it's loaded into RAM (around 33 GB) and then moved to VRAM; killing the model and reloading it seems to skip the RAM step.
@alexl83
Okay, I just spun up the 4-bit model and ran some text through it. For reference, I am running on 2x Ampere cards for 48 GB of total VRAM.
What I found out by sitting and spamming nvidia-smi is that I was getting around 22-23 GB of total used VRAM while the text was being generated. It would drop back down after it finished, but it was hitting numbers right around where you were getting your "out of memory" error.
So I would assume that is perhaps "normal" behavior/usage for the time being. I also had the generation length set to 200 tokens, not higher. This leads me to assume that with a higher token threshold for generation, you could be going higher still.
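For a rough sense of why VRAM climbs as text is generated: every cached token adds a key and a value vector per layer. A back-of-the-envelope sketch, assuming the published LLaMA-30B shape (60 layers, hidden size 6656) and fp16 cache entries:

# Hypothetical back-of-the-envelope for KV-cache growth on LLaMA-30B.
n_layers, hidden, fp16_bytes = 60, 6656, 2
per_token = 2 * n_layers * hidden * fp16_bytes   # one key + one value vector per layer
print(per_token / 2**20)                         # ~1.5 MiB of cache per token
print(2048 * per_token / 2**30)                  # ~3.0 GiB at a full 2048-token context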
I don't really know how much "help" this is, but I can confirm the VRAM usage seems to be normal. I wasn't getting the out-of-memory message only because I have 2x cards.
Trying a smaller model is likely the best suggestion I can give, sadly, as I didn't see anything different on my box (Xubuntu OS).
Peace and all the best.
Thanks @ImpossibleExchange, I appreciate your support investigating this :) Let's see, things are moving fast!
@alexl83
Also, I just ran the 30B on a single card and, yeah, got an out-of-memory error. So I guess that is that.
Sorry about the slow response; I was battling to get Clear Linux working so I could try it out.
Are you able to generate at all, or are you crashing?
I was getting similar usage at the beginning, but VRAM usage would spike during the generation of outputs. This was my experience on both Manjaro and Xubuntu.
@alexl83
Did you try the --auto-devices start argument? If the other flags aren't helping you, this one at least got rid of the OOM errors for me.
I'm seeing the same issue with llama-30b-4bit-128g, and it seems to be worse than with the older 4-bit .pt models, so perhaps there was some recent change (from the batch that added .safetensors support) that causes increased VRAM use?
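A rough back-of-the-envelope for why a 128g checkpoint could sit noticeably higher in VRAM: group-wise quantization stores extra scale/zero parameters for every group of 128 weights, and on roughly 32.5B parameters that is always-resident data eating into the headroom left for the KV cache. A sketch, with the parameter count and fp16 scales as assumptions (the exact packing of scales and zeros varies):

# Hypothetical, very rough estimate of the extra always-resident data a 128g model carries.
params = 32.5e9                # approximate LLaMA-30B parameter count (assumption)
groups = params / 128          # one quantization group per 128 weights
extra_bytes = groups * 2       # one fp16 scale per group; zero-point packing ignored
print(extra_bytes / 2**30)     # roughly 0.5 GiB more than an ungrouped quantization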
Okay, so this comes down to group size. If you don't want to constantly run out of VRAM with llama-30b on 24 GB, make sure that you use a model quantized with groupsize=-1 (no grouping) rather than groupsize=128. E.g. one of these: https://github.com/oobabooga/text-generation-webui/pull/530#issuecomment-1483891617
This issue has been closed due to inactivity for 6 weeks. If you believe it is still relevant, please leave a comment below. You can tag a developer in your comment.