text-generation-webui
CUDA out of memory - during a conversation
Describe the bug
Hi everyone,
So I had some issues at first starting the UI, but after searching here and reading the documentation I managed to make it work. I used the oobabooga-windows.zip from the Releases to install the UI and had to edit start-webui.bat.
I added --load-in-8bit, --wbits 4, and --groupsize 128, and changed --cai-chat to --chat.
I followed the Low VRAM guide:
call python server.py --load-in-8bit --chat --wbits 4 --groupsize 128 --auto-devices
I think that after adding the --wbits 4 --groupsize 128 parameters, --auto-devices is no longer taking effect and is not limiting the memory.
I can load both models I have with no issue and start any conversation (I'm using the default example character for testing), but after 10-12 responses/prompts in the chat I get CUDA out of memory.
If I delete several responses from the chat box and prompt again, it works until I reach the same number of prompts and run out of memory again.
When that happens, the reply just stops in the middle of a sentence.
I haven't seen any mention of this limitation in the documentation, so I apologize if this is by design, but it seems like a bug to me.
I'd appreciate any suggestions or advice on how to resolve this issue other than keeping my conversations short. Ideally I'd like to be able to keep conversations going indefinitely.
Is there an existing issue for this?
- [X] I have searched the existing issues
Reproduction
1. Load any model.
2. Start a conversation with any character from the gallery (I used the example one).
3. Reach over 10 responses from the AI.
Screenshot
Logs
Starting the web UI...
===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
CUDA SETUP: CUDA runtime path found: V:\oobabooga\installer_files\env\bin\cudart64_110.dll
CUDA SETUP: Highest compute capability among GPUs detected: 8.6
CUDA SETUP: Detected CUDA version 117
CUDA SETUP: Loading binary V:\oobabooga\installer_files\env\lib\site-packages\bitsandbytes\libbitsandbytes_cuda117.dll...
The following models are available:
1. gpt4-x-alpaca-13b-native-4bit-128g
2. vicuna-13b-GPTQ-4bit-128g
Which one do you want to load? 1-2
2
Loading vicuna-13b-GPTQ-4bit-128g...
Loading model ...
V:\oobabooga\installer_files\env\lib\site-packages\safetensors\torch.py:99: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
with safe_open(filename, framework="pt", device=device) as f:
V:\oobabooga\installer_files\env\lib\site-packages\torch\_utils.py:776: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
return self.fget.__get__(instance, owner)()
V:\oobabooga\installer_files\env\lib\site-packages\torch\storage.py:899: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
storage = cls(wrap_storage=untyped_storage)
Done.
Loaded the model in 14.24 seconds.
Loading the extension "gallery"... Ok.
Running on local URL: http://127.0.0.1:7860
To create a public link, set `share=True` in `launch()`.
Output generated in 13.12 seconds (5.79 tokens/s, 76 tokens, context 882)
Output generated in 11.45 seconds (6.90 tokens/s, 79 tokens, context 991)
Traceback (most recent call last):
File "V:\oobabooga\text-generation-webui\modules\callbacks.py", line 66, in gentask
ret = self.mfunc(callback=_callback, **self.kwargs)
File "V:\oobabooga\text-generation-webui\modules\text_generation.py", line 220, in generate_with_callback
shared.model.generate(**kwargs)
File "V:\oobabooga\installer_files\env\lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "V:\oobabooga\installer_files\env\lib\site-packages\transformers\generation\utils.py", line 1485, in generate
return self.sample(
File "V:\oobabooga\installer_files\env\lib\site-packages\transformers\generation\utils.py", line 2524, in sample
outputs = self(
File "V:\oobabooga\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "V:\oobabooga\installer_files\env\lib\site-packages\transformers\models\llama\modeling_llama.py", line 687, in forward
outputs = self.model(
File "V:\oobabooga\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "V:\oobabooga\installer_files\env\lib\site-packages\transformers\models\llama\modeling_llama.py", line 577, in forward
layer_outputs = decoder_layer(
File "V:\oobabooga\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "V:\oobabooga\installer_files\env\lib\site-packages\transformers\models\llama\modeling_llama.py", line 292, in forward
hidden_states, self_attn_weights, present_key_value = self.self_attn(
File "V:\oobabooga\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "V:\oobabooga\installer_files\env\lib\site-packages\transformers\models\llama\modeling_llama.py", line 210, in forward
value_states = torch.cat([past_key_value[1], value_states], dim=2)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 12.00 MiB (GPU 0; 10.00 GiB total capacity; 8.83 GiB already allocated; 0 bytes free; 9.29 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Output generated in 8.53 seconds (6.09 tokens/s, 52 tokens, context 1111)
System Info
Operating System: Windows 11 Pro 64-bit (10.0, Build 22621) (22621.ni_release.220506-1250)
Memory: 32GB RAM
NVIDIA GeForce RTX 3080
Memory: 10 GB
Same error for me, also running:
AMD Ryzen 9 7900X 12-Core
Memory: 32GB RAM
NVIDIA GeForce RTX 3080, Memory: 10 GB
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 14.00 MiB (GPU 0; 10.00 GiB total capacity; 8.77 GiB already allocated; 0 bytes free; 9.28 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Output generated in 6.62 seconds (0.15 tokens/s, 1 tokens, context 1331)
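The allocator hint in that message can be tried by setting PYTORCH_CUDA_ALLOC_CONF before the server starts. A minimal sketch for start-webui.bat, reusing the flags from the original report (the 128 MiB value is just a starting point to experiment with, not a recommendation):

rem Ask PyTorch's CUDA allocator to cap split sizes to reduce fragmentation
set PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
call python server.py --chat --wbits 4 --groupsize 128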
That's just how the system works. Your GPU likely doesn't have enough VRAM for both storing the model weights, and running inference against a full context. As your conversation grows, the model is processing a larger and larger context until you're running out of space. In "parameters", change "Maximum prompt size in tokens" to a smaller number, and it should stop you from hitting OOM.
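For anyone who wants to see this in numbers, you can count how many tokens the accumulated chat history occupies with the model's own tokenizer. A minimal sketch, assuming the tokenizer files sit in the model folder (the path below is just an example taken from the log above):

# Count the tokens in the current chat prompt to see how close it is
# to the "Maximum prompt size in tokens" limit.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("models/vicuna-13b-GPTQ-4bit-128g")
chat_history = "..."  # paste the full prompt/chat history here
print(len(tokenizer(chat_history)["input_ids"]))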
I can see the logic in what you're suggesting, but it's not the case in this instance. As you suggested, I tried changing "Maximum prompt size in tokens" down to double digits, and I even played with the "max_new_tokens" parameter, but I'm still getting OOM at around the same number of words.
At least it led to a funny comment by the AI that her phone just died (LOL).
But it didn't solve the problem; I think there is an issue here. It can't be that I can only hold a conversation of about 10 comments, that's not enough to do anything really interesting.
Has anyone reached a similar cap?
It's early days. Both models you have listed will, on my machine, chew up 11.8GiB of VRAM and hold it (Win 10, RTX 3060, 12GiB VRAM), so it is definitely not optimized, nor do I see any improvement in a dual-GPU environment, and I have cycled through a lot of the --flags in the hope of stumbling onto something that works. I have noticed, as you have, that using --wbits 4 --groupsize 128 negates any benefit the --auto-devices flag has, at least in my set-up(s).
OK, I did some more searching and experimentation and found that adding the --pre_layer 34 flag resolves my OOM issue, but it makes generation very slow; if I raise the number to 36 it crashes.
It's still something, and I can work with it, but I would like it to be a bit faster. In any case I don't consider the issue resolved yet, and I don't really like this solution.
I think there is still a fundamental problem here and plenty of room to optimize and make it work properly. If anyone has any other ideas, I'm open to suggestions.
EDIT: No, sorry, it didn't resolve the issue. It just gave me some additional prompts before hitting the OOM (I got 9 extra prompts). This is still preventing me from using this amazing tool; there must be a proper solution.
I am pretty sure that the maximum context size limit parameter is bugged currently. I also get an OOM error after a while on my 4090 when trying to run the 30B parameter 4bit LLaMA model. Setting max context low makes no difference, and appears to be ignored.
30B 3bit works fine in the meantime, but unfortunately I haven't been able to find a 30B 3bit Alpaca model yet. MetaIX/Alpaca-30B-Int4-128G-Safetensors works well for me, right up until I hit around 1400 context and then it breaks. My guess is that with some optimization (maybe using xformers?) 30B 4bit should work within 24GB VRAM.
I have also experienced the same issue, and I agree that the maximum context size limit parameter seems to have a bug. In my case, I'm using a system with 32GB RAM, an NVIDIA 2080 Ti with 11GB VRAM, and the gpt4-x-alpaca-13b-4bit-128g model. Despite trying various workarounds, I still encounter the CUDA out of memory error after a certain number of conversation turns. I hope this issue can be addressed and resolved soon.
Same problem here. I have the exact same configuration you have and I run into the same problem.
Hey folks, I was facing the same issue. The problem seems to be that the maximum prompt size is hardcoded to 2048.
Edit: Here is the file and line you need to modify: https://github.com/oobabooga/text-generation-webui/blob/main/modules/shared.py#L39 Set it initially to something like 300 and test chatting for a bit to see if you run out of memory. Keep increasing the value and testing until you run out of memory
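For illustration only, the change amounts to lowering the default prompt-size value at that line. In my checkout the entry looks roughly like the line below, but the exact key name may differ between versions, so go by the linked line rather than this sketch:

'chat_prompt_size': 2048,  # try lowering this to e.g. 300, then raise it until you hit OOM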
Pretty sure all this does is set the initial value of the "Maximum prompt size in tokens" param, which you can do yourself in the parameters tab of the UI. It's utilized here: https://github.com/oobabooga/text-generation-webui/blob/main/server.py#L435
Changing it in the UI does nothing for me and a few others though; it continues to generate with a limit of 2048.
Print out max_length at line 33 of chat.py. You're saying it always says 2048, regardless of the UI slider?
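To be concrete, the debug line I mean is just a temporary print dropped into modules/chat.py right where max_length is set, along these lines (remove it again afterwards):

print(f"max_length = {max_length}")  # temporary debug output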
Looks like it is fixed now with the latest updates. Just pulled the latest version and maximum context appears to be respected now. Based on the git changes, it looks like it was a UI bug where the value of the slider was not being read correctly.
Yes, it looks like changing the Maximum prompt size in tokens slider helps with the OOM now. I don't know yet if this fixes the issue or just allows for some additional prompts.
Is there a way to save the state of the sliders between restarts? I need to change it every time I start the app.
On 4bit models you can offload with only --pre_layer 20, or some number around 15-30 (for 13B on 8GB VRAM). I found that --pre_layer 15 allows for the full 2k-token context but performance suffers (under 1 token/s). --pre_layer 35 OOMs, and 30 works but only for two or three questions, then it OOMs.
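For reference, with the flags from the original report, such a launch line might look like this (the --pre_layer value is illustrative; tune it for your VRAM as described above):

call python server.py --chat --wbits 4 --groupsize 128 --pre_layer 20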
I've been wondering if this solution needs some sort of vectorDB or pinecone.io integration to store the conversation long term. I've been having the same problem.
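To make the idea concrete, here is a toy sketch of that approach: keep old turns outside the prompt, embed them, and pull back only the most relevant ones when building the next prompt. No Pinecone here, just cosine similarity with numpy, and embed() is a throwaway stand-in you would replace with a real embedding model:

import numpy as np

def embed(text):
    # Stand-in embedding: hashes characters into a fixed-size unit vector.
    # Swap in a real sentence-embedding model for actual use.
    vec = np.zeros(64)
    for i, ch in enumerate(text.encode()):
        vec[i % 64] += ch
    return vec / (np.linalg.norm(vec) + 1e-8)

# Older chat turns stored outside the prompt
history = ["User: hi", "Bot: hello!", "User: tell me about GPUs", "Bot: GPUs are parallel processors..."]
store = [(turn, embed(turn)) for turn in history]

# Retrieve only the turns most relevant to the new question
query = embed("what did we say about GPUs?")
ranked = sorted(store, key=lambda item: -float(np.dot(query, item[1])))
context = "\n".join(turn for turn, _ in ranked[:2])
print(context)  # only these turns go back into the prompt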
I am having the same problem here...
Processor: Ryzen 5 3600 GPU: RTX 3060Ti RAM: 32GB
Same issue with Ubuntu via WSL on Windows 11, with an RTX 3080 Ti (12GB), using Vicuna. It does seem like this wasn't always a problem, however; I was able to use MUCH bigger contexts until recently.
I don't think it's an issue of VRAM, as even highly specced computers are facing the same thing. I have low VRAM but I get responses fairly quickly; the only problem is that I can't push the conversation very far ahead, sometimes even less than 4 dialogues. But I hear that people with high-spec graphics cards are facing the same issue, in fact with each word taking 5 s to load. Now that's too much. Hope someone fixes the code for this soon.
Has anyone managed to solve the problem? So many (seemingly) high-quality computers failing to load simple models,
even when there is enough memory? That suggests a systemic problem.
Add --auto-devices --gpu-memory 2000MiB. Here is a complete example command in the webui.py file for the wxjiao_alpaca-7b model, loaded in 6GB VRAM:
run_cmd("python server.py --chat --model_type llama --model wxjiao_alpaca-7b --auto-devices --gpu-memory 2000MiB", environment=True) # put your flags here!
This issue has been closed due to inactivity for 6 weeks. If you believe it is still relevant, please leave a comment below. You can tag a developer in your comment.
I'm facing the same issue while trying to run "Open_Gpt4_8x7B_v0.2-GPTQ".
Error:
Traceback (most recent call last):
File "C:\Users\Anuj\text-generation-webui\modules\ui_model_menu.py", line 213, in load_model_wrapper
shared.model, shared.tokenizer = load_model(selected_model, loader)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\Anuj\text-generation-webui\modules\models.py", line 87, in load_model
output = load_func_map[loader](model_name)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\Anuj\text-generation-webui\modules\models.py", line 389, in ExLlamav2_HF_loader
return Exllamav2HF.from_pretrained(model_name)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\Anuj\text-generation-webui\modules\exllamav2_hf.py", line 170, in from_pretrained
return Exllamav2HF(config)
^^^^^^^^^^^^^^^^^^^
File "C:\Users\Anuj\text-generation-webui\modules\exllamav2_hf.py", line 44, in init
self.ex_model.load(split)
File "C:\Users\Anuj.conda\envs\textgen\Lib\site-packages\exllamav2\model.py", line 244, in load
for item in f: return item
File "C:\Users\Anuj.conda\envs\textgen\Lib\site-packages\exllamav2\model.py", line 263, in load_gen
module.load()
File "C:\Users\Anuj.conda\envs\textgen\Lib\site-packages\exllamav2\moe_mlp.py", line 61, in load
self.w3[e].load()
File "C:\Users\Anuj.conda\envs\textgen\Lib\site-packages\exllamav2\linear.py", line 45, in load
if w is None: w = self.load_weight()
^^^^^^^^^^^^^^^^^^
File "C:\Users\Anuj.conda\envs\textgen\Lib\site-packages\exllamav2\module.py", line 92, in load_weight
qtensors = self.load_multi(["qweight", "qzeros", "scales", "g_idx"])
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\Anuj.conda\envs\textgen\Lib\site-packages\exllamav2\module.py", line 75, in load_multi
tensors[k] = st.get_tensor(self.key + "." + k).to(self.device())
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 28.00 MiB. GPU 0 has a total capacty of 3.00 GiB of which 0 bytes is free. Of the allocated memory 10.98 GiB is allocated by PyTorch, and 19.06 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF