text-generation-webui
RuntimeError: CUDA error: CUBLAS_STATUS_ALLOC_FAILED when calling `cublasCreate(handle)`
Describe the bug
Hello, I've got these messages just after typing in the UI (the full traceback is in the Logs section below).
Is there an existing issue for this?
- [X] I have searched the existing issues
Reproduction
Just launch the web UI.
Start typing.
The app crashes.
Screenshot
No response
Logs
To create a public link, set `share=True` in `launch()`.
Traceback (most recent call last):
File "F:\oobabooga-windows\text-generation-webui\modules\callbacks.py", line 66, in gentask
ret = self.mfunc(callback=_callback, **self.kwargs)
File "F:\oobabooga-windows\text-generation-webui\modules\text_generation.py", line 220, in generate_with_callback
shared.model.generate(**kwargs)
File "F:\oobabooga-windows\installer_files\env\lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "F:\oobabooga-windows\installer_files\env\lib\site-packages\transformers\generation\utils.py", line 1485, in generate
return self.sample(
File "F:\oobabooga-windows\installer_files\env\lib\site-packages\transformers\generation\utils.py", line 2524, in sample
outputs = self(
File "F:\oobabooga-windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl return forward_call(*args, **kwargs)
File "F:\oobabooga-windows\installer_files\env\lib\site-packages\transformers\models\llama\modeling_llama.py", line 687, in forward
outputs = self.model(
File "F:\oobabooga-windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl return forward_call(*args, **kwargs)
File "F:\oobabooga-windows\installer_files\env\lib\site-packages\transformers\models\llama\modeling_llama.py", line 577, in forward
layer_outputs = decoder_layer(
File "F:\oobabooga-windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl return forward_call(*args, **kwargs)
File "F:\oobabooga-windows\installer_files\env\lib\site-packages\transformers\models\llama\modeling_llama.py", line 292, in forward
hidden_states, self_attn_weights, present_key_value = self.self_attn(
File "F:\oobabooga-windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl return forward_call(*args, **kwargs)
File "F:\oobabooga-windows\installer_files\env\lib\site-packages\transformers\models\llama\modeling_llama.py", line 214, in forward
attn_weights = torch.matmul(query_states, key_states.transpose(2, 3)) / math.sqrt(self.head_dim)
RuntimeError: CUDA error: CUBLAS_STATUS_ALLOC_FAILED when calling `cublasCreate(handle)`
Output generated in 5.09 seconds (0.00 tokens/s, 0 tokens, context 43)
System Info
Windows 10
AMD Ryzen 7 3700X
RTX 2070 Super
Got exactly the same error. Does anyone know how to fix it?
Same issue here on my RTX 2080 laptop (8 GB VRAM, 64 GB RAM), Win11 22621.1485, up-to-date Nvidia drivers.
I tried it on another computer with an A4500 card using the same install method and everything worked fine. Judging from everyone's replies, could this be a problem specific to 20-series cards?
Log:
CUDA SETUP: CUDA runtime path found: C:\Users\Username\AI\Oobabooga\oobabooga-windows\installer_files\env\bin\cudart64_110.dll
CUDA SETUP: Highest compute capability among GPUs detected: 7.5
CUDA SETUP: Detected CUDA version 117
CUDA SETUP: Loading binary C:\Users\Username\AI\Oobabooga\oobabooga-windows\installer_files\env\lib\site-packages\bitsandbytes\libbitsandbytes_cuda117.dll...
Loading gpt-x-alpaca-13b-native-4bit-128g...
Found the following quantized model: models\gpt-x-alpaca-13b-native-4bit-128g\gpt-x-alpaca-13b-native-4bit-128g-cuda.pt
Loading model ...
Done.
Loaded the model in 7.83 seconds.
Loading the extension "gallery"... Ok.
Running on local URL: http://127.0.0.1:7860
To create a public link, set `share=True` in `launch()`.
Traceback (most recent call last):
File "C:\Users\Username\AI\Oobabooga\oobabooga-windows\text-generation-webui\modules\callbacks.py", line 66, in gentask
ret = self.mfunc(callback=_callback, **self.kwargs)
File "C:\Users\Username\AI\Oobabooga\oobabooga-windows\text-generation-webui\modules\text_generation.py", line 220, in generate_with_callback
shared.model.generate(**kwargs)
File "C:\Users\Username\AI\Oobabooga\oobabooga-windows\installer_files\env\lib\site-packages\torch\utils_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "C:\Users\Username\AI\Oobabooga\oobabooga-windows\installer_files\env\lib\site-packages\transformers\generation\utils.py", line 1485, in generate
return self.sample(
File "C:\Users\Username\AI\Oobabooga\oobabooga-windows\installer_files\env\lib\site-packages\transformers\generation\utils.py", line 2524, in sample
outputs = self(
File "C:\Users\Username\AI\Oobabooga\oobabooga-windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "C:\Users\Username\AI\Oobabooga\oobabooga-windows\installer_files\env\lib\site-packages\transformers\models\llama\modeling_llama.py", line 687, in forward
outputs = self.model(
File "C:\Users\Username\AI\Oobabooga\oobabooga-windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "C:\Users\Username\AI\Oobabooga\oobabooga-windows\installer_files\env\lib\site-packages\transformers\models\llama\modeling_llama.py", line 577, in forward
layer_outputs = decoder_layer(
File "C:\Users\Username\AI\Oobabooga\oobabooga-windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "C:\Users\Username\AI\Oobabooga\oobabooga-windows\installer_files\env\lib\site-packages\transformers\models\llama\modeling_llama.py", line 292, in forward
hidden_states, self_attn_weights, present_key_value = self.self_attn(
File "C:\Users\Username\AI\Oobabooga\oobabooga-windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "C:\Users\Username\AI\Oobabooga\oobabooga-windows\installer_files\env\lib\site-packages\transformers\models\llama\modeling_llama.py", line 214, in forward
attn_weights = torch.matmul(query_states, key_states.transpose(2, 3)) / math.sqrt(self.head_dim)
RuntimeError: CUDA error: CUBLAS_STATUS_ALLOC_FAILED when calling cublasCreate(handle)
Output generated in 5.35 seconds (0.00 tokens/s, 0 tokens, context 44)
Found this guy who had the same issue as well; glad we aren't the only ones. https://www.reddit.com/r/Oobabooga/comments/12gjbev/noob_question_how_do_i_uninstall_oobabooga/
I did some research; it is very probably the 8 GB VRAM limitation hitting 13B 4-bit models. I was trying to run Vicuna-13b-4bit on my RTX 2080 8GB when I got this.
Then I tried adding --pre_layer 30 --threads 16 (CPU threads) to the start.bat file. See this discussion. It works for a while, but runs pretty slowly, about 1 word/sec, almost unusable. Then after some configuration changes it gives me the error again.
I'm not sure if there is a better way to make it work, because I really don't know about all these terms and parameters (13B, 4-bit or 8-bit, etc.).
Still hoping for an expert answer.
--threads 16 doesn't do anything for GPU use. --pre_layer 30 means it keeps ~30 layers on the GPU and offloads the remaining ~10 layers to the CPU. As I read from another user, one layer is ~0.222 GB, which means 30 layers on the GPU take up to ~6.6 GB. Add the OS-reserved VRAM and you're left with no room for a larger context size. If you want to use the full 2k tokens you need --pre_layer 15 or so, which makes it slower. Or raise it to around 25 so you can use it for a while, but it will OOM eventually.
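For a rough idea of where that arithmetic lands, here is a back-of-the-envelope sketch assuming the ~0.222 GB/layer figure above; the reserved-VRAM and context-headroom numbers are assumptions and will vary by model and system:

```python
# Rough --pre_layer estimate for a 13B 4-bit model on an 8 GB card (illustrative only).
GB_PER_LAYER = 0.222       # assumed size of one layer, as quoted above
TOTAL_VRAM_GB = 8.0        # e.g. RTX 2070/2080 Super
OS_RESERVED_GB = 0.8       # assumed: VRAM held by Windows/driver
CONTEXT_HEADROOM_GB = 1.5  # assumed: room for the KV cache at ~2k tokens

usable = TOTAL_VRAM_GB - OS_RESERVED_GB - CONTEXT_HEADROOM_GB
max_gpu_layers = int(usable / GB_PER_LAYER)
print(f"Usable VRAM for layers: {usable:.1f} GB")
print(f"Rough --pre_layer value (layers kept on GPU): {max_gpu_layers}")
```

With these assumed numbers this lands around 25, which matches the experience above: usable with less headroom, but prone to OOM as the context grows.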
The same problem here:
Traceback (most recent call last):
File "C:\Users\r2d2\Downloads\NN\GPT\oobabooga-windows\text-generation-webui\modules\callbacks.py", line 66, in gentask
ret = self.mfunc(callback=_callback, **self.kwargs)
File "C:\Users\r2d2\Downloads\NN\GPT\oobabooga-windows\text-generation-webui\modules\text_generation.py", line 228, in generate_with_callback
shared.model.generate(**kwargs)
File "C:\Users\r2d2\Downloads\NN\GPT\oobabooga-windows\installer_files\env\lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "C:\Users\r2d2\Downloads\NN\GPT\oobabooga-windows\installer_files\env\lib\site-packages\transformers\generation\utils.py", line 1485, in generate
return self.sample(
File "C:\Users\r2d2\Downloads\NN\GPT\oobabooga-windows\installer_files\env\lib\site-packages\transformers\generation\utils.py", line 2524, in sample
outputs = self(
File "C:\Users\r2d2\Downloads\NN\GPT\oobabooga-windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "C:\Users\r2d2\Downloads\NN\GPT\oobabooga-windows\installer_files\env\lib\site-packages\transformers\models\llama\modeling_llama.py", line 687, in forward
outputs = self.model(
File "C:\Users\r2d2\Downloads\NN\GPT\oobabooga-windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "C:\Users\r2d2\Downloads\NN\GPT\oobabooga-windows\text-generation-webui\repositories\GPTQ-for-LLaMa\llama_inference_offload.py", line 162, in forward
layer_outputs = decoder_layer(
File "C:\Users\r2d2\Downloads\NN\GPT\oobabooga-windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "C:\Users\r2d2\Downloads\NN\GPT\oobabooga-windows\installer_files\env\lib\site-packages\transformers\models\llama\modeling_llama.py", line 292, in forward
hidden_states, self_attn_weights, present_key_value = self.self_attn(
File "C:\Users\r2d2\Downloads\NN\GPT\oobabooga-windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "C:\Users\r2d2\Downloads\NN\GPT\oobabooga-windows\installer_files\env\lib\site-packages\transformers\models\llama\modeling_llama.py", line 231, in forward
attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query_states.dtype)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 166.00 MiB (GPU 0; 8.00 GiB total capacity; 6.93 GiB already allocated; 0 bytes free; 7.22 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Output generated in 10.13 seconds (0.00 tokens/s, 0 tokens, context 1475, seed 437771917)
Traceback (most recent call last):
File "C:\Users\r2d2\Downloads\NN\GPT\oobabooga-windows\text-generation-webui\modules\callbacks.py", line 66, in gentask
ret = self.mfunc(callback=_callback, **self.kwargs)
File "C:\Users\r2d2\Downloads\NN\GPT\oobabooga-windows\text-generation-webui\modules\text_generation.py", line 228, in generate_with_callback
shared.model.generate(**kwargs)
File "C:\Users\r2d2\Downloads\NN\GPT\oobabooga-windows\installer_files\env\lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "C:\Users\r2d2\Downloads\NN\GPT\oobabooga-windows\installer_files\env\lib\site-packages\transformers\generation\utils.py", line 1485, in generate
return self.sample(
File "C:\Users\r2d2\Downloads\NN\GPT\oobabooga-windows\installer_files\env\lib\site-packages\transformers\generation\utils.py", line 2524, in sample
outputs = self(
File "C:\Users\r2d2\Downloads\NN\GPT\oobabooga-windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "C:\Users\r2d2\Downloads\NN\GPT\oobabooga-windows\installer_files\env\lib\site-packages\transformers\models\llama\modeling_llama.py", line 687, in forward
outputs = self.model(
File "C:\Users\r2d2\Downloads\NN\GPT\oobabooga-windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "C:\Users\r2d2\Downloads\NN\GPT\oobabooga-windows\text-generation-webui\repositories\GPTQ-for-LLaMa\llama_inference_offload.py", line 162, in forward
layer_outputs = decoder_layer(
File "C:\Users\r2d2\Downloads\NN\GPT\oobabooga-windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "C:\Users\r2d2\Downloads\NN\GPT\oobabooga-windows\installer_files\env\lib\site-packages\transformers\models\llama\modeling_llama.py", line 292, in forward
hidden_states, self_attn_weights, present_key_value = self.self_attn(
File "C:\Users\r2d2\Downloads\NN\GPT\oobabooga-windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "C:\Users\r2d2\Downloads\NN\GPT\oobabooga-windows\installer_files\env\lib\site-packages\transformers\models\llama\modeling_llama.py", line 231, in forward
attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query_states.dtype)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 166.00 MiB (GPU 0; 8.00 GiB total capacity; 6.93 GiB already allocated; 0 bytes free; 7.22 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Output generated in 9.32 seconds (0.00 tokens/s, 0 tokens, context 1475, seed 737525561)
Traceback (most recent call last):
File "C:\Users\r2d2\Downloads\NN\GPT\oobabooga-windows\text-generation-webui\modules\callbacks.py", line 66, in gentask
ret = self.mfunc(callback=_callback, **self.kwargs)
File "C:\Users\r2d2\Downloads\NN\GPT\oobabooga-windows\text-generation-webui\modules\text_generation.py", line 228, in generate_with_callback
shared.model.generate(**kwargs)
File "C:\Users\r2d2\Downloads\NN\GPT\oobabooga-windows\installer_files\env\lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "C:\Users\r2d2\Downloads\NN\GPT\oobabooga-windows\installer_files\env\lib\site-packages\transformers\generation\utils.py", line 1485, in generate
return self.sample(
File "C:\Users\r2d2\Downloads\NN\GPT\oobabooga-windows\installer_files\env\lib\site-packages\transformers\generation\utils.py", line 2524, in sample
outputs = self(
File "C:\Users\r2d2\Downloads\NN\GPT\oobabooga-windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "C:\Users\r2d2\Downloads\NN\GPT\oobabooga-windows\installer_files\env\lib\site-packages\transformers\models\llama\modeling_llama.py", line 687, in forward
outputs = self.model(
File "C:\Users\r2d2\Downloads\NN\GPT\oobabooga-windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "C:\Users\r2d2\Downloads\NN\GPT\oobabooga-windows\text-generation-webui\repositories\GPTQ-for-LLaMa\llama_inference_offload.py", line 162, in forward
layer_outputs = decoder_layer(
File "C:\Users\r2d2\Downloads\NN\GPT\oobabooga-windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "C:\Users\r2d2\Downloads\NN\GPT\oobabooga-windows\installer_files\env\lib\site-packages\transformers\models\llama\modeling_llama.py", line 292, in forward
hidden_states, self_attn_weights, present_key_value = self.self_attn(
File "C:\Users\r2d2\Downloads\NN\GPT\oobabooga-windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "C:\Users\r2d2\Downloads\NN\GPT\oobabooga-windows\installer_files\env\lib\site-packages\transformers\models\llama\modeling_llama.py", line 231, in forward
attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query_states.dtype)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 166.00 MiB (GPU 0; 8.00 GiB total capacity; 6.93 GiB already allocated; 0 bytes free; 7.22 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Output generated in 7.36 seconds (0.00 tokens/s, 0 tokens, context 1475, seed 162802677)
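The OOM message itself suggests trying max_split_size_mb to reduce fragmentation. A minimal sketch of one way to set it, assuming a value of 512 MB purely for experimentation (it has to be in the environment before the first CUDA allocation, e.g. at the very top of server.py or via `set` in start.bat):

```python
# Illustrative only: configure the CUDA caching allocator before torch
# makes any CUDA allocations, as suggested by the OOM message above.
import os
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:512")  # assumed value

import torch  # imported after setting the variable so the allocator picks it up

print("Allocator config:", os.environ["PYTORCH_CUDA_ALLOC_CONF"])
print("CUDA available:", torch.cuda.is_available())
```

Whether this helps here is uncertain; with 0 bytes free it may only delay the OOM rather than prevent it.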
I have the same issue.
GPU: NVIDIA GeForce RTX 2070 Super with 8 GB of memory. Running oobabooga on Windows.
The model loaded successfully; after typing any simple prompt ("how are you") it crashes with this error:
Traceback (most recent call last):
File "D:\gpt4\llama\oobabooga-windows\text-generation-webui\modules\callbacks.py", line 66, in gentask
ret = self.mfunc(callback=_callback, **self.kwargs)
File "D:\gpt4\llama\oobabooga-windows\text-generation-webui\modules\text_generation.py", line 252, in generate_with_callback
shared.model.generate(**kwargs)
File "D:\gpt4\llama\oobabooga-windows\installer_files\env\lib\site-packages\torch\utils_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "D:\gpt4\llama\oobabooga-windows\installer_files\env\lib\site-packages\transformers\generation\utils.py", line 1485, in generate
return self.sample(
File "D:\gpt4\llama\oobabooga-windows\installer_files\env\lib\site-packages\transformers\generation\utils.py", line 2524, in sample
outputs = self(
File "D:\gpt4\llama\oobabooga-windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "D:\gpt4\llama\oobabooga-windows\installer_files\env\lib\site-packages\transformers\models\llama\modeling_llama.py", line 687, in forward
outputs = self.model(
File "D:\gpt4\llama\oobabooga-windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "D:\gpt4\llama\oobabooga-windows\installer_files\env\lib\site-packages\transformers\models\llama\modeling_llama.py", line 577, in forward
layer_outputs = decoder_layer(
File "D:\gpt4\llama\oobabooga-windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "D:\gpt4\llama\oobabooga-windows\installer_files\env\lib\site-packages\transformers\models\llama\modeling_llama.py", line 292, in forward
hidden_states, self_attn_weights, present_key_value = self.self_attn(
File "D:\gpt4\llama\oobabooga-windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "D:\gpt4\llama\oobabooga-windows\installer_files\env\lib\site-packages\transformers\models\llama\modeling_llama.py", line 214, in forward
attn_weights = torch.matmul(query_states, key_states.transpose(2, 3)) / math.sqrt(self.head_dim)
RuntimeError: CUDA error: CUBLAS_STATUS_ALLOC_FAILED when calling cublasCreate(handle)
Output generated in 2.38 seconds (0.00 tokens/s, 0 tokens, context 35, seed 1369478006)
Judging from everyone's replies, could this be a problem specific to certain 20-series cards?
Guys, I have almost confirmed that it is a VRAM issue; it just doesn't give the "out of memory" error but this kind of crash instead.
I loaded a 7B model (a fine-tuned version of Alpaca) and it works well. The generation speed is double or triple that of "13B with the layer setting", and there is no creepy "buzz" sound from the machine any more while generating. But it can't be used with the Gallery extension (custom characters), otherwise it pops a VRAM error again. For optimizing, perhaps using --no-cache and --xformers would decrease VRAM usage, but --xformers doesn't work for me for some reason right now.
However, most 7B models are not good at describing facts. Even if --xformers can be used, I feel like 7B models can only be used for role play, writing some simple sentences, etc. I believe unlimited VRAM isn't the only way to increase quality. Let's wait for a better model solution, and don't forget to let me know if there's a good model that can be used on 8 GB VRAM. Or let time drown our poor little VRAM. 😂
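As a small sketch for checking how much VRAM is actually free before picking a model size (torch.cuda.mem_get_info is a standard PyTorch call; the 7B/13B thresholds below are rough assumptions based on the experiences in this thread):

```python
# Print free VRAM and give a rough model-size suggestion (illustrative only).
import torch

free, total = torch.cuda.mem_get_info()  # bytes: (free, total) on the current device
free_gb, total_gb = free / 1024**3, total / 1024**3
print(f"Free VRAM: {free_gb:.2f} GiB of {total_gb:.2f} GiB")

# Assumed thresholds for 4-bit quantized models:
if free_gb >= 10:
    print("A 13B 4-bit model should fit fully on the GPU.")
elif free_gb >= 5:
    print("A 7B 4-bit model should fit; 13B will need --pre_layer offloading.")
else:
    print("Consider more aggressive offloading or a smaller model.")
```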
call python server.py --auto-devices --extensions api --wbits 4 --groupsize 128 --pre_layer 35 --gpu-memory 7 --model-menu
These parameters work for me, but the model generates only 1 token/second. I have an RTX 2080 with 8 GB VRAM. I also increased the Virtual Memory size to 32 GB on my SSD drive.
I recommend the --xformers flag too. Just install it with pip install xformers. It increased speed to 1.6 it/s.
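A quick, illustrative way to confirm xformers is installed and actually runs on the GPU before adding the --xformers flag (the tensor shapes here are arbitrary):

```python
# Sanity-check xformers' memory-efficient attention on the current GPU.
import torch
import xformers.ops  # fails here if `pip install xformers` hasn't been run

# Arbitrary small input: (batch, seq_len, heads, head_dim)
q = torch.randn(1, 16, 8, 64, device="cuda", dtype=torch.float16)
out = xformers.ops.memory_efficient_attention(q, q, q)
print("xformers OK, output shape:", tuple(out.shape))
```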
GPU: NVIDIA GeForce RTX 2080 Super Laptop, 32 GB RAM
File "C:\TCHT\oobabooga_windows\installer_files\env\lib\site-packages\transformers\models\llama\modeling_llama.py", line 214, in forward
attn_weights = torch.matmul(query_states, key_states.transpose(2, 3)) / math.sqrt(self.head_dim)
RuntimeError: CUDA error: CUBLAS_STATUS_ALLOC_FAILED when calling cublasCreate(handle)
Output generated in 1.17 seconds (0.00 tokens/s, 0 tokens, context 23, seed 1193915026)
I have the same problem on an NVIDIA GeForce RTX 2080.
File "C:\AI\oobabooga_windows\installer_files\env\lib\site-packages\transformers\models\llama\modeling_llama.py", line 214, in forward
attn_weights = torch.matmul(query_states, key_states.transpose(2, 3)) / math.sqrt(self.head_dim)
RuntimeError: CUDA error: CUBLAS_STATUS_ALLOC_FAILED when calling `
Same here, RTX 2070.
File "C:\AI\oobabooga_windows\oobabooga_windows\installer_files\env\lib\site-packages\transformers\models\llama\modeling_llama.py", line 214, in forward attn_weights = torch.matmul(query_states, key_states.transpose(2, 3)) / math.sqrt(self.head_dim) RuntimeError: CUDA error: CUBLAS_STATUS_ALLOC_FAILED when calling
cublasCreate(handle) Output generated in 1.55 seconds (0.00 tokens/s, 0 tokens, context 27, seed 5223124)
> ...there is no creepy "buzz" sound from the machine any more while generating...
@Gitbreast What the hell is that buzzing, I hear it too 😆
Got the same issue here; my card is a 4095 MB NVIDIA GeForce RTX 2080 SUPER, and I'm trying to run anon8231489123_gpt4-x-alpaca-13b-native-4bit-128g.
To create a public link, set `share=True` in `launch()`.
Traceback (most recent call last):
File "E:\oobabooga_windows\text-generation-webui\modules\callbacks.py", line 73, in gentask
ret = self.mfunc(callback=_callback, **self.kwargs)
File "E:\oobabooga_windows\text-generation-webui\modules\text_generation.py", line 251, in generate_with_callback
shared.model.generate(**kwargs)
File "E:\oobabooga_windows\installer_files\env\lib\site-packages\torch\utils_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "E:\oobabooga_windows\installer_files\env\lib\site-packages\transformers\generation\utils.py", line 1485, in generate
return self.sample(
File "E:\oobabooga_windows\installer_files\env\lib\site-packages\transformers\generation\utils.py", line 2524, in sample
outputs = self(
File "E:\oobabooga_windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "E:\oobabooga_windows\installer_files\env\lib\site-packages\transformers\models\llama\modeling_llama.py", line 687, in forward
outputs = self.model(
File "E:\oobabooga_windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "E:\oobabooga_windows\installer_files\env\lib\site-packages\transformers\models\llama\modeling_llama.py", line 577, in forward
layer_outputs = decoder_layer(
File "E:\oobabooga_windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "E:\oobabooga_windows\installer_files\env\lib\site-packages\transformers\models\llama\modeling_llama.py", line 292, in forward
hidden_states, self_attn_weights, present_key_value = self.self_attn(
File "E:\oobabooga_windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "E:\oobabooga_windows\installer_files\env\lib\site-packages\transformers\models\llama\modeling_llama.py", line 214, in forward
attn_weights = torch.matmul(query_states, key_states.transpose(2, 3)) / math.sqrt(self.head_dim)
RuntimeError: CUDA error: CUBLAS_STATUS_ALLOC_FAILED when calling cublasCreate(handle)
Output generated in 2.11 seconds (0.00 tokens/s, 0 tokens, context 36, seed 630556558)
This issue has been closed due to inactivity for 6 weeks. If you believe it is still relevant, please leave a comment below. You can tag a developer in your comment.