text-generation-webui

RuntimeError: CUDA error: CUBLAS_STATUS_ALLOC_FAILED when calling `cublasCreate(handle)`

Open clementvp opened this issue 1 year ago • 14 comments

Describe the bug

Hello, I get the following error messages right after typing in the UI.

To  create a public link, set `share=True` in `launch()`.
Traceback (most recent call last):
  File "F:\oobabooga-windows\text-generation-webui\modules\callbacks.py", line 66, in gentask
    ret = self.mfunc(callback=_callback, **self.kwargs)
  File "F:\oobabooga-windows\text-generation-webui\modules\text_generation.py", line 220, in generate_with_callback
    shared.model.generate(**kwargs)
  File "F:\oobabooga-windows\installer_files\env\lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "F:\oobabooga-windows\installer_files\env\lib\site-packages\transformers\generation\utils.py", line 1485, in generate
    return self.sample(
  File "F:\oobabooga-windows\installer_files\env\lib\site-packages\transformers\generation\utils.py", line 2524, in sample
    outputs = self(
  File "F:\oobabooga-windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl    return forward_call(*args, **kwargs)
  File "F:\oobabooga-windows\installer_files\env\lib\site-packages\transformers\models\llama\modeling_llama.py", line 687, in forward
    outputs = self.model(
  File "F:\oobabooga-windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl    return forward_call(*args, **kwargs)
  File "F:\oobabooga-windows\installer_files\env\lib\site-packages\transformers\models\llama\modeling_llama.py", line 577, in forward
    layer_outputs = decoder_layer(
  File "F:\oobabooga-windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl    return forward_call(*args, **kwargs)
  File "F:\oobabooga-windows\installer_files\env\lib\site-packages\transformers\models\llama\modeling_llama.py", line 292, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "F:\oobabooga-windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl    return forward_call(*args, **kwargs)
  File "F:\oobabooga-windows\installer_files\env\lib\site-packages\transformers\models\llama\modeling_llama.py", line 214, in forward
    attn_weights = torch.matmul(query_states, key_states.transpose(2, 3)) / math.sqrt(self.head_dim)
RuntimeError: CUDA error: CUBLAS_STATUS_ALLOC_FAILED when calling `cublasCreate(handle)`
Output generated in 5.09 seconds (0.00 tokens/s, 0 tokens, context 43)

Is there an existing issue for this?

  • [X] I have searched the existing issues

Reproduction

Just launch the web UI.
Start typing.

The app crashes.

Screenshot

No response

Logs

(Same traceback as in the description above.)

System Info

Windows 10
Ryzen 7 3700X
RTX 2070 Super

clementvp avatar Apr 09 '23 15:04 clementvp

Got exactly the same error. Does anyone know how to fix it?

Gitbreast avatar Apr 09 '23 20:04 Gitbreast

Same issue here on my RTX 2080 laptop (8 GB VRAM, 64 GB RAM), Windows 11 22621.1485, up-to-date Nvidia drivers.

I tried it on another computer with an A4500 card using the same install method and everything worked fine. Judging by everyone's replies, could this be a problem specific to 20-series cards?

Log:

CUDA SETUP: CUDA runtime path found: C:\Users\Username\AI\Oobabooga\oobabooga-windows\installer_files\env\bin\cudart64_110.dll
CUDA SETUP: Highest compute capability among GPUs detected: 7.5
CUDA SETUP: Detected CUDA version 117
CUDA SETUP: Loading binary C:\Users\Username\AI\Oobabooga\oobabooga-windows\installer_files\env\lib\site-packages\bitsandbytes\libbitsandbytes_cuda117.dll...
Loading gpt-x-alpaca-13b-native-4bit-128g...
Found the following quantized model: models\gpt-x-alpaca-13b-native-4bit-128g\gpt-x-alpaca-13b-native-4bit-128g-cuda.pt
Loading model ...
Done.
Loaded the model in 7.83 seconds.
Loading the extension "gallery"... Ok.
Running on local URL: http://127.0.0.1:7860

To create a public link, set share=True in launch().
Traceback (most recent call last):
  File "C:\Users\Username\AI\Oobabooga\oobabooga-windows\text-generation-webui\modules\callbacks.py", line 66, in gentask
    ret = self.mfunc(callback=_callback, **self.kwargs)
  File "C:\Users\Username\AI\Oobabooga\oobabooga-windows\text-generation-webui\modules\text_generation.py", line 220, in generate_with_callback
    shared.model.generate(**kwargs)
  File "C:\Users\Username\AI\Oobabooga\oobabooga-windows\installer_files\env\lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "C:\Users\Username\AI\Oobabooga\oobabooga-windows\installer_files\env\lib\site-packages\transformers\generation\utils.py", line 1485, in generate
    return self.sample(
  File "C:\Users\Username\AI\Oobabooga\oobabooga-windows\installer_files\env\lib\site-packages\transformers\generation\utils.py", line 2524, in sample
    outputs = self(
  File "C:\Users\Username\AI\Oobabooga\oobabooga-windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\Username\AI\Oobabooga\oobabooga-windows\installer_files\env\lib\site-packages\transformers\models\llama\modeling_llama.py", line 687, in forward
    outputs = self.model(
  File "C:\Users\Username\AI\Oobabooga\oobabooga-windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\Username\AI\Oobabooga\oobabooga-windows\installer_files\env\lib\site-packages\transformers\models\llama\modeling_llama.py", line 577, in forward
    layer_outputs = decoder_layer(
  File "C:\Users\Username\AI\Oobabooga\oobabooga-windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\Username\AI\Oobabooga\oobabooga-windows\installer_files\env\lib\site-packages\transformers\models\llama\modeling_llama.py", line 292, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "C:\Users\Username\AI\Oobabooga\oobabooga-windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\Username\AI\Oobabooga\oobabooga-windows\installer_files\env\lib\site-packages\transformers\models\llama\modeling_llama.py", line 214, in forward
    attn_weights = torch.matmul(query_states, key_states.transpose(2, 3)) / math.sqrt(self.head_dim)
RuntimeError: CUDA error: CUBLAS_STATUS_ALLOC_FAILED when calling cublasCreate(handle)
Output generated in 5.35 seconds (0.00 tokens/s, 0 tokens, context 44)


Found this guy who had the same issue as well. Glad we aren't the only ones. https://www.reddit.com/r/Oobabooga/comments/12gjbev/noob_question_how_do_i_uninstall_oobabooga/

tonyzehs avatar Apr 10 '23 06:04 tonyzehs

I did some research; it is very probably the 8 GB VRAM limit that causes this with 13B 4-bit models. I was trying to run Vicuna-13B 4-bit on my RTX 2080 8GB and got this error. I then added --pre_layer 30 --threads 16 (CPU threads) to the start.bat file (see this discussion). It works for a while, but runs pretty slowly, about 1 word/sec, which is almost unusable. Then, after some configuration, it gives me the error again. I'm not sure if there is a better way to make it work, because I really don't know what all these terms and parameters mean (13B, 4-bit or 8-bit, etc.). Still hoping for a professional answer.

Gitbreast avatar Apr 10 '23 14:04 Gitbreast

--threads 16 doesn't do anything for GPU use. --pre_layer 30 means it keeps ~30 layers on the GPU and offloads the remaining ~10 layers to the CPU. As I read from another user, one layer is ~0.222 GB, so 30 layers on the GPU take up to about 6.6 GB. Add the VRAM the OS reserves and you're left with no room for a larger context size. If you want to use the full 2k tokens you need --pre_layer 15 or so, which makes it slower. Or raise it to around 25 so you can use it somewhat, but it will OOM after a while.
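
A rough sketch of that arithmetic (the ~0.222 GB/layer figure is the estimate quoted above, a 13B LLaMA model has 40 decoder layers, and the OS-reserved amount is only a guess):

# Back-of-the-envelope VRAM estimate for --pre_layer on an 8 GB card.
# GB_PER_LAYER is the rough per-layer estimate quoted above, not a measurement.
GB_PER_LAYER = 0.222
TOTAL_VRAM_GB = 8.0
OS_RESERVED_GB = 1.0  # rough guess for what Windows and the driver keep

def headroom_gb(pre_layer: int) -> float:
    """VRAM left for activations/KV cache after keeping `pre_layer` layers on the GPU."""
    return TOTAL_VRAM_GB - OS_RESERVED_GB - pre_layer * GB_PER_LAYER

for n in (15, 25, 30, 35):
    print(f"--pre_layer {n}: ~{n * GB_PER_LAYER:.1f} GB of weights on GPU, ~{headroom_gb(n):.1f} GB headroom")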

ghost avatar Apr 11 '23 17:04 ghost

The same problem here:

Traceback (most recent call last):
  File "C:\Users\r2d2\Downloads\NN\GPT\oobabooga-windows\text-generation-webui\modules\callbacks.py", line 66, in gentask
    ret = self.mfunc(callback=_callback, **self.kwargs)
  File "C:\Users\r2d2\Downloads\NN\GPT\oobabooga-windows\text-generation-webui\modules\text_generation.py", line 228, in generate_with_callback
    shared.model.generate(**kwargs)
  File "C:\Users\r2d2\Downloads\NN\GPT\oobabooga-windows\installer_files\env\lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "C:\Users\r2d2\Downloads\NN\GPT\oobabooga-windows\installer_files\env\lib\site-packages\transformers\generation\utils.py", line 1485, in generate
    return self.sample(
  File "C:\Users\r2d2\Downloads\NN\GPT\oobabooga-windows\installer_files\env\lib\site-packages\transformers\generation\utils.py", line 2524, in sample
    outputs = self(
  File "C:\Users\r2d2\Downloads\NN\GPT\oobabooga-windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\r2d2\Downloads\NN\GPT\oobabooga-windows\installer_files\env\lib\site-packages\transformers\models\llama\modeling_llama.py", line 687, in forward
    outputs = self.model(
  File "C:\Users\r2d2\Downloads\NN\GPT\oobabooga-windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\r2d2\Downloads\NN\GPT\oobabooga-windows\text-generation-webui\repositories\GPTQ-for-LLaMa\llama_inference_offload.py", line 162, in forward
    layer_outputs = decoder_layer(
  File "C:\Users\r2d2\Downloads\NN\GPT\oobabooga-windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\r2d2\Downloads\NN\GPT\oobabooga-windows\installer_files\env\lib\site-packages\transformers\models\llama\modeling_llama.py", line 292, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "C:\Users\r2d2\Downloads\NN\GPT\oobabooga-windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\r2d2\Downloads\NN\GPT\oobabooga-windows\installer_files\env\lib\site-packages\transformers\models\llama\modeling_llama.py", line 231, in forward
    attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query_states.dtype)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 166.00 MiB (GPU 0; 8.00 GiB total capacity; 6.93 GiB already allocated; 0 bytes free; 7.22 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Output generated in 10.13 seconds (0.00 tokens/s, 0 tokens, context 1475, seed 437771917)
Traceback (most recent call last):
  File "C:\Users\r2d2\Downloads\NN\GPT\oobabooga-windows\text-generation-webui\modules\callbacks.py", line 66, in gentask
    ret = self.mfunc(callback=_callback, **self.kwargs)
  File "C:\Users\r2d2\Downloads\NN\GPT\oobabooga-windows\text-generation-webui\modules\text_generation.py", line 228, in generate_with_callback
    shared.model.generate(**kwargs)
  File "C:\Users\r2d2\Downloads\NN\GPT\oobabooga-windows\installer_files\env\lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "C:\Users\r2d2\Downloads\NN\GPT\oobabooga-windows\installer_files\env\lib\site-packages\transformers\generation\utils.py", line 1485, in generate
    return self.sample(
  File "C:\Users\r2d2\Downloads\NN\GPT\oobabooga-windows\installer_files\env\lib\site-packages\transformers\generation\utils.py", line 2524, in sample
    outputs = self(
  File "C:\Users\r2d2\Downloads\NN\GPT\oobabooga-windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\r2d2\Downloads\NN\GPT\oobabooga-windows\installer_files\env\lib\site-packages\transformers\models\llama\modeling_llama.py", line 687, in forward
    outputs = self.model(
  File "C:\Users\r2d2\Downloads\NN\GPT\oobabooga-windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\r2d2\Downloads\NN\GPT\oobabooga-windows\text-generation-webui\repositories\GPTQ-for-LLaMa\llama_inference_offload.py", line 162, in forward
    layer_outputs = decoder_layer(
  File "C:\Users\r2d2\Downloads\NN\GPT\oobabooga-windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\r2d2\Downloads\NN\GPT\oobabooga-windows\installer_files\env\lib\site-packages\transformers\models\llama\modeling_llama.py", line 292, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "C:\Users\r2d2\Downloads\NN\GPT\oobabooga-windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\r2d2\Downloads\NN\GPT\oobabooga-windows\installer_files\env\lib\site-packages\transformers\models\llama\modeling_llama.py", line 231, in forward
    attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query_states.dtype)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 166.00 MiB (GPU 0; 8.00 GiB total capacity; 6.93 GiB already allocated; 0 bytes free; 7.22 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Output generated in 9.32 seconds (0.00 tokens/s, 0 tokens, context 1475, seed 737525561)
Traceback (most recent call last):
  File "C:\Users\r2d2\Downloads\NN\GPT\oobabooga-windows\text-generation-webui\modules\callbacks.py", line 66, in gentask
    ret = self.mfunc(callback=_callback, **self.kwargs)
  File "C:\Users\r2d2\Downloads\NN\GPT\oobabooga-windows\text-generation-webui\modules\text_generation.py", line 228, in generate_with_callback
    shared.model.generate(**kwargs)
  File "C:\Users\r2d2\Downloads\NN\GPT\oobabooga-windows\installer_files\env\lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "C:\Users\r2d2\Downloads\NN\GPT\oobabooga-windows\installer_files\env\lib\site-packages\transformers\generation\utils.py", line 1485, in generate
    return self.sample(
  File "C:\Users\r2d2\Downloads\NN\GPT\oobabooga-windows\installer_files\env\lib\site-packages\transformers\generation\utils.py", line 2524, in sample
    outputs = self(
  File "C:\Users\r2d2\Downloads\NN\GPT\oobabooga-windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\r2d2\Downloads\NN\GPT\oobabooga-windows\installer_files\env\lib\site-packages\transformers\models\llama\modeling_llama.py", line 687, in forward
    outputs = self.model(
  File "C:\Users\r2d2\Downloads\NN\GPT\oobabooga-windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\r2d2\Downloads\NN\GPT\oobabooga-windows\text-generation-webui\repositories\GPTQ-for-LLaMa\llama_inference_offload.py", line 162, in forward
    layer_outputs = decoder_layer(
  File "C:\Users\r2d2\Downloads\NN\GPT\oobabooga-windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\r2d2\Downloads\NN\GPT\oobabooga-windows\installer_files\env\lib\site-packages\transformers\models\llama\modeling_llama.py", line 292, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "C:\Users\r2d2\Downloads\NN\GPT\oobabooga-windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\r2d2\Downloads\NN\GPT\oobabooga-windows\installer_files\env\lib\site-packages\transformers\models\llama\modeling_llama.py", line 231, in forward
    attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query_states.dtype)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 166.00 MiB (GPU 0; 8.00 GiB total capacity; 6.93 GiB already allocated; 0 bytes free; 7.22 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Output generated in 7.36 seconds (0.00 tokens/s, 0 tokens, context 1475, seed 162802677)
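
The OOM message itself suggests trying max_split_size_mb. A minimal, hedged way to experiment with that (the 512 MB value is illustrative, not a recommendation from this thread) is to set PYTORCH_CUDA_ALLOC_CONF before PyTorch makes its first CUDA allocation, either with set PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512 in the console before launching or with a small edit like this:

# Hypothetical tweak, not something proposed in the replies: the allocator config
# must be set before PyTorch makes its first CUDA allocation, e.g. at the very
# top of server.py. 512 MB is an illustrative value.
import os
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:512")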

VitaliyAT avatar Apr 13 '23 11:04 VitaliyAT

I have the same issue.

GPU: NVIDIA GeForce RTX 2070 Super with 8 GB of memory, running oobabooga on Windows.

The model loads successfully, but after typing any simple prompt ("how are you") it crashes with this error:

Traceback (most recent call last):
  File "D:\gpt4\llama\oobabooga-windows\text-generation-webui\modules\callbacks.py", line 66, in gentask
    ret = self.mfunc(callback=_callback, **self.kwargs)
  File "D:\gpt4\llama\oobabooga-windows\text-generation-webui\modules\text_generation.py", line 252, in generate_with_callback
    shared.model.generate(**kwargs)
  File "D:\gpt4\llama\oobabooga-windows\installer_files\env\lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "D:\gpt4\llama\oobabooga-windows\installer_files\env\lib\site-packages\transformers\generation\utils.py", line 1485, in generate
    return self.sample(
  File "D:\gpt4\llama\oobabooga-windows\installer_files\env\lib\site-packages\transformers\generation\utils.py", line 2524, in sample
    outputs = self(
  File "D:\gpt4\llama\oobabooga-windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\gpt4\llama\oobabooga-windows\installer_files\env\lib\site-packages\transformers\models\llama\modeling_llama.py", line 687, in forward
    outputs = self.model(
  File "D:\gpt4\llama\oobabooga-windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\gpt4\llama\oobabooga-windows\installer_files\env\lib\site-packages\transformers\models\llama\modeling_llama.py", line 577, in forward
    layer_outputs = decoder_layer(
  File "D:\gpt4\llama\oobabooga-windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\gpt4\llama\oobabooga-windows\installer_files\env\lib\site-packages\transformers\models\llama\modeling_llama.py", line 292, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "D:\gpt4\llama\oobabooga-windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\gpt4\llama\oobabooga-windows\installer_files\env\lib\site-packages\transformers\models\llama\modeling_llama.py", line 214, in forward
    attn_weights = torch.matmul(query_states, key_states.transpose(2, 3)) / math.sqrt(self.head_dim)
RuntimeError: CUDA error: CUBLAS_STATUS_ALLOC_FAILED when calling cublasCreate(handle)
Output generated in 2.38 seconds (0.00 tokens/s, 0 tokens, context 35, seed 1369478006)

de5car7es avatar Apr 16 '23 19:04 de5car7es

Judging by everyone's replies, could this be a problem specific to certain 20-series cards?

tonyzehs avatar Apr 17 '23 01:04 tonyzehs

Guys, I have almost confirmed that it is a VRAM issue; it just doesn't give the "out of memory" error but this kind of crash instead.

I loaded a 7B model, a fine-tuned version of Alpaca, and it works well. The generation speed is double or triple that of "13B with the layer setting", and the creepy buzzing sound from the machine while generating is gone. But it can't be used with the Gallery extension (custom characters), otherwise it pops a VRAM error again. For optimizing, perhaps --no-cache and --xformers would decrease VRAM usage, but --xformers doesn't work for me for some reason right now.

However, for now, most 7B models are not good at describing facts. Even if --xformers can be used, I feel like 7B models can only be used for role play, writing simple sentences, etc. I believe unlimited VRAM isn't the only way to increase quality. Let's wait for a better model solution, and don't forget to let me know if there's a good model that can be used on 8 GB of VRAM. Or let time drown our poor little VRAM. 😂
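
One quick way to sanity-check the VRAM theory (a minimal sketch, assuming the webui's Python environment with a CUDA build of PyTorch) is to print how much memory is actually free right before generating:

# Minimal sketch: report free vs. total VRAM so you can see whether generation
# is starting with almost nothing left on the card.
import torch

free_bytes, total_bytes = torch.cuda.mem_get_info()
print(f"free {free_bytes / 1024**3:.2f} GiB / total {total_bytes / 1024**3:.2f} GiB")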

Gitbreast avatar Apr 17 '23 07:04 Gitbreast

call python server.py --auto-devices --extensions api --wbits 4 --groupsize 128 --pre_layer 35 --gpu-memory 7 --model-menu

These parameters work for me, but the model generates only 1 token/second. I have an RTX 2080 with 8 GB VRAM. I also increased the Windows virtual memory size to 32 GB on my SSD drive.

I recommend the --xformers flag too. Just install it with pip install xformers. It increased the speed to 1.6 it/s.

morganavr avatar Apr 22 '23 11:04 morganavr

GPU: NVIDIA GeForce RTX 2080 Super (laptop), 32 GB RAM

File "C:\TCHT\oobabooga_windows\installer_files\env\lib\site-packages\transformers\models\llama\modeling_llama.py", line 214, in forward attn_weights = torch.matmul(query_states, key_states.transpose(2, 3)) / math.sqrt(self.head_dim) RuntimeError: CUDA error: CUBLAS_STATUS_ALLOC_FAILED when calling cublasCreate(handle) Output generated in 1.17 seconds (0.00 tokens/s, 0 tokens, context 23, seed 1193915026)

PaulShroom avatar Apr 28 '23 03:04 PaulShroom

I have the same problem on an NVIDIA GeForce RTX 2080.

  File "C:\AI\oobabooga_windows\installer_files\env\lib\site-packages\transformers\models\llama\modeling_llama.py", line 214, in forward
    attn_weights = torch.matmul(query_states, key_states.transpose(2, 3)) / math.sqrt(self.head_dim)
RuntimeError: CUDA error: CUBLAS_STATUS_ALLOC_FAILED when calling `

sohneg avatar May 01 '23 15:05 sohneg

Same here, RTX 2070:

File "C:\AI\oobabooga_windows\oobabooga_windows\installer_files\env\lib\site-packages\transformers\models\llama\modeling_llama.py", line 214, in forward attn_weights = torch.matmul(query_states, key_states.transpose(2, 3)) / math.sqrt(self.head_dim) RuntimeError: CUDA error: CUBLAS_STATUS_ALLOC_FAILED when calling cublasCreate(handle) Output generated in 1.55 seconds (0.00 tokens/s, 0 tokens, context 27, seed 5223124)

TheLolkPlays avatar May 07 '23 21:05 TheLolkPlays

Guys, I have almost confirmed that it is a VRAM issue; it just doesn't give the "out of memory" error but this kind of crash instead.

I loaded a 7B model, a fine-tuned version of Alpaca, and it works well. The generation speed is double or triple that of "13B with the layer setting", and the creepy buzzing sound from the machine while generating is gone. But it can't be used with the Gallery extension (custom characters), otherwise it pops a VRAM error again. For optimizing, perhaps --no-cache and --xformers would decrease VRAM usage, but --xformers doesn't work for me for some reason right now.

However, for now, most 7B models are not good at describing facts. Even if --xformers can be used, I feel like 7B models can only be used for role play, writing simple sentences, etc. I believe unlimited VRAM isn't the only way to increase quality. Let's wait for a better model solution, and don't forget to let me know if there's a good model that can be used on 8 GB of VRAM. Or let time drown our poor little VRAM. 😂

@Gitbreast What the hell is that buzzing, I hear it too 😆

TheLolkPlays avatar May 07 '23 22:05 TheLolkPlays

Got the same issue here. My card is an NVIDIA GeForce RTX 2080 SUPER (4095 MB), and I'm trying to run anon8231489123_gpt4-x-alpaca-13b-native-4bit-128g.

To create a public link, set share=True in launch().
Traceback (most recent call last):
  File "E:\oobabooga_windows\text-generation-webui\modules\callbacks.py", line 73, in gentask
    ret = self.mfunc(callback=_callback, **self.kwargs)
  File "E:\oobabooga_windows\text-generation-webui\modules\text_generation.py", line 251, in generate_with_callback
    shared.model.generate(**kwargs)
  File "E:\oobabooga_windows\installer_files\env\lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "E:\oobabooga_windows\installer_files\env\lib\site-packages\transformers\generation\utils.py", line 1485, in generate
    return self.sample(
  File "E:\oobabooga_windows\installer_files\env\lib\site-packages\transformers\generation\utils.py", line 2524, in sample
    outputs = self(
  File "E:\oobabooga_windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "E:\oobabooga_windows\installer_files\env\lib\site-packages\transformers\models\llama\modeling_llama.py", line 687, in forward
    outputs = self.model(
  File "E:\oobabooga_windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "E:\oobabooga_windows\installer_files\env\lib\site-packages\transformers\models\llama\modeling_llama.py", line 577, in forward
    layer_outputs = decoder_layer(
  File "E:\oobabooga_windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "E:\oobabooga_windows\installer_files\env\lib\site-packages\transformers\models\llama\modeling_llama.py", line 292, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "E:\oobabooga_windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "E:\oobabooga_windows\installer_files\env\lib\site-packages\transformers\models\llama\modeling_llama.py", line 214, in forward
    attn_weights = torch.matmul(query_states, key_states.transpose(2, 3)) / math.sqrt(self.head_dim)
RuntimeError: CUDA error: CUBLAS_STATUS_ALLOC_FAILED when calling cublasCreate(handle)
Output generated in 2.11 seconds (0.00 tokens/s, 0 tokens, context 36, seed 630556558)

PandaEyesPrime avatar May 08 '23 18:05 PandaEyesPrime

This issue has been closed due to inactivity for 6 weeks. If you believe it is still relevant, please leave a comment below. You can tag a developer in your comment.

github-actions[bot] avatar Oct 10 '23 23:10 github-actions[bot]