
"cublasLt ran into an error" with older GPU in 8-bit mode

Open wk-mike opened this issue 1 year ago • 15 comments

Describe the bug

My device: GTX 1650 4GB, i5-12400, 40GB RAM, Ubuntu 20.04, CUDA 11.8.

I have set up llama-7b according to the wiki. I can run it with python server.py --listen --auto-devices --model llama-7b and everything goes well!

But I can't run with --load-in-8bit. According to https://github.com/oobabooga/text-generation-webui/pull/366, I should be able to use this flag. When I start with python server.py --listen --auto-devices --model llama-7b --load-in-8bit there is no error and everything seems fine, but once I open the web UI and click the 'Generate' button,

this error appears in the terminal:

(textgen) wk:text-generation-webui$ python server.py --listen --auto-devices --model llama-7b --load-in-8bit
Loading llama-7b...
Auto-assiging --gpu-memory 3 for your GPU to try to prevent out-of-memory errors.
You can manually set other values.

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:136: UserWarning: /home/wk/anaconda3/envs/textgen did not contain libcudart.so as expected! Searching further paths...
  warn(msg)
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 7.5
CUDA SETUP: Detected CUDA version 118
CUDA SETUP: Loading binary /home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda118.so...
Loading checkpoint shards: 100%|████████████████| 33/33 [00:06<00:00,  4.81it/s]
Loaded the model in 7.58 seconds.
/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/gradio/deprecation.py:40: UserWarning: The 'type' parameter has been deprecated. Use the Number component instead.
  warnings.warn(value)
Running on local URL:  http://0.0.0.0:7860

To create a public link, set `share=True` in `launch()`.
cuBLAS API failed with status 15
A: torch.Size([16, 4096]), B: torch.Size([4096, 4096]), C: (16, 4096); (lda, ldb, ldc): (c_int(512), c_int(131072), c_int(512)); (m, n, k): (c_int(16), c_int(4096), c_int(4096))
Exception in thread Thread-4 (gentask):
Traceback (most recent call last):
  File "/home/wk/anaconda3/envs/textgen/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/home/wk/anaconda3/envs/textgen/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/home/wk/data/text-generation-webui/modules/callbacks.py", line 64, in gentask
    ret = self.mfunc(callback=_callback, **self.kwargs)
  File "/home/wk/data/text-generation-webui/modules/text_generation.py", line 196, in generate_with_callback
    shared.model.generate(**kwargs)
  File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/transformers/generation/utils.py", line 1452, in generate
    return self.sample(
  File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/transformers/generation/utils.py", line 2468, in sample
    outputs = self(
  File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 772, in forward
    outputs = self.model(
  File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 621, in forward
    layer_outputs = decoder_layer(
  File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 316, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 216, in forward
    query_states = self.q_proj(hidden_states).view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
  File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/bitsandbytes/nn/modules.py", line 242, in forward
    out = bnb.matmul(x, self.weight, bias=self.bias, state=self.state)
  File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py", line 488, in matmul
    return MatMul8bitLt.apply(A, B, out, bias, state)
  File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py", line 377, in forward
    out32, Sout32 = F.igemmlt(C32A, state.CxB, SA, state.SB)
  File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/bitsandbytes/functional.py", line 1410, in igemmlt
    raise Exception('cublasLt ran into an error!')
Exception: cublasLt ran into an error!

Is there an existing issue for this?

  • [X] I have searched the existing issues

Reproduction

This does not only happen with llama-7b; it can easily be reproduced with other models, for example: run python server.py --listen --model opt-1.3b --load-in-8bit

There is no error, but once you open the web UI, enter anything, and click the 'Generate' button,

the error appears in the terminal. It seems the bug has something to do with cublasLt, like a CUDA bug.

And there is no bug on CPU; python server.py --listen --model opt-1.3b --load-in-8bit runs well there.
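
For reference, the same bitsandbytes 8-bit path can be exercised outside the webui with a minimal script like the sketch below (assuming transformers, accelerate, and bitsandbytes are installed; the model name is only an example). On an affected GPU it should raise the same "cublasLt ran into an error!" during generate():

```python
# Minimal standalone sketch of the failing path: load a small model in 8-bit
# via bitsandbytes and run one generation step.
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "facebook/opt-1.3b"  # example model; any 8-bit-loaded model reproduces it
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, device_map="auto", load_in_8bit=True)

inputs = tok("Hello,", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=5)
print(tok.decode(out[0]))
```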

Screenshot

No response

Logs

(textgen) wk:text-generation-webui$ python server.py --listen  --model opt-1.3b --load-in-8bit
Loading opt-1.3b...

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:136: UserWarning: /home/wk/anaconda3/envs/textgen did not contain libcudart.so as expected! Searching further paths...
  warn(msg)
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 7.5
CUDA SETUP: Detected CUDA version 118
CUDA SETUP: Loading binary /home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda118.so...
Loaded the model in 3.34 seconds.
/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/gradio/deprecation.py:40: UserWarning: The 'type' parameter has been deprecated. Use the Number component instead.
  warnings.warn(value)
Running on local URL:  http://0.0.0.0:7860

To create a public link, set `share=True` in `launch()`.
cuBLAS API failed with status 15
A: torch.Size([16, 2048]), B: torch.Size([2048, 2048]), C: (16, 2048); (lda, ldb, ldc): (c_int(512), c_int(65536), c_int(512)); (m, n, k): (c_int(16), c_int(2048), c_int(2048))
Exception in thread Thread-3 (gentask):
Traceback (most recent call last):
  File "/home/wk/anaconda3/envs/textgen/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/home/wk/anaconda3/envs/textgen/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/home/wk/data/text-generation-webui/modules/callbacks.py", line 64, in gentask
    ret = self.mfunc(callback=_callback, **self.kwargs)
  File "/home/wk/data/text-generation-webui/modules/text_generation.py", line 196, in generate_with_callback
    shared.model.generate(**kwargs)
  File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/transformers/generation/utils.py", line 1452, in generate
    return self.sample(
  File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/transformers/generation/utils.py", line 2468, in sample
    outputs = self(
  File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/transformers/models/opt/modeling_opt.py", line 930, in forward
    outputs = self.model.decoder(
  File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/transformers/models/opt/modeling_opt.py", line 696, in forward
    layer_outputs = decoder_layer(
  File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/transformers/models/opt/modeling_opt.py", line 326, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/transformers/models/opt/modeling_opt.py", line 171, in forward
    query_states = self.q_proj(hidden_states) * self.scaling
  File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/bitsandbytes/nn/modules.py", line 242, in forward
    out = bnb.matmul(x, self.weight, bias=self.bias, state=self.state)
  File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py", line 488, in matmul
    return MatMul8bitLt.apply(A, B, out, bias, state)
  File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py", line 377, in forward
    out32, Sout32 = F.igemmlt(C32A, state.CxB, SA, state.SB)
  File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/bitsandbytes/functional.py", line 1410, in igemmlt
    raise Exception('cublasLt ran into an error!')
Exception: cublasLt ran into an error!



System Info

My device: GTX 1650 4GB, i5-12400, 40GB RAM, Ubuntu 20.04, CUDA 11.8

wk-mike avatar Mar 17 '23 13:03 wk-mike

I got the same error. I use 2 GPUs and am trying to run pygmalion-2.7b in 8-bit. I use Windows.

My start-webui.bat file: call python server.py --auto-devices --cai-chat --share --gpu-memory 5 3 --load-in-8bit

I also did this: https://www.reddit.com/r/PygmalionAI/comments/1115gom/running_pygmalion_6b_with_8gb_of_vram/ But I use libbitsandbytes_cudaall.dll for my GeForce 1660 + 960 cards.

rafx85 avatar Mar 17 '23 14:03 rafx85

This also happens to me on a GTX 1650 GPU.

oobabooga avatar Mar 17 '23 15:03 oobabooga

I think 8-bit in bitsandbytes requires Turing (20xx) or later: https://github.com/TimDettmers/bitsandbytes#requirements--installation

> LLM.int8(): NVIDIA Turing (RTX 20xx; T4) or Ampere GPU (RTX 30xx; A4-A100); (a GPU from 2018 or older).
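
A quick way to see what compute capability your cards report (a sketch using standard PyTorch calls; note that the GTX 16xx series reports 7.5 like the RTX 20xx but appears to lack the int8 tensor cores the default LLM.int8() path relies on, which is presumably why it still hits this error, as discussed later in this thread):

```python
# Print the compute capability bitsandbytes will detect for each GPU.
import torch

for i in range(torch.cuda.device_count()):
    name = torch.cuda.get_device_name(i)
    major, minor = torch.cuda.get_device_capability(i)
    print(f"GPU {i}: {name}, compute capability {major}.{minor}")
```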

sgsdxzy avatar Mar 17 '23 15:03 sgsdxzy

On older GPUs it will NEVER work with the int8 threshold at 6. But I get a NaN error, not this error, on my P6000. I am using the pre-"fixed" bitsandbytes that never completed the "CUDA setup" part.

I'll try it with the new bitsandbytes that I don't have to patch and see if I get this error instead.

But best believe that it is possible.

(screenshot: 8-bit running on a Pascal card)

Ph0rk0z avatar Mar 17 '23 15:03 Ph0rk0z

So I have been having this error too.

My setup: Ryzen 5800X, 32GB DDR4 (25GB ZRAM compressed swap), 3060 Ti (8GB) and 2080 Super (8GB), Ubuntu 22.04, CUDA 11.8, PyTorch 2.0+cu118.

I got it to generate by setting export CUDA_VISIBLE_DEVICES=1,0 (on Windows, use set CUDA_VISIBLE_DEVICES=1,0, but I haven't tested that yet). Note that I swapped the numbers around; setting it to 0,1 always resulted in the error. See the sketch below.

Doesn't help those with single GPUs, but it's a start, I hope.
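
If you prefer to pin the order from inside a launcher script instead of the shell, the same idea looks roughly like this (a sketch; CUDA_VISIBLE_DEVICES must be set before CUDA is initialized, so it is safest to set it before importing torch):

```python
# Sketch: reorder GPUs from Python; must run before torch initializes CUDA.
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "1,0"  # swap the device order, as above

import torch

for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))  # index 0 is now the second physical card
```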

askmyteapot avatar Mar 17 '23 17:03 askmyteapot

I am on Windows 11 and I am able to load the LLaMA 7b model in 4-bit on my GTX 1060 6GB using the 'allarch' 0.37.0 bitsandbytes from this repo - https://github.com/james-things/bitsandbytes-prebuilt-all_arch.

I thought it would work natively on Linux, since the author of bitsandbytes made the int8 function backward compatible so that even Pascal cards can run it. Perhaps you need to recompile the .so, the same way Windows users use a fixed .dll? Not sure.

I'm 110% sure there is a solution to this. My card is older than yours and 4-bit is working fine on it. See if the instructions here help you: https://www.reddit.com/r/LocalLLaMA/comments/11o6o3f/how_to_install_llama_8bit_and_4bit/ I was finally able to get 4-bit working after following them.

lolxdmainkaisemaanlu avatar Mar 17 '23 17:03 lolxdmainkaisemaanlu

> I am on Windows 11 and I am able to load the LLaMA 7b model in 4-bit on my GTX 1060 6GB using the 'allarch' 0.37.0 bitsandbytes from this repo - https://github.com/james-things/bitsandbytes-prebuilt-all_arch.
>
> I thought it would work natively on Linux, since the author of bitsandbytes made the int8 function backward compatible so that even Pascal cards can run it. Perhaps you need to recompile the .so, the same way Windows users use a fixed .dll?
>
> I'm 110% sure there is a solution to this. My card is older than yours and 4-bit is working fine on it.

I compiled bitsandbytes from source as well as trying the pip package, just to avoid any issue with the .so. 4-bit works; it's only 8-bit that was causing me headaches. And it looks like we need 8-bit to use LoRAs with LLaMA.

askmyteapot avatar Mar 17 '23 17:03 askmyteapot

> I compiled bitsandbytes from source as well as trying the pip package, just to avoid any issue with the .so. 4-bit works; it's only 8-bit that was causing me headaches. And it looks like we need 8-bit to use LoRAs with LLaMA.

I think your issue might be related to an improper installation, because from what I understand these 8-bit issues only occur on older GPUs from the 1xxx series and lower. Your 2080 Super and 3060 Ti are perfectly compatible even with the native int8 function from bitsandbytes; you shouldn't need to compile from source at all...

Perhaps try running in 16-bit. You have 16GB VRAM, which should be more than enough.
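
For reference, a plain fp16 load split across both cards looks roughly like this (a sketch with no bitsandbytes involved; the model name and max_memory limits are only illustrative):

```python
# Sketch: load in fp16 and let accelerate split the weights across two GPUs.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-2.7b",                 # example model
    torch_dtype=torch.float16,
    device_map="auto",                   # spread layers over available GPUs
    max_memory={0: "7GiB", 1: "7GiB"},   # leave headroom on each 8GB card
)
```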

lolxdmainkaisemaanlu avatar Mar 17 '23 17:03 lolxdmainkaisemaanlu

@lolxdmainkaisemaanlu thank you,

Can you tell me which folder I should put bitsandbytes-prebuilt-all_arch/0.37.0/libbitsandbytes_cudaall.dll into?

And do I have to change any code in this webui?

wk-mike avatar Mar 17 '23 17:03 wk-mike

installer_files\env\lib\site-packages\bitsandbytes\

Put it there, but I still get the same bug.

rafx85 avatar Mar 17 '23 19:03 rafx85

> I am on Windows 11 and I am able to load the LLaMA 7b model in 4-bit on my GTX 1060 6GB using the 'allarch' 0.37.0 bitsandbytes from this repo - https://github.com/james-things/bitsandbytes-prebuilt-all_arch.
>
> I thought it would work natively on Linux, since the author of bitsandbytes made the int8 function backward compatible so that even Pascal cards can run it. Perhaps you need to recompile the .so, the same way Windows users use a fixed .dll? Not sure.
>
> I'm 110% sure there is a solution to this. My card is older than yours and 4-bit is working fine on it. See if the instructions here help you: https://www.reddit.com/r/LocalLLaMA/comments/11o6o3f/how_to_install_llama_8bit_and_4bit/ I was finally able to get 4-bit working after following them.

Did you try running it in 8-bit? Do you get the error then or not?

rafx85 avatar Mar 17 '23 19:03 rafx85

I came here to tell you that the newly accepted transformers is slow for me, and I have no clue what is wrong on your cards or why mine works.

I patch models.py like this: https://pastebin.com/siPxZvkc

And then I can generate away: https://pastebin.com/R3JCmJ9L

I can even do the LoRA just fine.

The fixed bitsandbytes from PyPI works; it's just more verbose in its messages.

Ph0rk0z avatar Mar 18 '23 15:03 Ph0rk0z

I also have this error on a GTX 1660 Ti. I'm guessing this means the GTX 16XX series isn't compatible despite also being Turing architecture.

Mar2ck avatar Mar 30 '23 21:03 Mar2ck

Looks like the GTX 16XX does support 8-bit; it just wasn't enabled in bitsandbytes until now: https://github.com/TimDettmers/bitsandbytes/pull/292 So starting with bitsandbytes 0.38.0 these GPUs should work.

EDIT: Just tested with bitsandbytes upgraded to 0.38.0.post2 on GTX 1660 Ti and it works perfectly.
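
If you're not sure which version you ended up with, here is a quick check (a sketch; pip install -U bitsandbytes is the usual way to upgrade):

```python
# Print the installed bitsandbytes version; 0.38.0 or newer should include
# the GTX 16xx support mentioned above.
from importlib.metadata import version

print(version("bitsandbytes"))
```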

Mar2ck avatar Apr 12 '23 20:04 Mar2ck

Try rebuilding bitsandbytes from https://github.com/TimDettmers/bitsandbytes. My env: GeForce 3090, Driver Version 510.47.03, CUDA Version 11.6.

Fix steps:

git clone https://github.com/timdettmers/bitsandbytes.git
cd bitsandbytes
CUDA_VERSION=116 make cuda116
python setup.py install

darrenwang00 avatar Apr 28 '23 15:04 darrenwang00

I had the same issue when I wanted to load the model in 8-bit. Loading the model in 4-bit solved my problem: load-in-4bit=True.
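
In case it helps, the 4-bit load with a recent transformers/bitsandbytes looks roughly like this (a sketch; the model name is only an example):

```python
# Sketch: 4-bit quantized load via bitsandbytes instead of 8-bit.
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-1.3b",
    device_map="auto",
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
)
```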

bekhzod-olimov avatar Nov 23 '23 00:11 bekhzod-olimov

This issue has been closed due to inactivity for 6 weeks. If you believe it is still relevant, please leave a comment below. You can tag a developer in your comment.

github-actions[bot] avatar Jan 05 '24 23:01 github-actions[bot]