text-generation-webui
"cublasLt ran into an error" with older GPU in 8-bit mode
### Describe the bug
My device is a GTX 1650 4GB, i5-12400, 40GB RAM, Ubuntu 20.04, CUDA 11.8.
I have set up llama-7b according to the wiki.
I can run it with `python server.py --listen --auto-devices --model llama-7b` and everything goes well!
But I can't run it with `--load-in-8bit`, which, according to https://github.com/oobabooga/text-generation-webui/pull/366, is what I should use.
When I start with `python server.py --listen --auto-devices --model llama-7b --load-in-8bit` there is no error and everything seems fine, but as soon as I click the 'Generate' button in the web UI, this error appears in the terminal:
(textgen) wk:text-generation-webui$ python server.py --listen --auto-devices --model llama-7b --load-in-8bit
Loading llama-7b...
Auto-assiging --gpu-memory 3 for your GPU to try to prevent out-of-memory errors.
You can manually set other values.
===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:136: UserWarning: /home/wk/anaconda3/envs/textgen did not contain libcudart.so as expected! Searching further paths...
warn(msg)
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 7.5
CUDA SETUP: Detected CUDA version 118
CUDA SETUP: Loading binary /home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda118.so...
Loading checkpoint shards: 100%|████████████████| 33/33 [00:06<00:00, 4.81it/s]
Loaded the model in 7.58 seconds.
/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/gradio/deprecation.py:40: UserWarning: The 'type' parameter has been deprecated. Use the Number component instead.
warnings.warn(value)
Running on local URL: http://0.0.0.0:7860
To create a public link, set `share=True` in `launch()`.
cuBLAS API failed with status 15
A: torch.Size([16, 4096]), B: torch.Size([4096, 4096]), C: (16, 4096); (lda, ldb, ldc): (c_int(512), c_int(131072), c_int(512)); (m, n, k): (c_int(16), c_int(4096), c_int(4096))
Exception in thread Thread-4 (gentask):
error detectedTraceback (most recent call last):
File "/home/wk/anaconda3/envs/textgen/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
self.run()
File "/home/wk/anaconda3/envs/textgen/lib/python3.10/threading.py", line 953, in run
self._target(*self._args, **self._kwargs)
File "/home/wk/data/text-generation-webui/modules/callbacks.py", line 64, in gentask
ret = self.mfunc(callback=_callback, **self.kwargs)
File "/home/wk/data/text-generation-webui/modules/text_generation.py", line 196, in generate_with_callback
shared.model.generate(**kwargs)
File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/transformers/generation/utils.py", line 1452, in generate
return self.sample(
File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/transformers/generation/utils.py", line 2468, in sample
outputs = self(
File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 772, in forward
outputs = self.model(
File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 621, in forward
layer_outputs = decoder_layer(
File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 316, in forward
hidden_states, self_attn_weights, present_key_value = self.self_attn(
File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 216, in forward
query_states = self.q_proj(hidden_states).view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/bitsandbytes/nn/modules.py", line 242, in forward
out = bnb.matmul(x, self.weight, bias=self.bias, state=self.state)
File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py", line 488, in matmul
return MatMul8bitLt.apply(A, B, out, bias, state)
File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py", line 377, in forward
out32, Sout32 = F.igemmlt(C32A, state.CxB, SA, state.SB)
File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/bitsandbytes/functional.py", line 1410, in igemmlt
raise Exception('cublasLt ran into an error!')
Exception: cublasLt ran into an error!
### Is there an existing issue for this?

- [X] I have searched the existing issues
### Reproduction
This does not only happen with llama-7b; it can easily be reproduced with any other model.
For example, run `python server.py --listen --model opt-1.3b --load-in-8bit`.
There is no error at first, but as soon as you enter anything in the web UI and click the 'Generate' button, the error appears in the terminal. It seems the bug has something to do with cublasLt, like a CUDA bug.
And there is no bug on the CPU: `python server.py --listen --model opt-1.3b --load-in-8bit` runs fine there.
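As a side note, the failure can probably be reproduced outside the web UI with a few lines of bitsandbytes (a rough sketch on my part, assuming a CUDA build of PyTorch; the shapes mirror the llama-7b q_proj matmul in the traceback above):

```python
# Minimal sketch: call bitsandbytes' 8-bit linear layer directly, outside the web UI.
# On affected GPUs this should raise the same "cublasLt ran into an error!" exception.
import torch
import bitsandbytes as bnb

# Shapes mirror the failing q_proj matmul from the traceback (16 x 4096 by 4096 x 4096).
layer = bnb.nn.Linear8bitLt(4096, 4096, bias=False, has_fp16_weights=False)
layer = layer.cuda()  # moving the layer to the GPU quantizes its weights to int8

x = torch.randn(16, 4096, dtype=torch.float16, device="cuda")
out = layer(x)
print(out.shape)
```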
### Screenshot
No response
### Logs
(textgen) wk:text-generation-webui$ python server.py --listen --model opt-1.3b --load-in-8bit
Loading opt-1.3b...
===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:136: UserWarning: /home/wk/anaconda3/envs/textgen did not contain libcudart.so as expected! Searching further paths...
warn(msg)
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 7.5
CUDA SETUP: Detected CUDA version 118
CUDA SETUP: Loading binary /home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda118.so...
Loaded the model in 3.34 seconds.
/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/gradio/deprecation.py:40: UserWarning: The 'type' parameter has been deprecated. Use the Number component instead.
warnings.warn(value)
Running on local URL: http://0.0.0.0:7860
To create a public link, set `share=True` in `launch()`.
cuBLAS API failed with status 15
A: torch.Size([16, 2048]), B: torch.Size([2048, 2048]), C: (16, 2048); (lda, ldb, ldc): (c_int(512), c_int(65536), c_int(512)); (m, n, k): (c_int(16), c_int(2048), c_int(2048))
Exception in thread Thread-3 (gentask):
error detectedTraceback (most recent call last):
File "/home/wk/anaconda3/envs/textgen/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
self.run()
File "/home/wk/anaconda3/envs/textgen/lib/python3.10/threading.py", line 953, in run
self._target(*self._args, **self._kwargs)
File "/home/wk/data/text-generation-webui/modules/callbacks.py", line 64, in gentask
ret = self.mfunc(callback=_callback, **self.kwargs)
File "/home/wk/data/text-generation-webui/modules/text_generation.py", line 196, in generate_with_callback
shared.model.generate(**kwargs)
File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/transformers/generation/utils.py", line 1452, in generate
return self.sample(
File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/transformers/generation/utils.py", line 2468, in sample
outputs = self(
File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/transformers/models/opt/modeling_opt.py", line 930, in forward
outputs = self.model.decoder(
File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/transformers/models/opt/modeling_opt.py", line 696, in forward
layer_outputs = decoder_layer(
File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/transformers/models/opt/modeling_opt.py", line 326, in forward
hidden_states, self_attn_weights, present_key_value = self.self_attn(
File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/transformers/models/opt/modeling_opt.py", line 171, in forward
query_states = self.q_proj(hidden_states) * self.scaling
File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/bitsandbytes/nn/modules.py", line 242, in forward
out = bnb.matmul(x, self.weight, bias=self.bias, state=self.state)
File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py", line 488, in matmul
return MatMul8bitLt.apply(A, B, out, bias, state)
File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py", line 377, in forward
out32, Sout32 = F.igemmlt(C32A, state.CxB, SA, state.SB)
File "/home/wk/anaconda3/envs/textgen/lib/python3.10/site-packages/bitsandbytes/functional.py", line 1410, in igemmlt
raise Exception('cublasLt ran into an error!')
Exception: cublasLt ran into an error!
### System Info
```shell
my device is GTX 1650 4GB, i5-12400, 40GB RAM, Ubuntu 20.04, CUDA 11.8
```
I got the same error. I use 2 GPUs and am trying to run pygmalion-2.7b in 8-bit. I am on Windows.
My start-webui.bat file:
`call python server.py --auto-devices --cai-chat --share --gpu-memory 5 3 --load-in-8bit`
I also did this: https://www.reddit.com/r/PygmalionAI/comments/1115gom/running_pygmalion_6b_with_8gb_of_vram/ But I use libbitsandbytes_cudaall.dll for my GeForce 1660 + 960 cards.
This also happens to me on a GTX 1650 GPU.
I think 8-bit in bitsandbytes requires Turing (20xx) or later: https://github.com/TimDettmers/bitsandbytes#requirements--installation

> LLM.int8(): NVIDIA Turing (RTX 20xx; T4) or Ampere GPU (RTX 30xx; A4-A100); (a GPU from 2018 or older).
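For what it's worth, a quick way to see what your card reports is a few lines of PyTorch (a small sketch; the webui environment already has torch installed, and note the log above already shows 7.5 for the GTX 1650):

```python
# Small sketch: print each visible GPU's name and compute capability,
# to compare against the hardware requirements quoted above.
import torch

for i in range(torch.cuda.device_count()):
    major, minor = torch.cuda.get_device_capability(i)
    print(f"cuda:{i} {torch.cuda.get_device_name(i)} -> compute capability {major}.{minor}")
```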
On older GPUs it will NEVER work with the int8 threshold at 6. But I get a NaN error, not this error, on my P6000. I am using the pre-"fixed" bitsandbytes that never completed the "CUDA setup" part.
I'll try it with the new bitsandbytes that I don't have to patch and see if I get this error instead.
But rest assured that it is possible.
So I have been having this error too.
My setup: Ryzen 5800X, 32GB DDR4 (25GB ZRAM compressed swap), 3060 Ti (8GB) and 2080 Super (8GB), Ubuntu 22.04, CUDA 11.8, PyTorch 2.0+cu118.
I got it to generate by setting `export CUDA_VISIBLE_DEVICES=1,0` (on Windows, use `set CUDA_VISIBLE_DEVICES=1,0`, but I haven't tested that yet).
Note that I swapped the numbers around; setting it to 0,1 always resulted in the error.
This doesn't help those with single GPUs, but it's a start, I hope.
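If it helps, the same reordering can also be done from Python, as long as it happens before torch initializes CUDA (a sketch of what I mean, not something the webui does automatically):

```python
# Sketch: pin the GPU order from Python instead of the shell.
# This must run before anything imports torch / initializes CUDA.
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "1,0"  # swapped order, as described above

import torch

print("Visible GPUs, in the order CUDA will use them:")
for i in range(torch.cuda.device_count()):
    print(f"  cuda:{i} -> {torch.cuda.get_device_name(i)}")
```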
I am on Windows 11 and I am able to load the LLaMA 7B model in 4-bit on my GTX 1060 6GB using the 'allarch' 0.37.0 bitsandbytes from this repo: https://github.com/james-things/bitsandbytes-prebuilt-all_arch.
I thought it would work natively on Linux, since the author of bitsandbytes made the int8 function backward compatible so that even Pascal cards can run it. Perhaps you need to compile the .so again, the way Windows users use a fixed .dll? Not sure.
I'm sure there is a solution to this, 110%. My card is older than yours and 4-bit is working fine on it. See if the instructions here help you: https://www.reddit.com/r/LocalLLaMA/comments/11o6o3f/how_to_install_llama_8bit_and_4bit/ I was finally able to get 4-bit working after following them.
> I am on Windows 11 and I am able to load the LLaMA 7B model in 4-bit on my GTX 1060 6GB using the 'allarch' 0.37.0 bitsandbytes from this repo: https://github.com/james-things/bitsandbytes-prebuilt-all_arch. I thought it would work natively on Linux, since the author of bitsandbytes made the int8 function backward compatible so that even Pascal cards can run it. Perhaps you need to compile the .so again, the way Windows users use a fixed .dll? I'm sure there is a solution to this, 110%. My card is older than yours and 4-bit is working fine on it.

I compiled bitsandbytes from source, as well as trying the pip package, just to avoid the .so issue. 4-bit works; it's only 8-bit that was causing me headaches. And it looks like we need 8-bit to use LoRAs with LLaMA.
> I compiled bitsandbytes from source, as well as trying the pip package, just to avoid the .so issue. 4-bit works; it's only 8-bit that was causing me headaches. And it looks like we need 8-bit to use LoRAs with LLaMA.

I think your issue might be related to an improper installation, because from what I understand these 8-bit issues only affect older GPUs from the 1xxx series and below. Your 2080 Super and 3060 Ti are perfectly compatible, even with the native int8 function from bitsandbytes; you shouldn't have any need to compile from source...
Perhaps try running in 16-bit. You have 16GB of VRAM, which should be more than enough.
@lolxdmainkaisemaanlu thank you.
Can you tell me which folder I should put bitsandbytes-prebuilt-all_arch/0.37.0/libbitsandbytes_cudaall.dll in?
And do I have to change any code in this webui?
Put it in `installer_files\env\lib\site-packages\bitsandbytes\`, but there is still the same bug for me.
> I am on Windows 11 and I am able to load the LLaMA 7B model in 4-bit on my GTX 1060 6GB using the 'allarch' 0.37.0 bitsandbytes from this repo: https://github.com/james-things/bitsandbytes-prebuilt-all_arch. I thought it would work natively on Linux, since the author of bitsandbytes made the int8 function backward compatible so that even Pascal cards can run it. Perhaps you need to compile the .so again, the way Windows users use a fixed .dll? Not sure. I'm sure there is a solution to this, 110%. My card is older than yours and 4-bit is working fine on it. See if the instructions here help you: https://www.reddit.com/r/LocalLLaMA/comments/11o6o3f/how_to_install_llama_8bit_and_4bit/ I was finally able to get 4-bit working after following them.

Did you try to run it in 8-bit? Do you get the error then or not?
I came here to tell you that the newly accepted transformers is slow for me, and I have no clue what is wrong on your cards or why mine works.
I patch models.py like this: https://pastebin.com/siPxZvkc
And then I can generate away: https://pastebin.com/R3JCmJ9L
I can even do the LoRA just fine.
The fixed bitsandbytes from PyPI works; it's just more verbose in its messages.
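As a rough sketch of this kind of threshold override (not the pastebin patch itself; the model name, the threshold value, and whether it helps on a given card are all assumptions on my part), using transformers' BitsAndBytesConfig:

```python
# Hypothetical sketch: load a model in 8-bit while overriding the LLM.int8()
# outlier threshold. The default threshold is 6.0; whether another value helps
# on a pre-Turing card is an assumption to be tested, not a guarantee.
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "facebook/opt-1.3b"  # stand-in for the models discussed in this thread

quant_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_threshold=0.0,  # adjust experimentally
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    quantization_config=quant_config,
)

inputs = tokenizer("Hello,", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```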
I also have this error on GTX 1660 Ti. I'm guessing this means GTX 16XX series isn't compatible despite also being Turing architecture.
Looks like the GTX 16XX does support 8-bit; it just wasn't enabled in bitsandbytes until now: https://github.com/TimDettmers/bitsandbytes/pull/292 So starting with bitsandbytes 0.38.0, these GPUs should work.
EDIT: Just tested with bitsandbytes upgraded to 0.38.0.post2 on GTX 1660 Ti and it works perfectly.
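After upgrading, a quick sanity check that the environment is really picking up the new build (a small sketch; the exact output will differ per setup):

```python
# Sketch: confirm which bitsandbytes build and GPU the environment actually sees.
from importlib.metadata import version

import torch

print("bitsandbytes:", version("bitsandbytes"))  # should report 0.38.0 or newer
print("torch CUDA runtime:", torch.version.cuda)
print("GPU:", torch.cuda.get_device_name(0),
      "compute capability", torch.cuda.get_device_capability(0))
```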
Try rebuilding bitsandbytes from https://github.com/TimDettmers/bitsandbytes. My env: GeForce 3090, Driver Version: 510.47.03, CUDA Version: 11.6.
Fix: `git clone https://github.com/timdettmers/bitsandbytes.git`, `cd bitsandbytes`, `CUDA_VERSION=116 make cuda116`, `python setup.py install`
I had the same issue when I wanted to load the model in 8-bit. Loading the model in 4-bit (`load-in-4bit=True`) solved my problem.
This issue has been closed due to inactivity for 6 weeks. If you believe it is still relevant, please leave a comment below. You can tag a developer in your comment.