
RuntimeError when loading gpt4-x-alpaca or vicuna | 13b

Open abdullahbaa5 opened this issue 1 year ago • 12 comments

Describe the bug

Getting the following RuntimeError when trying to use either of the models listed below.

If I run the server with python server.py --auto-devices --chat and choose the decapoda-research_llama-7b-hf model, it works just fine.

I used the Windows installer to install everything (and I have tried reinstalling).

It seems to be an issue only with the 4-bit models that I have downloaded. Is it because of GPU compatibility issues? Not enough VRAM?

Is there an existing issue for this?

  • [X] I have searched the existing issues

Reproduction

Run the server with: python server.py --auto-devices --chat --wbits 4 --groupsize 128

Choose either gpt4-x-alpaca-13b-native-4bit-128g (CUDA | https://huggingface.co/anon8231489123/gpt4-x-alpaca-13b-native-4bit-128g) or vicuna-13b-GPTQ-4bit-128g (https://huggingface.co/anon8231489123/vicuna-13b-GPTQ-4bit-128g).

Screenshot

No response

Logs

Starting the web UI...

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
CUDA SETUP: CUDA runtime path found: G:\oobabooga-windows\installer_files\env\bin\cudart64_110.dll
CUDA SETUP: Highest compute capability among GPUs detected: 6.1
CUDA SETUP: Detected CUDA version 117
G:\oobabooga-windows\installer_files\env\lib\site-packages\bitsandbytes\cuda_setup\main.py:141: UserWarning: WARNING: Compute capability < 7.5 detected! Only slow 8-bit matmul is supported for your GPU!
  warn(msg)
CUDA SETUP: Loading binary G:\oobabooga-windows\installer_files\env\lib\site-packages\bitsandbytes\libbitsandbytes_cuda117_nocublaslt.dll...
The following models are available:

1. decapoda-research_llama-7b-hf
2. gpt4-x-alpaca-13b-native-4bit-128g
3. vicuna-13b-GPTQ-4bit-128g

Which one do you want to load? 1-3

2

Loading gpt4-x-alpaca-13b-native-4bit-128g...
Loading model ...
Done.
Traceback (most recent call last):
  File "G:\oobabooga-windows\text-generation-webui\server.py", line 302, in <module>
    shared.model, shared.tokenizer = load_model(shared.model_name)
  File "G:\oobabooga-windows\text-generation-webui\modules\models.py", line 176, in load_model
    tokenizer = LlamaTokenizer.from_pretrained(Path(f"{shared.args.model_dir}/{shared.model_name}/"), clean_up_tokenization_spaces=True)
  File "G:\oobabooga-windows\installer_files\env\lib\site-packages\transformers\tokenization_utils_base.py", line 1811, in from_pretrained
    return cls._from_pretrained(
  File "G:\oobabooga-windows\installer_files\env\lib\site-packages\transformers\tokenization_utils_base.py", line 1965, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "G:\oobabooga-windows\installer_files\env\lib\site-packages\transformers\models\llama\tokenization_llama.py", line 96, in __init__
    self.sp_model.Load(vocab_file)
  File "G:\oobabooga-windows\installer_files\env\lib\site-packages\sentencepiece\__init__.py", line 905, in Load
    return self.LoadFromFile(model_file)
  File "G:\oobabooga-windows\installer_files\env\lib\site-packages\sentencepiece\__init__.py", line 310, in LoadFromFile
    return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg)
RuntimeError: Internal: D:\a\sentencepiece\sentencepiece\src\sentencepiece_processor.cc(1102) [model_proto->ParseFromArray(serialized.data(), serialized.size())]
Press any key to continue . . .

System Info

GPU: GTX 1080 8gb
CPU: i7 6850k
RAM: 48GB
OS: Windows 10

abdullahbaa5 avatar Apr 10 '23 19:04 abdullahbaa5

I get the exact same error

jan-tennert avatar Apr 10 '23 22:04 jan-tennert

I get the exact same error

what are your system specs?

abdullahbaa5 avatar Apr 10 '23 22:04 abdullahbaa5

I'm using an AMD GPU (RX 5500 XT) with ROCm and Triton on Linux ~~, but I might have a different problem? I can load the default models, but trying to generate gives me a segmentation fault error :thinking:~~ ignore that, it was a driver issue and I've fixed it

jan-tennert avatar Apr 10 '23 22:04 jan-tennert

I'm using an AMD GPU (RX 5500 XT) using Rocm and Triton on Linux, but I might have a different problem? I can load the default models, but trying to generate gives me a segmentation fault error 🤔

I just tried out a non quantized model for gpt4-x-alpaca and it worked fine... https://huggingface.co/chavinlo/gpt4-x-alpaca

I'm fairly sure the issue has to do with the quantization not being supported, since mine does print the warning "Only slow 8-bit matmul is supported for your GPU!" -- though I'm not sure whether that's related.

abdullahbaa5 avatar Apr 10 '23 22:04 abdullahbaa5

I've got the same error and I've fixed it. Or rather, I have successfully launched the webui and I can chat, but frankly I still don't know what went wrong. To clear things up: this oobabooga webui was designed to run on Linux, not Windows. They have a hack that lets it run natively on Windows, and all of our problems come from running it on Windows.

Why did we see this error? I could be wrong, but here is my guess: we shouldn't follow the instructions in the readme.md, or any instruction in the GPTQ folder such as pip install -r requirements.txt. Those are for Linux users, or for people who have installed WSL on Windows. WSL is a subsystem that lets Windows run Linux commands; it requires Hyper-V, which messes with my VMware installation, so I can't install WSL...

Here are the steps I took (I wrote this as a memo for myself):

  1. Install conda, create an env, and activate that env.

  2. Download the Windows one-click installer repo (git clone it):

    https://github.com/oobabooga/one-click-installers

  3. Go there and execute install.bat in a command prompt. It will take some time (5-10 minutes), and it installs a Windows hack version of GPTQ, which is officially not supported on Windows. bitsandbytes was also not supported on Windows, but a hack version gets installed too.

  4. After installation, you should see a text-generation-webui folder. Go there and use the provided downloader to fetch the small files for the Vicuna 4-bit model:

    cd text-generation-webui
    python download-model.py --text-only anon8231489123/vicuna-13b-GPTQ-4bit-128g

The --text-only parameter tells the downloader to fetch only the small text/config files, not the model weights. Download the model weights yourself from

https://huggingface.co/anon8231489123/vicuna-13b-GPTQ-4bit-128g/resolve/main/vicuna-13b-4bit-128g.safetensors

and move the downloaded file into the oobabooga-windows\text-generation-webui\models\anon8231489123_vicuna-13b-GPTQ-4bit-128g folder (a scripted alternative is sketched after these steps). Done.

  5. Run the model:

    python server.py --model anon8231489123_vicuna-13b-GPTQ-4bit-128g --model_type llama --chat --wbits 4 --groupsize 128
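
A minimal Python sketch of that download step, as an alternative to fetching the file in a browser -- this assumes the huggingface_hub package is available in the env (it is pulled in as a dependency of transformers) and that the target folder matches the layout above; adjust the path to your install:

    # Sketch: fetch the single .safetensors file and copy it into the webui models folder.
    import shutil
    from pathlib import Path

    from huggingface_hub import hf_hub_download

    repo_id = "anon8231489123/vicuna-13b-GPTQ-4bit-128g"
    filename = "vicuna-13b-4bit-128g.safetensors"

    # Downloads into the Hugging Face cache and returns the local path.
    cached_path = hf_hub_download(repo_id=repo_id, filename=filename)

    # Adjust this to wherever your oobabooga-windows folder lives.
    target_dir = Path(r"G:\oobabooga-windows\text-generation-webui\models\anon8231489123_vicuna-13b-GPTQ-4bit-128g")
    target_dir.mkdir(parents=True, exist_ok=True)
    shutil.copy(cached_path, target_dir / filename)
    print("Copied to", target_dir / filename)

The download goes to the Hugging Face cache first and is only copied once it completes, so a partially downloaded file will not be left sitting in the models folder.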

shawhu avatar Apr 11 '23 02:04 shawhu

I'm using an AMD GPU (RX 5500 XT) using Rocm and Triton on Linux, but I might have a different problem? I can load the default models, but trying to generate gives me a segmentation fault error 🤔

I just tried out a non quantized model for gpt4-x-alpaca and it worked fine... https://huggingface.co/chavinlo/gpt4-x-alpaca

I'm fairly sure the issue has to do with the quantization not being supported, since mine does print the warning "Only slow 8-bit matmul is supported for your GPU!" -- though I'm not sure whether that's related.

Try removing --groupsize 128 from the command. Also, could you even load a 4-bit 13B model on 8 GB of VRAM without --pre_layer offloading?

ghost avatar Apr 11 '23 17:04 ghost


Getting this error too, but I'm on Linux; normal models work completely fine.

jan-tennert avatar Apr 11 '23 20:04 jan-tennert

@jan-tennert

I haven't tested it on Linux, although I might set up a Linux box soon... There's a YouTube video showing how to install it on a Linux box.

https://www.youtube.com/watch?v=F_pFH-AngoE

The following is all guesswork on my part. I would say that if you have followed the installation instructions to the letter, then maybe your problem could be fixed by one of these:

a) Make sure the Python version and the Python environment are the ones specified, and that all the requirements are installed correctly; create a new environment and pip install everything from scratch.

b) Try installing on a virtual machine you can rent (one with less VRAM and one with sufficient VRAM, to compare) and see whether you can install and run it there. If you can, maybe consider fixing your Linux box first: reinstall conda, reinstall the NVIDIA driver, etc.

A lot of problems come from Python and its "abysmal" package management. Personally, I've been reinstalling everything from scratch hundreds of times. There are version conflicts, NVIDIA driver problems, CUDA version and Python wheel compatibility problems... And Python has been around for a long time, so Google returns a lot of outdated information; try filtering your searches with a time window.

I hope it helps

shawhu avatar Apr 12 '23 04:04 shawhu

I get this error, and I'm running on Linux

dathide avatar Apr 13 '23 00:04 dathide

I've been wrestling with this problem for the last couple days and finally managed to get it to work on Linux. Check out my guide here

ltngonnguyen avatar Apr 14 '23 17:04 ltngonnguyen

Check if you have the full tokenizer.model file (it should be about 500 KB).
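
A minimal way to test that directly -- a sketch assuming sentencepiece is importable (it is installed in the webui env, as the traceback shows) and using a hypothetical model path; adjust it to your folder:

    import os
    import sentencepiece as spm

    # Hypothetical path: point this at the tokenizer.model inside your model folder.
    path = r"models/anon8231489123_vicuna-13b-GPTQ-4bit-128g/tokenizer.model"

    # A healthy LLaMA tokenizer.model is roughly 500 KB; a git-lfs pointer stub is ~130 bytes.
    print("size:", os.path.getsize(path), "bytes")

    sp = spm.SentencePieceProcessor()
    sp.Load(path)  # a stub or truncated file raises the same RuntimeError as in the log above
    print("tokenizer loads fine, vocab size:", sp.GetPieceSize())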

Strothis avatar Apr 20 '23 17:04 Strothis

Check if you have the full tokenizer.model file (it should be about 500 KB).

Fixed my issue -- anyone coming here: when using git clone, make sure you pay attention to files other than just the model that may still be git-lfs pointers.
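
A small sketch for scanning a model folder for leftover pointer stubs (the folder path is a hypothetical example; a git-lfs pointer is a tiny text file beginning with the lfs spec line, while the real weights and tokenizer are large binary files):

    from pathlib import Path

    # Hypothetical path: adjust to the model folder you cloned.
    model_dir = Path(r"models/anon8231489123_vicuna-13b-GPTQ-4bit-128g")

    LFS_MAGIC = b"version https://git-lfs.github.com/spec/v1"

    for f in sorted(model_dir.iterdir()):
        if not f.is_file():
            continue
        with f.open("rb") as fh:
            head = fh.read(len(LFS_MAGIC))
        if head == LFS_MAGIC:
            print(f"{f.name}: still a git-lfs pointer ({f.stat().st_size} bytes) -- re-download it")
        else:
            print(f"{f.name}: looks complete ({f.stat().st_size} bytes)")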

deepxmatter avatar Apr 22 '23 22:04 deepxmatter

This issue has been closed due to inactivity for 6 weeks. If you believe it is still relevant, please leave a comment below. You can tag a developer in your comment.

github-actions[bot] avatar Oct 02 '23 23:10 github-actions[bot]