text-generation-webui
WizardLM support
Describe the bug
I am testing out the new WizardLM 4-bit GPTQ and it works great with agent LLM (compared to Vicuna); however, it is very, very slow, about 5x slower than Vicuna on a 3060. Sometimes when responding, the LLM fails with:
RuntimeError: probability tensor contains either inf, nan or element < 0
Output generated in 97.55 seconds (0.00 tokens/s, 0 tokens, context 1094, seed 271358257)
(The full console output is in the Logs section below.)
Is there an existing issue for this?
- [X] I have searched the existing issues
Reproduction
Use `--api` with WizardLM (a sketch of the API call I was using is below).
If you want it to happen more often, set the token size higher. This is not a VRAM issue: even with auto-devices and VRAM staying near 7/12 GB, it tends to give me the same traceback ending in:
RuntimeError: probability tensor contains either inf, nan or element < 0
Output generated in 203.62 seconds (0.00 tokens/s, 0 tokens, context 1110, seed 838518441)
(See the Logs section for the full traceback.)
This would also likely happen with the regular web UI and no API; I haven't tested that, though.
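For reference, here is roughly how the API was being hit. This is a minimal sketch assuming the payload format of the legacy blocking `/api/v1/generate` endpoint from the bundled API example at the time; the field names, the placeholder prompt and sampling values, and the response shape are assumptions and may differ on other versions.

```python
import requests

# Minimal sketch of a request to text-generation-webui's legacy blocking API.
# Assumes the server was started with --api (API on port 5000 by default).
HOST = "http://127.0.0.1:5000"

payload = {
    # Prompt and sampling parameters are illustrative placeholders, not the
    # exact values in use when the error above occurred.
    "prompt": "### Instruction:\nWrite a short poem about GPUs.\n\n### Response:\n",
    "max_new_tokens": 200,
    "do_sample": True,
    "temperature": 0.7,
    "top_p": 0.9,
    "seed": -1,
}

response = requests.post(f"{HOST}/api/v1/generate", json=payload, timeout=600)
response.raise_for_status()

# In the version tested, the endpoint returned {"results": [{"text": "..."}]}.
print(response.json()["results"][0]["text"])
```

Raising `max_new_tokens` here corresponds to the "token size" mentioned above.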
Screenshot
No response
Logs
Gradio HTTP request redirected to localhost :)
🔴 xformers not found! Please install it before trying to use it.
bin C:\Users\Nick\Desktop\4_26booga\installer_files\env\lib\site-packages\bitsandbytes\libbitsandbytes_cuda117.dll
Loading wizardLM-7B-GPTQ-4bit-128g...
Found the following quantized model: models\wizardLM-7B-GPTQ-4bit-128g\wizardLM-7B-GPTQ-4bit-128g.pt
Loading model ...
Done.
Using the following device map for the quantized model: {'': 0}
Replaced attention with xformers_attention
Loaded the model in 8.81 seconds.
Starting streaming server at ws://0.0.0.0:5005/api/v1/stream
Starting API at http://0.0.0.0:5000/api
Running on local URL: http://0.0.0.0:7860
To create a public link, set `share=True` in `launch()`.
127.0.0.1 - - [28/Apr/2023 04:04:20] "POST /api/v1/generate HTTP/1.1" 200 -
Traceback (most recent call last):
File "C:\Users\Nick\Desktop\4_26booga\text-generation-webui\modules\text_generation.py", line 272, in generate_reply
output = shared.model.generate(**generate_params)[0]
File "C:\Users\Nick\Desktop\4_26booga\installer_files\env\lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "C:\Users\Nick\Desktop\4_26booga\installer_files\env\lib\site-packages\transformers\generation\utils.py", line 1485, in generate
return self.sample(
File "C:\Users\Nick\Desktop\4_26booga\installer_files\env\lib\site-packages\transformers\generation\utils.py", line 2560, in sample
next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
Output generated in 97.55 seconds (0.00 tokens/s, 0 tokens, context 1094, seed 271358257)
127.0.0.1 - - [28/Apr/2023 04:06:02] "POST /api/v1/generate HTTP/1.1" 200 -
Output generated in 129.04 seconds (1.43 tokens/s, 185 tokens, context 154, seed 145439849)
127.0.0.1 - - [28/Apr/2023 04:08:12] "POST /api/v1/generate HTTP/1.1" 200 -
Output generated in 137.52 seconds (1.24 tokens/s, 170 tokens, context 315, seed 543369230)
127.0.0.1 - - [28/Apr/2023 04:10:31] "POST /api/v1/generate HTTP/1.1" 200 -
Traceback (most recent call last):
File "C:\Users\Nick\Desktop\4_26booga\text-generation-webui\modules\text_generation.py", line 272, in generate_reply
output = shared.model.generate(**generate_params)[0]
File "C:\Users\Nick\Desktop\4_26booga\installer_files\env\lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "C:\Users\Nick\Desktop\4_26booga\installer_files\env\lib\site-packages\transformers\generation\utils.py", line 1485, in generate
return self.sample(
File "C:\Users\Nick\Desktop\4_26booga\installer_files\env\lib\site-packages\transformers\generation\utils.py", line 2560, in sample
next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
Output generated in 203.62 seconds (0.00 tokens/s, 0 tokens, context 1110, seed 838518441)
127.0.0.1 - - [28/Apr/2023 04:13:56] "POST /api/v1/generate HTTP/1.1" 200 -
System Info
Windows one-click installed web UI, using WizardLM 4-bit GPTQ on an NVIDIA 3060 12 GB.
Why doesn't it give the error `Can't determine model type from model name. Please specify it manually using --model_type argument`?
BTW, WizardLM requires instructions to be in a specific format. See my pull request #1596
Personally, I've never gotten any GPTQ model working, so I can't help as much with that.
Setting `use_cache` to `true` in the model's `config.json` file fixes the speed. The other models already have it as `true`, so it's not a compatibility issue.
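A minimal sketch of what that edit looks like; the fields other than `use_cache` are illustrative placeholders and will vary by model, only the last line is the point:

```json
{
  "architectures": ["LlamaForCausalLM"],
  "model_type": "llama",
  "torch_dtype": "float16",
  "use_cache": true
}
```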
It's strange that it sometimes works in 4-bit mode, but when running the regular WizardLM 7B it just spits out a bunch of garbage.
> Why doesn't it give the error `Can't determine model type from model name. Please specify it manually using --model_type argument`?
>
> BTW, WizardLM requires instructions to be in a specific format. See my pull request #1596. Personally, I've never gotten any GPTQ model working, so I can't help as much with that.
Thank you, I'll try this!
Update: that totally worked, 5-10 t/s.
> Setting `use_cache` to `true` in the model's `config.json` file fixes the speed. The other models already have it as `true`, so it's not a compatibility issue.
> Why doesn't it give the error `Can't determine model type from model name. Please specify it manually using --model_type argument`?
>
> BTW, WizardLM requires instructions to be in a specific format. See my pull request #1596. Personally, I've never gotten any GPTQ model working, so I can't help as much with that.
I have been getting this error. Do you know what I can do to resolve it? I'm not very familiar with Oobabooga's UI or WizardLM.
I fixed that error in my pull request. But like the error message says, you can add `--model_type llama` to the command-line parameters inside `start-webui.bat`, or in `webui.py` if you don't have a `start-webui.bat`.
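As an illustration only (the exact launch line inside `start-webui.bat` differs between installer versions, and the other flags here are placeholders from a typical 4-bit GPTQ setup, not anyone's actual configuration), the idea is to append the flag to whatever command starts `server.py`:

```
python server.py --model wizardLM-7B-GPTQ-4bit-128g --wbits 4 --groupsize 128 --model_type llama --api
```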
> Why doesn't it give the error `Can't determine model type from model name. Please specify it manually using --model_type argument`?
>
> BTW, WizardLM requires instructions to be in a specific format. See my pull request #1596. Personally, I've never gotten any GPTQ model working, so I can't help as much with that.
Is it supposed to be giving the "can't determine model type" error? I assume not, so if anyone can tell me how to fix it, I would appreciate it.
Check the config JSON file; it might not say `LlamaTokenizer`.
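For reference, a sketch of what that check looks like. I'm assuming the line in question is the `tokenizer_class` entry in the model's `tokenizer_config.json`; some early WizardLM/LLaMA uploads shipped the old `LLaMATokenizer` capitalization, which newer transformers releases don't recognize, instead of `LlamaTokenizer` (the `model_max_length` value below is just a placeholder):

```json
{
  "tokenizer_class": "LlamaTokenizer",
  "model_max_length": 2048
}
```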
Update: I'm getting a bug where I can load the WizardLM model, but if I unload it, it gives me an allocation error of 2 GB even though my GPU and/or CPU (when I use auto-devices) have more than enough memory. Really weird. The only way to get it to load again is to restart my PC; then it loads perfectly fine the first time, until I unload it and the same thing happens again.
Oh man!
That `"use_cache": true` line in `config.json` is SOOOO MUCH of a difference. Why isn't that there out of the box?
Thanks @kaiio14 for this!!!
> Setting `use_cache` to `true` in the model's `config.json` file fixes the speed. The other models already have it as `true`, so it's not a compatibility issue.
This issue has been closed due to inactivity for 6 weeks. If you believe it is still relevant, please leave a comment below. You can tag a developer in your comment.