text-generation-webui
WizardLM support
Describe the bug
I am testing out the new WizardLM 4-bit GPTQ and it works great with agent LLM (compared to Vicuna); however, it is very, very slow, about 5x slower than Vicuna on a 3060. Sometimes when responding, the LLM fails with:
RuntimeError: probability tensor contains either inf, nan or element < 0
Output generated in 97.55 seconds (0.00 tokens/s, 0 tokens, context 1094, seed 271358257)
(The full console output is in the Logs section below.)
Is there an existing issue for this?
- [X] I have searched the existing issues
Reproduction
Use `--api` with WizardLM (a sketch of the API call I was using is below).
If you want it to happen more often, set the token size higher. This is not a VRAM issue: even with auto-devices and VRAM staying near 7/12 GB, it tends to give me the same traceback ending in:
RuntimeError: probability tensor contains either inf, nan or element < 0
Output generated in 203.62 seconds (0.00 tokens/s, 0 tokens, context 1110, seed 838518441)
(See the Logs section for the full traceback.)
This would also likely happen with the regular web UI and no API; I haven't tested that, though.
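For reference, here is roughly how the API was being hit. This is a minimal sketch assuming the payload format of the legacy blocking `/api/v1/generate` endpoint from the bundled API example at the time; the field names, the placeholder prompt and sampling values, and the response shape are assumptions and may differ on other versions.

```python
import requests

# Minimal sketch of a request to text-generation-webui's legacy blocking API.
# Assumes the server was started with --api (API on port 5000 by default).
HOST = "http://127.0.0.1:5000"

payload = {
    # Prompt and sampling parameters are illustrative placeholders, not the
    # exact values in use when the error above occurred.
    "prompt": "### Instruction:\nWrite a short poem about GPUs.\n\n### Response:\n",
    "max_new_tokens": 200,
    "do_sample": True,
    "temperature": 0.7,
    "top_p": 0.9,
    "seed": -1,
}

response = requests.post(f"{HOST}/api/v1/generate", json=payload, timeout=600)
response.raise_for_status()

# In the version tested, the endpoint returned {"results": [{"text": "..."}]}.
print(response.json()["results"][0]["text"])
```

Raising `max_new_tokens` here corresponds to the "token size" mentioned above.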
Screenshot
No response
Logs
Gradio HTTP request redirected to localhost :)
🔴 xformers not found! Please install it before trying to use it.
bin C:\Users\Nick\Desktop\4_26booga\installer_files\env\lib\site-packages\bitsandbytes\libbitsandbytes_cuda117.dll
Loading wizardLM-7B-GPTQ-4bit-128g...
Found the following quantized model: models\wizardLM-7B-GPTQ-4bit-128g\wizardLM-7B-GPTQ-4bit-128g.pt
Loading model ...
Done.
Using the following device map for the quantized model: {'': 0}
Replaced attention with xformers_attention
Loaded the model in 8.81 seconds.
Starting streaming server at ws://0.0.0.0:5005/api/v1/stream
Starting API at http://0.0.0.0:5000/api
Running on local URL: http://0.0.0.0:7860
To create a public link, set `share=True` in `launch()`.
127.0.0.1 - - [28/Apr/2023 04:04:20] "POST /api/v1/generate HTTP/1.1" 200 -
Traceback (most recent call last):
File "C:\Users\Nick\Desktop\4_26booga\text-generation-webui\modules\text_generation.py", line 272, in generate_reply
output = shared.model.generate(**generate_params)[0]
File "C:\Users\Nick\Desktop\4_26booga\installer_files\env\lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "C:\Users\Nick\Desktop\4_26booga\installer_files\env\lib\site-packages\transformers\generation\utils.py", line 1485, in generate
return self.sample(
File "C:\Users\Nick\Desktop\4_26booga\installer_files\env\lib\site-packages\transformers\generation\utils.py", line 2560, in sample
next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
Output generated in 97.55 seconds (0.00 tokens/s, 0 tokens, context 1094, seed 271358257)
127.0.0.1 - - [28/Apr/2023 04:06:02] "POST /api/v1/generate HTTP/1.1" 200 -
Output generated in 129.04 seconds (1.43 tokens/s, 185 tokens, context 154, seed 145439849)
127.0.0.1 - - [28/Apr/2023 04:08:12] "POST /api/v1/generate HTTP/1.1" 200 -
Output generated in 137.52 seconds (1.24 tokens/s, 170 tokens, context 315, seed 543369230)
127.0.0.1 - - [28/Apr/2023 04:10:31] "POST /api/v1/generate HTTP/1.1" 200 -
Traceback (most recent call last):
File "C:\Users\Nick\Desktop\4_26booga\text-generation-webui\modules\text_generation.py", line 272, in generate_reply
output = shared.model.generate(**generate_params)[0]
File "C:\Users\Nick\Desktop\4_26booga\installer_files\env\lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "C:\Users\Nick\Desktop\4_26booga\installer_files\env\lib\site-packages\transformers\generation\utils.py", line 1485, in generate
return self.sample(
File "C:\Users\Nick\Desktop\4_26booga\installer_files\env\lib\site-packages\transformers\generation\utils.py", line 2560, in sample
next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
Output generated in 203.62 seconds (0.00 tokens/s, 0 tokens, context 1110, seed 838518441)
127.0.0.1 - - [28/Apr/2023 04:13:56] "POST /api/v1/generate HTTP/1.1" 200 -
System Info
Windows one-click installed web UI, using WizardLM 4-bit GPTQ on an NVIDIA 3060 12 GB.
Why doesn't it give the error `Can't determine model type from model name. Please specify it manually using --model_type argument`?
BTW, WizardLM requires instructions to be in a specific format. See my pull request #1596
Personally, I've never gotten any GPTQ model working, so I can't help as much with that.
Setting `use_cache` to `true` in the model's `config.json` file fixes the speed. The other models already have it as `true`, so it's not a compatibility issue.
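A minimal sketch of what that edit looks like; the fields other than `use_cache` are illustrative placeholders and will vary by model, only the last line is the point:

```json
{
  "architectures": ["LlamaForCausalLM"],
  "model_type": "llama",
  "torch_dtype": "float16",
  "use_cache": true
}
```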
It's strange that it sometimes works in 4-bit mode, but when running the regular WizardLM 7B it just spits out a bunch of garbage.
> Why doesn't it give the error `Can't determine model type from model name. Please specify it manually using --model_type argument`?
>
> BTW, WizardLM requires instructions to be in a specific format. See my pull request #1596. Personally, I've never gotten any GPTQ model working, so I can't help as much with that.
Thank you, I'll try this!
Update: that totally worked, 5-10 t/s.
> Setting `use_cache` to `true` in the model's `config.json` file fixes the speed. The other models already have it as `true`, so it's not a compatibility issue.
> Why doesn't it give the error `Can't determine model type from model name. Please specify it manually using --model_type argument`?
>
> BTW, WizardLM requires instructions to be in a specific format. See my pull request #1596. Personally, I've never gotten any GPTQ model working, so I can't help as much with that.
I have been getting this error. Do you know what I can do to resolve it? I'm not very familiar with Oobabooga's UI or WizardLM.
I fixed that error in my pull request. But like the error message says, you can add `--model_type llama` to the command-line parameters inside `start-webui.bat`, or in `webui.py` if you don't have a `start-webui.bat`.
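As an illustration only (the exact launch line inside `start-webui.bat` differs between installer versions, and the other flags here are placeholders from a typical 4-bit GPTQ setup, not anyone's actual configuration), the idea is to append the flag to whatever command starts `server.py`:

```
python server.py --model wizardLM-7B-GPTQ-4bit-128g --wbits 4 --groupsize 128 --model_type llama --api
```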
> Why doesn't it give the error `Can't determine model type from model name. Please specify it manually using --model_type argument`?
>
> BTW, WizardLM requires instructions to be in a specific format. See my pull request #1596. Personally, I've never gotten any GPTQ model working, so I can't help as much with that.
Is it supposed to be giving the "can't determine model type" error? I assume not, so if anyone can tell me how to fix it, I would appreciate it.
Check the config JSON file; it might not say `LlamaTokenizer`.
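For reference, a sketch of what that check looks like. I'm assuming the line in question is the `tokenizer_class` entry in the model's `tokenizer_config.json`; some early WizardLM/LLaMA uploads shipped the old `LLaMATokenizer` capitalization, which newer transformers releases don't recognize, instead of `LlamaTokenizer` (the `model_max_length` value below is just a placeholder):

```json
{
  "tokenizer_class": "LlamaTokenizer",
  "model_max_length": 2048
}
```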
Update: I'm getting a bug where I can load the WizardLM model, but if I unload it, it gives me an allocation error of 2 GB even though my GPU and/or CPU (when I use auto-devices) have more than enough memory. Really weird. The only way to get it to load again is to restart my PC; then it loads perfectly fine the first time, until I unload it and the same thing happens again.
Oh man!
That `"use_cache": true` line in `config.json` is SOOOO MUCH of a difference. Why isn't that there out of the box?
Thanks @kaiio14 for this!!!
> Setting `use_cache` to `true` in the model's `config.json` file fixes the speed. The other models already have it as `true`, so it's not a compatibility issue.
This issue has been closed due to inactivity for 6 weeks. If you believe it is still relevant, please leave a comment below. You can tag a developer in your comment.