text-generation-webui
LLAMA 13B HF: RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
I'm not sure if this is a problem with the weights or the system, but when I try to generate text, it gives me this error.
File "/root/miniconda3/lib/python3.10/site-packages/gradio/routes.py", line 374, in run_predict
output = await app.get_blocks().process_api(
File "/root/miniconda3/lib/python3.10/site-packages/gradio/blocks.py", line 1017, in process_api
result = await self.call_function(
File "/root/miniconda3/lib/python3.10/site-packages/gradio/blocks.py", line 849, in call_function
prediction = await anyio.to_thread.run_sync(
File "/root/miniconda3/lib/python3.10/site-packages/anyio/to_thread.py", line 31, in run_sync
return await get_asynclib().run_sync_in_worker_thread(
File "/root/miniconda3/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 937, in run_sync_in_worker_thread
return await future
File "/root/miniconda3/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 867, in run
result = context.run(func, *args)
File "/root/miniconda3/lib/python3.10/site-packages/gradio/utils.py", line 453, in async_iteration
return next(iterator)
File "/workspace/text-generation-webui/modules/text_generation.py", line 189, in generate_reply
output = eval(f"shared.model.generate({', '.join(generate_params)}){cuda}")[0]
File "inf
, nan
or element < 0
I'm using these weights: https://huggingface.co/decapoda-research/llama-13b-hf. IIRC there was an update to transformers that changed the way the converted weights work, but it looks as if that was already fixed.
Just ran into this one as well.
https://github.com/huggingface/transformers/pull/21955 is referencing this error. I think you're right -- it's related.
Are you sure you are using the right weights? The code was refactored so that previously converted weights are no longer valid and, from what I've seen, the model outputs NaNs on the old weights. P.S. The weights were reuploaded some time after the refactor, so they must have been reconverted. Nevertheless, it's probably related.
Your GPU might not support bitsandbytes officially yet.
Hardware requirements:
LLM.int8(): NVIDIA Turing (RTX 20xx; T4) or Ampere GPU (RTX 30xx; A4-A100); (a GPU from 2018 or newer).
8-bit optimizers and quantization: NVIDIA Kepler GPU or newer (>=GTX 78X).
https://github.com/TimDettmers/bitsandbytes
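If you're not sure which bucket your GPU falls into, a quick check (a sketch of my own, not from this thread) is the CUDA compute capability reported by PyTorch: Turing is 7.5 and Ampere is 8.0/8.6, so anything at 7.5 or above should be fine for LLM.int8().

import torch

# Rough check against the bitsandbytes requirements above
# (assumption: Turing = compute capability 7.5, Ampere = 8.x).
if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability(0)
    print(f"{torch.cuda.get_device_name(0)}: compute capability {major}.{minor}")
    if (major, minor) >= (7, 5):
        print("Should support LLM.int8() (Turing or newer).")
    else:
        print("Probably too old for LLM.int8(); expect bitsandbytes errors.")
else:
    print("No CUDA device detected.")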
Just found the same error with LLaMA 7B when playing with the generation parameters. Apparently if top_p = 0 it gives me "probability tensor contains either `inf`, `nan` or element < 0".
Note: using an RTX 3080 with 12 GB VRAM.
top_p = 0 doesn't make sense. Set it to 1 if you don't want to use top_p.
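For anyone wondering why this surfaces as a multinomial error, here is a minimal sketch (my own illustration, not code from the web UI) of the failure mode: sampling draws the next token with torch.multinomial, and that call rejects any probability tensor containing inf, nan, or negative entries. If a filter masks every candidate logit to -inf, softmax turns the whole row into nan and you get exactly this RuntimeError.

import torch

# Assumed scenario: a sampling filter has masked out every candidate token.
logits = torch.full((1, 5), float("-inf"))
probs = torch.softmax(logits, dim=-1)   # softmax over all -inf -> nan
print(probs)                            # tensor([[nan, nan, nan, nan, nan]])
torch.multinomial(probs, num_samples=1)
# RuntimeError: probability tensor contains either `inf`, `nan` or element < 0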
There have been several important updates to the transformers llama support branch. @oobabooga can you please sync the fork you're pulling in requirements.txt? The new conversion is not compatible with your fork.
I am working on it. It seems like the new implementation is required for 4-bit, so I will be forced to update earlier than expected.
I'll start tagging my conversions on huggingface.co, so folks can pull the exact weights that are currently supported by your releases. That will prevent further conversion updates from breaking things for users of those weights.
If I disable do_sample, I no longer get this error.
@oobabooga once you update your fork: I've tagged the conversions compatible with the zphang code as of this moment as "1.0-a1" on Hugging Face. You might want to add this to your README, along with a link to https://huggingface.co/decapoda-research where the weights live, and these instructions on how to download a specific version from the hub: https://huggingface.co/docs/huggingface_hub/v0.13.1/guides/download#from-specific-version
Downloading a branch is easy with the script included in this repository:
python download-model.py decapoda-research/llama-7b-hf --branch 1.0-a1
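For reference, the same tagged revision can also be fetched with huggingface_hub directly, following the guide linked above (a sketch; the repo id and tag are the ones from this thread):

from huggingface_hub import snapshot_download

# revision accepts a branch name, tag, or commit hash; "1.0-a1" is the tag
# mentioned above for the weights compatible with the current llama code.
path = snapshot_download(repo_id="decapoda-research/llama-7b-hf", revision="1.0-a1")
print(path)  # local cache directory containing that revision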
I could make the update immediately, but I am worried about this difference between the previous and the current implementations: https://github.com/huggingface/transformers/pull/21955#issuecomment-1462540212
If I disable do_sample, I no longer get this error.
Same for me, but the output is unintelligible, e.g. Answer: ?? ?? ?? ?? ??
Otherwise I receive the error RuntimeError: probability tensor contains either inf, nan or element < 0
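For context, turning do_sample off makes generate() use greedy decoding, so the torch.multinomial call that raises this error is never reached; if the output is still garbled, that suggests the logits themselves are already bad (e.g. from mismatched weights), as noted earlier in the thread. A minimal sketch of the greedy path (the model path is a placeholder, not from the thread):

from transformers import LlamaForCausalLM, LlamaTokenizer

# "PATH_TO_LLAMA" is a placeholder for a local LLaMA-HF checkpoint.
tokenizer = LlamaTokenizer.from_pretrained("PATH_TO_LLAMA")
model = LlamaForCausalLM.from_pretrained("PATH_TO_LLAMA")

inputs = tokenizer("Hello, how are you?", return_tensors="pt")
# do_sample=False -> greedy decoding, no multinomial sampling step
output_ids = model.generate(**inputs, max_new_tokens=30, do_sample=False)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))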
@zoidbb Noticed the 30b and 65b int4 models have no files in the link you provided. Do you also plan to provide int4 versions of these two models? Thanks!
This needs to be explained a bit better - no LLaMA I tried works with the current version - normal, 8-bit, all give me the same error as here. Do I need to download a new HF model? Is that it, or have I completely misread the comments?
@FartyPants Everything has been tested using the Hugging Face weights from decapoda-research (hi, that's me). If you have trouble getting things working using those known-good weights, let me know ASAP and I can help you figure out whether it's a code, weight, or user error.
@zoidbb I think that many people are downloading the main branches of the *-hf decapoda-research repositories instead of the 1.0-a1 branches. Is it possible to rename main -> old-implementation and 1.0-a1 -> main, or similar? Since the old conversions are basically deprecated now.
It’s a tag, not a branch. Main branch is currently in sync with 1.0-a1
Ah I see, that's good then.
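For anyone wanting to check which refs exist on those repos, huggingface_hub can list the branches and tags (a sketch, not from the thread):

from huggingface_hub import list_repo_refs

refs = list_repo_refs("decapoda-research/llama-7b-hf")
print([b.name for b in refs.branches])  # branches, e.g. ['main']
print([t.name for t in refs.tags])      # tags, e.g. ['1.0-a1']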
If I disable do_sample, I no longer get this error. Same for me, but the output is unintelligible, e.g. Answer: ?? ?? ?? ?? ?? Otherwise receiving error RuntimeError: probability tensor contains either inf, nan or element < 0
Had the same exact issue as you after cloning the models on huggingface. For whatever reason though when I downloaded the model off the magnet link it worked perfectly fine.
Interesting, I have only tried via the magnet link so perhaps I am doing something wrong. Will try again now.
Edit: Using the HFv2 Weights from the magnet link, all is working now. Thanks for the help.
How do I get the HFv2 weights? Thanks in advance.
Got the same issue after chatting for a while
Output generated in 27.77 seconds (0.65 tokens/s, 18 tokens)
Output generated in 28.59 seconds (0.73 tokens/s, 21 tokens)
Output generated in 29.45 seconds (0.51 tokens/s, 15 tokens)
Output generated in 33.81 seconds (0.80 tokens/s, 27 tokens)
Output generated in 27.49 seconds (0.18 tokens/s, 5 tokens)
Output generated in 29.19 seconds (0.27 tokens/s, 8 tokens)
Output generated in 28.65 seconds (0.28 tokens/s, 8 tokens)
Output generated in 31.83 seconds (0.53 tokens/s, 17 tokens)
Output generated in 29.37 seconds (0.17 tokens/s, 5 tokens)
Output generated in 29.62 seconds (0.14 tokens/s, 4 tokens)
Output generated in 33.77 seconds (0.44 tokens/s, 15 tokens)
Exception in thread Thread-33 (gentask):
Traceback (most recent call last):
File "D:\TextGen\venv\4bit\lib\threading.py", line 1016, in _bootstrap_inner
self.run()
File "D:\TextGen\venv\4bit\lib\threading.py", line 953, in run
self._target(*self._args, **self._kwargs)
File "D:\TextGen\text-generation-webui\modules\callbacks.py", line 64, in gentask
ret = self.mfunc(callback=_callback, **self.kwargs)
File "D:\TextGen\text-generation-webui\modules\text_generation.py", line 191, in generate_with_callback
shared.model.generate(**kwargs)
File "D:\TextGen\venv\4bit\lib\site-packages\torch\autograd\grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "D:\TextGen\venv\4bit\lib\site-packages\transformers\generation\utils.py", line 1452, in generate
return self.sample(
File "D:\TextGen\venv\4bit\lib\site-packages\transformers\generation\utils.py", line 2504, in sample
next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
Got my model from https://huggingface.co/decapoda-research/llama-30b-hf-int4/discussions/1#640ea17dade771d6c505c850
top_p = 0 doesn't make sense. Set it to 1 if you don't want to use top_p.
I have tested this and found that top_p = 0 definitely results in the "RuntimeError: probability tensor contains either inf, nan or element < 0" error. In my experience this parameter cannot be set to 0; once I set it to top_p = 0.18, the error was gone.
This issue has been closed due to inactivity for 30 days. If you believe it is still relevant, please leave a comment below.
You could set p to 0.9999. This shouldn't make much of a difference compared to 1.0, but really small probabilities that might lead to nan, inf, etc. are filtered out. Here is more information on the top-p sampling method: https://huggingface.co/blog/how-to-generate
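A quick sketch of what that filtering looks like in transformers terms (my own illustration; TopPLogitsWarper is the processor generate() applies when top_p is set): it keeps the smallest set of tokens whose cumulative probability exceeds top_p and masks the rest to -inf, so at 0.9999 only the near-zero tail is dropped before sampling.

import torch
from transformers import TopPLogitsWarper

# The last two tokens carry essentially zero probability mass.
logits = torch.tensor([[10.0, 9.0, 8.0, -20.0, -30.0]])
warper = TopPLogitsWarper(top_p=0.9999)
dummy_ids = torch.zeros((1, 1), dtype=torch.long)  # input_ids are unused here
filtered = warper(dummy_ids, logits)
print(filtered)
# The two near-zero-probability tokens come back as -inf, so they can no
# longer be sampled and can't contribute numerical edge cases.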
Maybe it's a good idea to replace top_p=1 with top_p=0.9999 automatically in the web UI? What is the highest threshold before this stops working?
I just saw this discussion while I was looking for a solution to the error "probability tensor contains either inf, nan or element < 0". I solved it in my project by changing p to a bit below one. So I downloaded the web UI and tried to reproduce the problem, also using LLaMA 13B, but for me it's working. Was there a special input that provoked the error that I have to use to reproduce it? Or any other special settings? Otherwise the solution is more of a workaround.
I ran into the same issue with the 7B model. It got fixed when I turned off low_resource in eval_configs/minigpt4_eval.yaml, line 8: low_resource: False. But more VRAM is needed (16 GB VRAM).
This is a weird one: I get the same thing with 13B, while 7B works... but when I load the same 13B model with plain transformers as below, it works:
from transformers import LlamaForCausalLM, LlamaTokenizer

tokenizer = LlamaTokenizer.from_pretrained('MY_MODEL')
model = LlamaForCausalLM.from_pretrained('MY_MODEL')
prompt = "Hey, are you conscious? Can you talk to me?"
inputs = tokenizer(prompt, return_tensors="pt")
# Generate (generate() defaults to greedy decoding unless do_sample is enabled)
generate_ids = model.generate(inputs.input_ids, max_length=30)
response = tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
print(response)
Also, if the model is loaded with a LoRA it works... very strange.
Both are using the same conda env.