text-generation-webui
LLAMA 13B HF: RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
I'm not sure if this is a problem with the weights or the system, but when I try to generate text, it gives me this error.
File "/root/miniconda3/lib/python3.10/site-packages/gradio/routes.py", line 374, in run_predict
output = await app.get_blocks().process_api(
File "/root/miniconda3/lib/python3.10/site-packages/gradio/blocks.py", line 1017, in process_api
result = await self.call_function(
File "/root/miniconda3/lib/python3.10/site-packages/gradio/blocks.py", line 849, in call_function
prediction = await anyio.to_thread.run_sync(
File "/root/miniconda3/lib/python3.10/site-packages/anyio/to_thread.py", line 31, in run_sync
return await get_asynclib().run_sync_in_worker_thread(
File "/root/miniconda3/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 937, in run_sync_in_worker_thread
return await future
File "/root/miniconda3/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 867, in run
result = context.run(func, *args)
File "/root/miniconda3/lib/python3.10/site-packages/gradio/utils.py", line 453, in async_iteration
return next(iterator)
File "/workspace/text-generation-webui/modules/text_generation.py", line 189, in generate_reply
output = eval(f"shared.model.generate({', '.join(generate_params)}){cuda}")[0]
File "inf
, nan
or element < 0
I'm using these weights: https://huggingface.co/decapoda-research/llama-13b-hf. IIRC there was an update to transformers that changed the way the converted weights work, but it looks as if that was already fixed.
Just ran into this one as well.
https://github.com/huggingface/transformers/pull/21955 is referencing this error. I think you're right -- it's related.
Are you sure you are using the right weights? The code was refactored so that previously converted weights are no longer valid and, from what I've seen, the model outputs NaNs on the old weights. P.S. The weights were reuploaded some time after the refactor, so they must have been reconverted. Nevertheless, it's probably related.
Your GPU might not support bitsandbytes officially yet.
Hardware requirements:
LLM.int8(): NVIDIA Turing (RTX 20xx; T4) or Ampere GPU (RTX 30xx; A4-A100); (a GPU from 2018 or newer).
8-bit optimizers and quantization: NVIDIA Kepler GPU or newer (>=GTX 78X).
https://github.com/TimDettmers/bitsandbytes
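If you're not sure which bucket your GPU falls into, a quick check (a sketch of my own, not from this thread) is the CUDA compute capability reported by PyTorch: Turing is 7.5 and Ampere is 8.0/8.6, so anything at 7.5 or above should be fine for LLM.int8().

import torch

# Rough check against the bitsandbytes requirements above
# (assumption: Turing = compute capability 7.5, Ampere = 8.x).
if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability(0)
    print(f"{torch.cuda.get_device_name(0)}: compute capability {major}.{minor}")
    if (major, minor) >= (7, 5):
        print("Should support LLM.int8() (Turing or newer).")
    else:
        print("Probably too old for LLM.int8(); expect bitsandbytes errors.")
else:
    print("No CUDA device detected.")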
Just found the same error with LLaMA 7B when playing with the generation parameters. Apparently if top_p = 0 it gives me "probability tensor contains either `inf`, `nan` or element < 0".
Note: using an RTX 3080 with 12 GB VRAM.
top_p = 0 doesn't make sense. Set it to 1 if you don't want to use top_p.
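For anyone wondering why this surfaces as a multinomial error, here is a minimal sketch (my own illustration, not code from the web UI) of the failure mode: sampling draws the next token with torch.multinomial, and that call rejects any probability tensor containing inf, nan, or negative entries. If a filter masks every candidate logit to -inf, softmax turns the whole row into nan and you get exactly this RuntimeError.

import torch

# Assumed scenario: a sampling filter has masked out every candidate token.
logits = torch.full((1, 5), float("-inf"))
probs = torch.softmax(logits, dim=-1)   # softmax over all -inf -> nan
print(probs)                            # tensor([[nan, nan, nan, nan, nan]])
torch.multinomial(probs, num_samples=1)
# RuntimeError: probability tensor contains either `inf`, `nan` or element < 0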
There have been several important updates to the transformers llama support branch. @oobabooga can you please sync the fork you're pulling in requirements.txt? The new conversion is not compatible with your fork.
I am working on it. It seems like the new implementation is required for 4-bit, so I will be forced to update earlier than expected.
I'll start tagging my conversions on huggingface.co, so folks can pull the exact weights that are currently supported by your releases. That will prevent further conversion updates from breaking things for users of those weights.
If I disable do_sample, I no longer get this error.
@oobabooga once you update your fork: I've tagged the conversions compatible with the zphang code as of this moment as "1.0-a1" on Hugging Face. You might want to add this to your README, along with a link to https://huggingface.co/decapoda-research where the weights live, and these instructions on how to download a specific version from the hub: https://huggingface.co/docs/huggingface_hub/v0.13.1/guides/download#from-specific-version
Downloading a branch is easy with the script included in this repository:
python download-model.py decapoda-research/llama-7b-hf --branch 1.0-a1
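For reference, the same tagged revision can also be fetched with huggingface_hub directly, following the guide linked above (a sketch; the repo id and tag are the ones from this thread):

from huggingface_hub import snapshot_download

# revision accepts a branch name, tag, or commit hash; "1.0-a1" is the tag
# mentioned above for the weights compatible with the current llama code.
path = snapshot_download(repo_id="decapoda-research/llama-7b-hf", revision="1.0-a1")
print(path)  # local cache directory containing that revision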
I could make the update immediately, but I am worried about this difference between the previous and the current implementations: https://github.com/huggingface/transformers/pull/21955#issuecomment-1462540212
If I disable do_sample, I no longer get this error.
Same for me, but the output is unintelligible, e.g. Answer: ?? ?? ?? ?? ??
Otherwise I receive the error RuntimeError: probability tensor contains either inf, nan or element < 0
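For context, turning do_sample off makes generate() use greedy decoding, so the torch.multinomial call that raises this error is never reached; if the output is still garbled, that suggests the logits themselves are already bad (e.g. from mismatched weights), as noted earlier in the thread. A minimal sketch of the greedy path (the model path is a placeholder, not from the thread):

from transformers import LlamaForCausalLM, LlamaTokenizer

# "PATH_TO_LLAMA" is a placeholder for a local LLaMA-HF checkpoint.
tokenizer = LlamaTokenizer.from_pretrained("PATH_TO_LLAMA")
model = LlamaForCausalLM.from_pretrained("PATH_TO_LLAMA")

inputs = tokenizer("Hello, how are you?", return_tensors="pt")
# do_sample=False -> greedy decoding, no multinomial sampling step
output_ids = model.generate(**inputs, max_new_tokens=30, do_sample=False)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))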
@zoidbb Noticed the 30b and 65b int4 models have no files in the link you provided. Do you also plan to provide int4 versions of these two models? Thanks!
This needs to be explained a bit better - no LLaMA I tried works with the current version - normal, 8-bit, all give me the same error as here. Do I need to download a new HF model? Is that it, or have I completely misread the comments?
@FartyPants Everything has been tested using the Hugging Face weights from decapoda-research (hi, that's me). If you have trouble getting things working using those known-good weights, let me know ASAP and I can help you figure out whether it's a code, weight, or user error.
@zoidbb I think that many people are downloading the main branches of the *-hf decapoda-research repositories instead of the 1.0-a1 branches. Is it possible to rename main -> old-implementation and 1.0-a1 -> main, or similar? Since the old conversions are basically deprecated now.
It’s a tag, not a branch. Main branch is currently in sync with 1.0-a1
Ah I see, that's good then.
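For anyone wanting to check which refs exist on those repos, huggingface_hub can list the branches and tags (a sketch, not from the thread):

from huggingface_hub import list_repo_refs

refs = list_repo_refs("decapoda-research/llama-7b-hf")
print([b.name for b in refs.branches])  # branches, e.g. ['main']
print([t.name for t in refs.tags])      # tags, e.g. ['1.0-a1']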
If I disable do_sample, I no longer get this error. Same for me, but the output is unintelligible, e.g. Answer: ?? ?? ?? ?? ?? Otherwise receiving error RuntimeError: probability tensor contains either inf, nan or element < 0
Had the same exact issue as you after cloning the models on huggingface. For whatever reason though when I downloaded the model off the magnet link it worked perfectly fine.
Interesting, I have only tried via the magnet link so perhaps I am doing something wrong. Will try again now.
Edit: Using the HFv2 Weights from the magnet link, all is working now. Thanks for the help.
How do I get the HFv2 weights? Thanks in advance.
Got the same issue after chatting for a while
Output generated in 27.77 seconds (0.65 tokens/s, 18 tokens)
Output generated in 28.59 seconds (0.73 tokens/s, 21 tokens)
Output generated in 29.45 seconds (0.51 tokens/s, 15 tokens)
Output generated in 33.81 seconds (0.80 tokens/s, 27 tokens)
Output generated in 27.49 seconds (0.18 tokens/s, 5 tokens)
Output generated in 29.19 seconds (0.27 tokens/s, 8 tokens)
Output generated in 28.65 seconds (0.28 tokens/s, 8 tokens)
Output generated in 31.83 seconds (0.53 tokens/s, 17 tokens)
Output generated in 29.37 seconds (0.17 tokens/s, 5 tokens)
Output generated in 29.62 seconds (0.14 tokens/s, 4 tokens)
Output generated in 33.77 seconds (0.44 tokens/s, 15 tokens)
Exception in thread Thread-33 (gentask):
Traceback (most recent call last):
File "D:\TextGen\venv\4bit\lib\threading.py", line 1016, in _bootstrap_inner
self.run()
File "D:\TextGen\venv\4bit\lib\threading.py", line 953, in run
self._target(*self._args, **self._kwargs)
File "D:\TextGen\text-generation-webui\modules\callbacks.py", line 64, in gentask
ret = self.mfunc(callback=_callback, **self.kwargs)
File "D:\TextGen\text-generation-webui\modules\text_generation.py", line 191, in generate_with_callback
shared.model.generate(**kwargs)
File "D:\TextGen\venv\4bit\lib\site-packages\torch\autograd\grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "D:\TextGen\venv\4bit\lib\site-packages\transformers\generation\utils.py", line 1452, in generate
return self.sample(
File "D:\TextGen\venv\4bit\lib\site-packages\transformers\generation\utils.py", line 2504, in sample
next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
Got my model from https://huggingface.co/decapoda-research/llama-30b-hf-int4/discussions/1#640ea17dade771d6c505c850
top_p = 0 doesn't make sense. Set it to 1 if you don't want to use top_p.
I have tested this and found that top_p = 0 definitely results in the "RuntimeError: probability tensor contains either inf, nan or element < 0" error. In my experience this parameter cannot be set to 0; once I set it to top_p = 0.18, the error was gone.
This issue has been closed due to inactivity for 30 days. If you believe it is still relevant, please leave a comment below.
You could set p to 0.9999. This shouldn't make much of a difference compared to 1.0, but really small probabilities that might lead to nan, inf, etc. are filtered out. Here is more information on the top-p sampling method: https://huggingface.co/blog/how-to-generate
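A quick sketch of what that filtering looks like in transformers terms (my own illustration; TopPLogitsWarper is the processor generate() applies when top_p is set): it keeps the smallest set of tokens whose cumulative probability exceeds top_p and masks the rest to -inf, so at 0.9999 only the near-zero tail is dropped before sampling.

import torch
from transformers import TopPLogitsWarper

# The last two tokens carry essentially zero probability mass.
logits = torch.tensor([[10.0, 9.0, 8.0, -20.0, -30.0]])
warper = TopPLogitsWarper(top_p=0.9999)
dummy_ids = torch.zeros((1, 1), dtype=torch.long)  # input_ids are unused here
filtered = warper(dummy_ids, logits)
print(filtered)
# The two near-zero-probability tokens come back as -inf, so they can no
# longer be sampled and can't contribute numerical edge cases.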
Maybe it's a good idea to replace top_p=1 with top_p=0.9999 automatically in the web UI? What is the highest threshold before this stops working?
I just saw this discussion while I was looking for a solution to the error "probability tensor contains either inf, nan or element < 0". I solved it in my project by changing p to a bit below one. So I downloaded the web UI and tried to reproduce the problem, also using LLaMA 13B, but for me it's working. Was there a special input that provoked the error that I have to use to reproduce it? Or any other special settings? Otherwise the solution is more of a workaround.
I ran into the same issue with the 7B model. It got fixed when I turned off low_resource in eval_configs/minigpt4_eval.yaml, line 8: low_resource: False. But more VRAM is needed (16 GB VRAM).
This is a weird one: I get the same thing with 13B, while 7B works... but when I load the same 13B model with plain transformers as below, it works:
from transformers import LlamaForCausalLM, LlamaTokenizer

tokenizer = LlamaTokenizer.from_pretrained('MY_MODEL')
model = LlamaForCausalLM.from_pretrained('MY_MODEL')
prompt = "Hey, are you conscious? Can you talk to me?"
inputs = tokenizer(prompt, return_tensors="pt")
# Generate (generate() defaults to greedy decoding unless do_sample is enabled)
generate_ids = model.generate(inputs.input_ids, max_length=30)
response = tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
print(response)
Also, if the model is loaded with a LoRA it works... very strange.
Both are using the same conda env.