
[Bug]: A tensor with all NaNs was produced in Unet

Open GreenTeaBD opened this issue 1 year ago • 207 comments

Is there an existing issue for this?

  • [X] I have searched the existing issues and checked the recent builds/commits

What happened?

I did a fresh reinstall of automatic1111 today. Normal models work, but depth models do not. They all have the corresponding yaml files and were working on my older install of automatic1111.

When I try to use a depth model, I get the error shown in the logs; it tells me to use --no-half to fix it, which is not ideal, but I have plenty of VRAM. If I use --no-half, though, it still fails with a different error (also in the logs).

Edit: Since the logs mention that my GPU may not support the half type: my GPU is a 4090.
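
(A minimal fp16 sanity check, in case anyone wants to rule out the card itself. This is just a rough probe of half-precision math on the GPU, run inside the webui venv; it is not the webui code path.)

import torch

# Rough check: do a half-precision matmul on the GPU and look for NaNs.
x = torch.randn(1024, 1024, device="cuda", dtype=torch.float16)
w = torch.randn(1024, 1024, device="cuda", dtype=torch.float16)
y = x @ w
print(torch.isnan(y).any().item())  # expect False on a card with working fp16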

Steps to reproduce the problem

Launch webui.bat, go to img2img, load a depth model, feed it a source image, hit generate, and it crashes.

What should have happened?

img2img should have generated an image

Commit where the problem happens

Commit hash: 0f5dbfffd0b7202a48e404d8e74b5cc9a3e5b135

What platforms do you use to access the UI?

Windows

What browsers do you use to access the UI?

Mozilla Firefox

Command Line Arguments

--xformers --disable-safe-unpickle

Installed extensions: deforum-for-automatic1111-webui, sd_save_intermediate_images, stable-diffusion-webui-Prompt_Generator, ultimate-upscale-for-automatic1111

Additional information, context and logs

Without --no-half:

0%| | 0/9 [00:00<?, ?it/s]
Error completing request
Arguments: ('task(z8s2gece94605h3)', 0, 'skscody', '', [], <PIL.Image.Image image mode=RGBA size=1920x1080 at 0x280023099C0>, None, None, None, None, None, None, 20, 0, 4, 0, 1, False, False, 1, 1, 9, 0.4, -1.0, -1.0, 0, 0, 0, False, 512, 512, 0, 0, 32, 0, '', '', 0, False, 'Denoised', 5.0, 0.0, 0.0, False, 'mp4', 2.0, '2', False, 0.0, False, '\n• CFG Scale should be 2 or lower.\n', True, True, '', '', True, 50, True, 1, 0, False, 4, 1, 'Recommended settings: Sampling Steps: 80-100, Sampler: Euler a, Denoising strength: 0.8', 128, 8, ['left', 'right', 'up', 'down'], 1, 0.05, 128, 4, 0, ['left', 'right', 'up', 'down'], False, False, False, False, '', 'Will upscale the image by the selected scale factor; use width and height sliders to set tile size', 64, 0, 2, '', None, '720:576', False, 1, '', 0, '', True, False, False, 'Deforum v0.5-webui-beta', 'This script is deprecated. Please use the full Deforum extension instead.\nUpdate instructions:', 'github.com/deforum-art/deforum-for-automatic1111-webui/blob/automatic1111-webui/README.md', 'discord.gg/deforum', 'Will upscale the image depending on the selected target size type', 512, 8, 32, 64, 0.35, 32, 0, True, 0, False, 8, 0, 0, 2048, 2048, 2) {}
Traceback (most recent call last):
  File "I:\stable-diffusion\stable-diffusion-webui\modules\call_queue.py", line 56, in f
    res = list(func(*args, **kwargs))
  File "I:\stable-diffusion\stable-diffusion-webui\modules\call_queue.py", line 37, in f
    res = func(*args, **kwargs)
  File "I:\stable-diffusion\stable-diffusion-webui\modules\img2img.py", line 148, in img2img
    processed = process_images(p)
  File "I:\stable-diffusion\stable-diffusion-webui\modules\processing.py", line 480, in process_images
    res = process_images_inner(p)
  File "I:\stable-diffusion\stable-diffusion-webui\modules\processing.py", line 609, in process_images_inner
    samples_ddim = p.sample(conditioning=c, unconditional_conditioning=uc, seeds=seeds, subseeds=subseeds, subseed_strength=p.subseed_strength, prompts=prompts)
  File "I:\stable-diffusion\stable-diffusion-webui\modules\processing.py", line 1016, in sample
    samples = self.sampler.sample_img2img(self, self.init_latent, x, conditioning, unconditional_conditioning, image_conditioning=self.image_conditioning)
  File "I:\stable-diffusion\stable-diffusion-webui\modules\sd_samplers.py", line 518, in sample_img2img
    samples = self.launch_sampling(t_enc + 1, lambda: self.func(self.model_wrap_cfg, xi, extra_args={
  File "I:\stable-diffusion\stable-diffusion-webui\modules\sd_samplers.py", line 447, in launch_sampling
    return func()
  File "I:\stable-diffusion\stable-diffusion-webui\modules\sd_samplers.py", line 518, in <lambda>
    samples = self.launch_sampling(t_enc + 1, lambda: self.func(self.model_wrap_cfg, xi, extra_args={
  File "I:\stable-diffusion\stable-diffusion-webui\venv\lib\site-packages\torch\autograd\grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "I:\stable-diffusion\stable-diffusion-webui\repositories\k-diffusion\k_diffusion\sampling.py", line 145, in sample_euler_ancestral
    denoised = model(x, sigmas[i] * s_in, **extra_args)
  File "I:\stable-diffusion\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "I:\stable-diffusion\stable-diffusion-webui\modules\sd_samplers.py", line 354, in forward
    devices.test_for_nans(x_out, "unet")
  File "I:\stable-diffusion\stable-diffusion-webui\modules\devices.py", line 136, in test_for_nans
    raise NansException(message)
modules.devices.NansException: A tensor with all NaNs was produced in Unet. This could be either because there's not enough precision to represent the picture, or because your video card does not support half type. Try using --no-half commandline argument to fix this.

With --no-half:

0%| | 0/9 [00:00<?, ?it/s]
Error completing request
Arguments: ('task(5014z0igs0omk0j)', 0, 'skscody', '', [], <PIL.Image.Image image mode=RGBA size=1920x1080 at 0x203CDE726B0>, None, None, None, None, None, None, 20, 0, 4, 0, 1, False, False, 1, 1, 7, 0.4, -1.0, -1.0, 0, 0, 0, False, 512, 910, 0, 0, 32, 0, '', '', 0, False, 'Denoised', 5.0, 0.0, 0.0, False, 'mp4', 2.0, '2', False, 0.0, False, '\n• CFG Scale should be 2 or lower.\n', True, True, '', '', True, 50, True, 1, 0, False, 4, 1, 'Recommended settings: Sampling Steps: 80-100, Sampler: Euler a, Denoising strength: 0.8', 128, 8, ['left', 'right', 'up', 'down'], 1, 0.05, 128, 4, 0, ['left', 'right', 'up', 'down'], False, False, False, False, '', 'Will upscale the image by the selected scale factor; use width and height sliders to set tile size', 64, 0, 2, '', None, '720:576', False, 1, '', 0, '', True, False, False, 'Deforum v0.5-webui-beta', 'This script is deprecated. Please use the full Deforum extension instead.\nUpdate instructions:', 'github.com/deforum-art/deforum-for-automatic1111-webui/blob/automatic1111-webui/README.md', 'discord.gg/deforum', 'Will upscale the image depending on the selected target size type', 512, 8, 32, 64, 0.35, 32, 0, True, 0, False, 8, 0, 0, 2048, 2048, 2) {}
Traceback (most recent call last):
  [frames identical to the traceback above]
modules.devices.NansException: A tensor with all NaNs was produced in Unet.
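
For context, the exception comes from modules/devices.py. Roughly speaking (this is a paraphrase for illustration, not the actual webui source), the check looks something like:

import torch

class NansException(Exception):
    pass

def test_for_nans(x: torch.Tensor, where: str) -> None:
    # Paraphrased: if the UNet (or VAE) output came back as NaNs, abort with a
    # hint instead of silently decoding a black image.
    if torch.isnan(x).all():
        raise NansException(f"A tensor with all NaNs was produced in {where}.")

The --disable-nan-check flag mentioned further down presumably just skips this check, which is why it makes the error message go away without fixing the black or garbled output.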

GreenTeaBD avatar Jan 19 '23 09:01 GreenTeaBD

I was just trying to figure out why the hell I keep getting this as well.

Pedroman1 avatar Jan 19 '23 09:01 Pedroman1

I may have taken too long typing this; I see there was a commit about half an hour ago that seems possibly relevant. Going to go try it and see.

Edit: it does not help :(

GreenTeaBD avatar Jan 19 '23 09:01 GreenTeaBD

I just found that it does work with the normal 512-depth-ema model, which means it might be related to https://github.com/AUTOMATIC1111/stable-diffusion-webui/issues/6891

These are depth models I trained myself, trained with an extremely high learning rate (it's what works best for what I'm trying to do), but, like I was saying, these models worked in an earlier version of automatic1111.

I did a hard reset all the way back to https://github.com/AUTOMATIC1111/stable-diffusion-webui/commit/4af3ca5 (the version I had before, which was working) and it does work on that one. I'd go through the whole range to figure out exactly where this breaks, but I won't have time for about another week; Lunar New Year is going on.

GreenTeaBD avatar Jan 19 '23 10:01 GreenTeaBD

Let me save you some time: 9991967f40120b88a1dc925fdf7d747d5e016888. Run with --disable-nan-check, but FYI, this shouldn't happen normally; it means the output of the model is all NaNs.

mezotaken avatar Jan 19 '23 15:01 mezotaken

I have this issue with the standard SD 2.1 model; using --disable-nan-check removes the error, but the output is black.

ghost avatar Jan 19 '23 17:01 ghost

Then something else in an earlier commit is breaking it; what we see in the error message is just a symptom.

mezotaken avatar Jan 19 '23 18:01 mezotaken

Yo guys, I think this is a bug in xformers: https://github.com/facebookresearch/xformers/issues/631

arpowers avatar Jan 19 '23 21:01 arpowers

I just started to get this error on the last git pull too, running a model I've been using just fine. I have a 3060 12GB card, which I think supports half precision... Using --no-half gives the error "A tensor with all NaNs was produced in Unet." Using --disable-nan-check allows it to work again, but it just produces junk (and sometimes a black image). Something in a recent commit broke it. Going to hunt it down...

Jonseed avatar Jan 19 '23 22:01 Jonseed

This is a hard bug... I've removed xformers, and gone back to previous commits, and I'm still getting junk outputs...

Jonseed avatar Jan 20 '23 00:01 Jonseed

Interesting. If I switch to another model, generate an image, and then switch back to the model I want, the error goes away and I get good outputs. So there is something about switching models that makes it work again... (This is on the latest commit, b165e34, with xformers on.)

Jonseed avatar Jan 20 '23 00:01 Jonseed

@Jonseed which models are you using? And what do you mean by "junk output"?

The "black" output is the NaN issue from xformers, but there is an even more dubious bug causing bad output, as you mentioned... I suspect it involves one of the commonly used models; it would be useful to know more.

arpowers avatar Jan 20 '23 03:01 arpowers

@arpowers it was Protogen Infinity. I switched to SD1.5-pruned-emaonly and then back to Protogen Infinity, and it worked again (good outputs). I haven't got a black image since. I only got black images with the junk (garbled) outputs.

Jonseed avatar Jan 20 '23 04:01 Jonseed

This bug is driving me crazy. It happens on certain models: the glitch gives you an error, and then everything afterwards is garbled junk. This bug completely bricks these models. How do I revert to an older commit to fix this?

opy188 avatar Jan 20 '23 06:01 opy188

@opy188 Set up a new webui folder and, after cloning, switch to any older commit with

git checkout 1234567

and just stick with that version for the type of model you need

ClashSAN avatar Jan 20 '23 06:01 ClashSAN

sorry

ClashSAN avatar Jan 20 '23 06:01 ClashSAN

@ClashSAN I can't seem to git checkout to that different branch. Is there anything else I need to type?

opy188 avatar Jan 20 '23 07:01 opy188

the "1234567" is where you put your chosen commit.

git checkout 4af3ca5393151d61363c30eef4965e694eeac15e

ClashSAN avatar Jan 20 '23 08:01 ClashSAN

Also getting this, with Protogen 3.4 only.

riade3788 avatar Jan 20 '23 08:01 riade3788

I went back several commits, trying half a dozen, and still had problems... Not sure which commit is ok.

Jonseed avatar Jan 20 '23 13:01 Jonseed

@opy188 @riade3788 did you try the trick of switching to another model, and then back to your desired model? Does that fix it for you?

Jonseed avatar Jan 20 '23 13:01 Jonseed

I wonder if the junk garbled output is related to this bug: "Someone discovered a few days ago that merging models can break the position id layer of the text encoder. It gets converted from int64 to a floating point value and then forced back to int for inference which may cause problems due to floating point errors..."

But that wouldn't explain why switching to a different model, and then back to the merged model makes it work fine again...
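
(A rough way to check a checkpoint for that kind of corruption, if anyone wants to try it. This is only a sketch: the key name is what SD 1.x-style checkpoints usually use, and the file name is a placeholder.)

import torch

# Load the checkpoint on the CPU and look at the text encoder's position_ids buffer.
ckpt = torch.load("protogen-infinity.ckpt", map_location="cpu")  # placeholder path
sd = ckpt.get("state_dict", ckpt)

key = "cond_stage_model.transformer.text_model.embeddings.position_ids"  # assumed SD 1.x key
if key in sd:
    ids = sd[key]
    print(ids.dtype, ids)
    # Position ids should be the exact integers 0..76; drift after a merge is suspicious.
    print("max drift from integers:", (ids.float() - ids.float().round()).abs().max().item())
else:
    print("key not found; different model layout")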

Jonseed avatar Jan 20 '23 15:01 Jonseed

Getting this with 2.1, but it was working fine with 2.0.

saif-ellafi avatar Jan 20 '23 17:01 saif-ellafi

It seems the issue is with xformers; I can run without xformers on any commit from the latest (which is 12 hours old this time) back to last week's.

DearDhruv avatar Jan 20 '23 21:01 DearDhruv

I'm running with xformers just fine, except that I have to switch to a different model and back for Protogen Infinity to generate good outputs.

Jonseed avatar Jan 20 '23 22:01 Jonseed

When I boot up the server and generate "a cat", I either get the NaN error or I get this: [attached image: cat1]

Then I switch to another model and back to Protogen Infinity, generate "a cat", and get this: [attached image: cat2]

This is with xformers turned on.

Jonseed avatar Jan 20 '23 22:01 Jonseed

Can confirm that as of about a day and a half ago, every third gen I run gives NaN errors. Even using a batch size of 1. Very annoying. Can also confirm it happens regardless of whether or not --xformers is used

swalsh76 avatar Jan 20 '23 23:01 swalsh76

Another curious thing I noticed this morning is that I'm unable to reproduce past images. When I upload a past image with all the same generation parameters and send it to txt2img to regenerate, the result is somewhat similar but clearly not the same as what I generated just a day or two ago. This isn't just minor xformers indeterminism either; it's quite different. Not sure if this is related to the same bug...

Jonseed avatar Jan 21 '23 18:01 Jonseed

Here's another interesting thing I've noticed. If I write "a cat" or "_a cat" or "'a cat" or "`a cat" I get junk output. If I write ",a cat" or "&a cat" I get the NaN error. Even if I just change a space, "~a cat" produces junk output, but "~acat" gives NaN error.

So the junk output and the NaN error seem to be related somehow, and the specific characters in the prompt affect which one you get. Is it the bug where, in some merged models, the position-id layer of the text encoder is broken? And why does switching to another model and then back to Protogen seem to fix it and produce good outputs again? (Although I still can't reproduce past images...)

Note: after switching models and then back to Protogen, I can generate with ",a cat" or "&a cat" without a NaN error, so there seems to be a bug in the way the repo loads models when the server is initialized, which is different from what happens when switching between models.

Jonseed avatar Jan 21 '23 19:01 Jonseed

I get this error on my custom 2.1 models from EveryDream2Trainer, and sometimes also on the 2.1 base model. Bisecting revealed 0c3feb202c5714abd50d879c1db2cd9a71ce93e3 to be the cause. It seems disabling the initialization isn't a good idea for certain models.

Last good commit is a0ef416aa769022ce9e97dcc87f88a0ae9e6cc58
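
For anyone else hunting this down, a rough git bisect workflow (just a sketch; it assumes 4af3ca5 from earlier in the thread is a known-good commit and your current checkout is bad):

git bisect start
git bisect bad HEAD
git bisect good 4af3ca5393151d61363c30eef4965e694eeac15e

After each commit git checks out, relaunch the webui, test the failing model, and run git bisect good or git bisect bad accordingly; when git reports the first bad commit, finish with git bisect reset.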

ata4 avatar Jan 21 '23 22:01 ata4

@ata4 But a0ef416 is the commit AFTER 0c3feb2? If 0c3feb2 is the problem, wouldn't the last good commit be the one before that, 76a21b9?

Jonseed avatar Jan 21 '23 22:01 Jonseed