stable-diffusion-webui

[Bug]: Occasionally getting stuck at end of image generation (VAE)

Open scrumpyman opened this issue 1 year ago • 26 comments

Is there an existing issue for this?

  • [X] I have searched the existing issues and checked the recent builds/commits

What happened?

I've had this happen on rare occasions: image generation gets stuck at the end with no error message, and the GPU stays at 100% usage indefinitely. Some runs stopped and output the image when I clicked the Skip button; for others, nothing could stop it other than closing the process. This could be a costly bug if it happens when you leave for a while after starting a batch.

Things that might be related:

  • Happened during highres fix at 1920x1080.
  • VRAM was always maxed out (even when it shouldn't normally have been; after restarting webui and generating the same image, 8gb/12gb is used during the VAE VRAM spike).
  • Live previews were on but set to "Approx NN" mode.
  • I tried turning on "Tiled VAE" from the "multidiffusion upscaler" extension since it reduces VAE VRAM usage, but it still happened.
  • Happened in batch and single image.

Steps to reproduce the problem

  1. Generate image that uses full VRAM to process VAE
  2. Roll dice
  3. Get stuck at end of generation maybe

What should have happened?

Generation doesn't get stuck, or it gives an OOM error and aborts.
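A minimal sketch of how such a hang could at least be detected from outside, so an unattended batch does not sit at 100% GPU for hours. It only shows the stall check itself; the function name and the sample format are hypothetical, and how the samples are collected (for example by polling the web UI's progress API when it is launched with --api) is an assumption left out here.

```python
def is_stalled(samples, stall_seconds):
    """Decide whether progress has stopped moving.

    samples: list of (timestamp_seconds, progress_fraction), oldest first.
    Returns True when the latest progress value has been unchanged
    for at least stall_seconds.
    """
    if not samples:
        return False
    last_time, last_progress = samples[-1]
    # Walk backwards to find the start of the current unchanged run.
    unchanged_since = last_time
    for timestamp, progress in reversed(samples):
        if progress != last_progress:
            break
        unchanged_since = timestamp
    return last_time - unchanged_since >= stall_seconds


# Progress reached 100% at t=10 and has not changed for 390 seconds.
hung = [(0, 0.5), (10, 1.0), (100, 1.0), (400, 1.0)]
# Progress is still moving.
healthy = [(0, 0.2), (10, 0.4)]

print(is_stalled(hung, stall_seconds=300))
print(is_stalled(healthy, stall_seconds=300))
```

A check like this cannot tell a true hang from a very slow VAE decode, so the threshold would need to be generous for high-resolution jobs.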

Commit where the problem happens

a9fed7c

What platforms do you use to access the UI ?

Windows

What browsers do you use to access the UI ?

Google Chrome

Command Line Arguments

--xformers (happened with --no-half-vae on and off)

List of extensions

2023-03-28 00_58_17-Stable Diffusion

Console logs

I did not think to save logs when it happened, but there was nothing in particular in them; it just freezes. The only thing of note is that when Tiled VAE was on, which adds a progress bar for VAE processing to the logs, that bar was stuck halfway.

Will update if it happens again.

Additional information

gpu: 3060 12gb

scrumpyman avatar Mar 28 '23 05:03 scrumpyman

This is normal to some degree, since the final VAE processing at the end is extremely intensive in both GPU load and VRAM, especially at very high resolutions with lots of steps. Reducing the number of Hires Steps may help.

If you are not running your GPU headless, you may also be getting TDR display resets/recovery after a period of extremely intensive processing, which could explain failing partway through after a while. These limits can be increased in the registry if that is the cause of some of your generation failures. It won't help at all with your GPU locking up under heavy load, but increasing the TDR limits will give your GPU more time to complete work when your display becomes unresponsive.

On my system, I set my TDR limits to 5 minutes, since I occasionally push my GPU extremely hard with compute loads that I'd rather let complete even with an unresponsive display, but such a high value is overkill for most people. I'll just say that your interpretation of "freeze forever" is likely untrue if your GPU and drivers are stable. Assuming you don't eventually hit OOM or TDR, the jobs should eventually complete, even if they take up to 15-20 minutes at the end (note that's the most extreme I've seen when I pushed my GPU too hard; more commonly the last phase completes in 5-30 seconds). With that said, don't use my aggressive TDR settings unless you are prepared for your desktop to freeze for a potentially very long time.

Windows Registry Editor Version 5.00

[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers]
"TdrDelay"=dword:0000012c
"TdrDdiDelay"=dword:0000012c
"TdrDebugMode"=dword:00000002
"TdrLimitTime"=dword:0000012c
"TdrLimitCount"=dword:0000000a

[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers\DCI]
"Timeout"=dword:0000012c


The standard recommendation is to initially increase your TDR limits from 2/5 seconds to 60 seconds if you are experiencing this problem. The settings below are considered safe. https://learn.microsoft.com/en-us/windows-hardware/drivers/display/tdr-registry-keys

Windows Registry Editor Version 5.00

[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers]
"TdrDelay"=dword:0000003c
"TdrDdiDelay"=dword:0000003c

[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers\DCI]
"Timeout"=dword:0000003c
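
For reference, the dword values in both .reg snippets above are hexadecimal, so 0000012c is the 5 minutes mentioned earlier and 0000003c is 60 seconds. A minimal sketch for decoding them before importing anything into the registry (the helper name is hypothetical and it only reads text; it does not touch the registry):

```python
import re

# Matches lines such as: "TdrDelay"=dword:0000012c
DWORD_LINE = re.compile(r'^"(?P<name>[^"]+)"=dword:(?P<hex>[0-9a-fA-F]{8})$')


def decode_reg_dwords(reg_text):
    """Return {value name: integer} for every dword line in a .reg snippet."""
    values = {}
    for line in reg_text.splitlines():
        match = DWORD_LINE.match(line.strip())
        if match:
            values[match.group("name")] = int(match.group("hex"), 16)
    return values


aggressive = decode_reg_dwords('''
"TdrDelay"=dword:0000012c
"TdrDdiDelay"=dword:0000012c
"TdrLimitCount"=dword:0000000a
''')
conservative = decode_reg_dwords('"TdrDelay"=dword:0000003c')

print(aggressive)    # TdrDelay and TdrDdiDelay decode to 300
print(conservative)  # TdrDelay decodes to 60
```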


Actual OOM is a separate issue, though. The only things that have a significant influence on the final VAE stage are the various attention methods in webui, and not using any of the no-half or upcast settings, which some setups require to avoid NaNs. Disabling Live Previews should also reduce peak VRAM slightly, but likely not enough to make a difference.

Cyberbeing avatar Mar 28 '23 08:03 Cyberbeing

My display was not frozen when this happened, to clarify. The longest I've left it stuck was about 5 minutes.

I'd rather not mess with registry, but I guess if it keeps happening and no issue is found here I'll try that.

Another thing of note, this didn't happen for the first 2 months that I've used this webui and I think it only started after I installed WSL and textgen webui. I don't see how that could affect it though, I don't run both at the same time or anything. More likely I've just been doing highres more often recently.

scrumpyman avatar Mar 28 '23 19:03 scrumpyman

I'm having the same issue: image generation sometimes gets stuck at 100%, and I have to restart the webui every time that happens.

CrazyKrow avatar Mar 28 '23 19:03 CrazyKrow

Gets stuck at 71% in the front end and 100% in the backend for me

space-nuko avatar Mar 28 '23 19:03 space-nuko

I have the same issue after the new commits of the last 2-3 days. The generation just gets stuck at the end of an image generation, so you have to close/kill the webui and restart. It happens quite often.

@space-nuko same.

The console shows 100% completion of the gen

Weights loaded in 1.8s (load weights from disk: 0.2s, apply weights to model: 0.7s, load VAE: 0.5s, move model to device: 0.4s).
100%|██████████████████████████████████████████████████████████████████████████████████| 40/40 [00:08<00:00,  4.79it/s]
Total progress: 100%|██████████████████████████████████████████████████████████████████| 40/40 [00:07<00:00,  5.76it/s]

Gradio shows 80%

@Cyberbeing It's not the issue.

neojam avatar Mar 30 '23 12:03 neojam

Having the same issue as well. I can get about three rounds of generations done as the interface gets slower and slower until it finally stops responding.

bparrott78 avatar Mar 31 '23 19:03 bparrott78

Do you guys still have the problem? It's gone on my end. I think the problem was extension related. After updating them all yesterday I don't have the problem anymore. I also had a problem sending images png2txt (infinite loading). It's fixed now as well.

EDIT: Nope, it's still there, but it happens less often than before. It gets stuck on txt2img with 100% GPU usage. Only kill + restart of the webui helps.

neojam avatar Apr 05 '23 17:04 neojam

This is happening to me even just in img2img. I can generate one image, but then it will get stuck at the end on the next attempt. GPU at 100% and I have to close and restart the system. I am not using a VAE or anything special, just a simple img2img. This also happens sometimes when I am trying to inpaint.

gibsonfan2332 avatar Apr 13 '23 19:04 gibsonfan2332

Same here, it happens especially when I enable the controlnet plugin.

darcula1993 avatar Apr 24 '23 08:04 darcula1993

Same issue here, using a 3060 / 12 GB. I reverted to this commit: a9eab236d7e8afa4d6205127904a385b2c43bb24

Now at least I can generate images again

andypotato avatar May 03 '23 03:05 andypotato

Same problem: 100% GPU usage and it slows down. The problem did not exist before, and I don't know what triggered it.

DeonHolo avatar May 07 '23 19:05 DeonHolo

I solved the problem by reducing the number of previews generated. Instead of setting it to "every 3 steps" I slightly increased the value to "every 5 steps".

I haven't had any issues since.

andypotato avatar May 07 '23 23:05 andypotato

Did not help. I had mine set to 10 anyway. The issue is getting really annoying...

neojam avatar May 21 '23 09:05 neojam

Same, I even downgraded my GPU because I thought it was the problem. Turns out it wasn't...

DeonHolo avatar May 22 '23 12:05 DeonHolo

Same issue here, started randomly 2 days ago after pulling the newest release..

NotPulkz avatar May 25 '23 21:05 NotPulkz

Same here. I downloaded the newest release about 2 days ago (and again today) and this keeps happening. It seems completely random, sometimes on the easiest image to generate. I never had any issue like this prior to a few days ago. It just hangs. I can still use my screen and do other things; the cmd line just shows it doing nothing, sitting at either 100% or 96%, and it never finishes. One time I left it for 15 mins and nothing happened. It's not a lock-up, as I can still use programs like GIMP; the webui just stops responding and needs to be restarted.

JeffreyBull76 avatar May 28 '23 01:05 JeffreyBull76

Can confirm. This issue has happened to me since the "1.2" update and the official transition to torch 2; it did not exist when I was using the March release and had manually updated to Torch 2 myself. The most recent 1.3 pull seems to have made this worse: I used it for about 2 hours today, and it randomly gets stuck about every 10 generations.

VantomPayne avatar May 28 '23 11:05 VantomPayne

Still getting this issue. Has anyone figured it out yet? I was going to try reverting to an old torch install, but it seems like a lot of hassle for possibly no outcome.

JeffreyBull76 avatar May 29 '23 01:05 JeffreyBull76

https://old.reddit.com/r/StableDiffusion/comments/13hm94c/webui_suddenly_stallingnot_finishing/

I did this and it seems to be gone now.

DeonHolo avatar May 29 '23 09:05 DeonHolo

Same problem here: generation is fast, but it gets stuck at 100% in img2img.

It takes 4 min AFTER generation reaches 100% to produce the image!

brunobpsrpg avatar Jun 22 '23 01:06 brunobpsrpg

In my case, it seems token merging makes the hang happen a lot more often. After setting token merging to 0, the VAE hang has been greatly mitigated on my PC.

P.S. I had also turned off live preview, but that alone didn't seem to help, so I turned off token merging as well. It is possible that the combination of both is necessary.

BlackRice avatar Jun 24 '23 15:06 BlackRice

Upgraded two days ago, and now it happens every time.

mikesimone avatar Jul 26 '23 15:07 mikesimone

For some reason, when progress says 100% and holds like this, I click into the CMD window and press Enter, and it magically gets unstuck. Looks weird.

ArcticNoise avatar Oct 09 '23 15:10 ArcticNoise

I found out something "important" about this case, which still exists in 2024, even on my RTX4090 with 24 GB.

The issue occurs when you "steal" graphics card memory by opening a graphics programme like "Affinity" and loading a bigger picture while A1111 is running and squeezing as much out of the card as it can. In one example of mine, it happened during a t2i with the following settings: 1024x512 picture with 40 steps and highresfix x 2.5, 60 steps, Denoise 0.42, 4xUniversalUpscalerV2-Sharp

When A1111 hangs at 100%: if you then free the memory by closing "Affinity Photo 2", the process finishes immediately.

I guess it has something to do with A1111 no longer having, when it finishes the process, the memory it expected to have when it started.
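
A minimal sketch of checking for that situation before a heavy job, assuming PyTorch with CUDA is what the web UI runs on. The helper names and the 10% safety margin are hypothetical; `torch.cuda.mem_get_info()` reports free and total device memory in bytes, which includes memory taken by other applications. The 8 GiB figure is the VAE spike reported in the original post, used here only as an example.

```python
GIB = 1024 ** 3


def has_vram_headroom(free_bytes, required_bytes, safety_margin=0.10):
    """Return True if free VRAM covers the expected need plus a safety margin."""
    return free_bytes >= required_bytes * (1.0 + safety_margin)


def query_free_vram():
    """Return (free, total) bytes for the current CUDA device, or None without CUDA."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.mem_get_info()


info = query_free_vram()
if info is None:
    print("No CUDA device available; skipping the live check.")
else:
    free_bytes, total_bytes = info
    print(f"free: {free_bytes / GIB:.1f} GiB of {total_bytes / GIB:.1f} GiB")
    print("enough for an 8 GiB spike:", has_vram_headroom(free_bytes, 8 * GIB))
```

A check like this only reflects the moment it runs; another program can still claim memory afterwards, which matches the scenario described above.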

Mark-Reiser avatar Mar 13 '24 21:03 Mark-Reiser

I solved this bug by activating "Disable prompt token counters" in settings.

I noticed that the generation freezes when VRAM is full and the orange arrows icon replaces the token counter (because you typed something in the prompt during image generation).

Thorgrimar avatar Mar 23 '24 22:03 Thorgrimar

Same problem doing batch generation with "Prompts from file or textbox". I do bulk generations for my work, but it stops suddenly and randomly without any error message.

I will try the token setting. I hope the problem will go away.

tanuki-create avatar Mar 24 '24 13:03 tanuki-create