
[Bug]: "no gradient found for the trained weight after backward() for 10 steps in a row" while training a hypernetwork.

Open enn-nafnlaus opened this issue 3 years ago • 11 comments

Is there an existing issue for this?

  • [X] I have searched the existing issues and checked the recent builds/commits

What happened?

AssertionError: no gradient found for the trained weight after backward() for 10 steps in a row; this is a bug; training cannot continue

The error occurred somewhere after step 39000 while training a linear 1,2,4,2,1 network with dropout turned on. I suspect dropout is the culprit, as (A) it's a new feature, (B) I've never encountered this before without it, and (C) it happened on my very first time using dropout.

Steps to reproduce the problem

Train a 1,2,4,2,1 linear hypernetwork with dropout turned on for a long time.

I'll be trying another test, a 1,2,4,2,1 swish hypernetwork with dropout, to see if the same thing happens.

What should have happened?

The error should not have occurred. If an unrecoverable error occurs, training should automatically roll back to the last good hypernetwork and try to continue. If dropout is to blame, the error shouldn't recur at the same spot, since dropout is probabilistic and the dropped units are semi-random.
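As an illustration of that roll-back idea, here is a minimal, self-contained PyTorch sketch; nn.Linear stands in for the hypernetwork layers, and none of this is the webui's actual training loop:

```python
import copy

import torch
import torch.nn as nn

# Toy stand-in for the hypernetwork training loop: instead of asserting after
# 10 consecutive grad-less steps, roll back to the last good state dict.
model = nn.Linear(4, 1)                        # placeholder for the hypernetwork layers
opt = torch.optim.AdamW(model.parameters(), lr=5e-5)
last_good = copy.deepcopy(model.state_dict())
steps_without_grad = 0

for step in range(1000):
    x = torch.randn(8, 4)
    loss = model(x).pow(2).mean()              # dummy loss
    opt.zero_grad()
    loss.backward()

    if model.weight.grad is None:              # the condition the assertion checks for
        steps_without_grad += 1
    else:
        steps_without_grad = 0
        last_good = copy.deepcopy(model.state_dict())   # remember the last good weights

    if steps_without_grad >= 10:
        model.load_state_dict(last_good)       # roll back instead of raising
        steps_without_grad = 0
        continue

    opt.step()
```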

Commit where the problem happens

6bd6154a92eb05c80d66df661a38f8b70cc13729

What platforms do you use to access UI ?

Linux

What browsers do you use to access the UI ?

Google Chrome

Command Line Arguments

--listen --port <myport> --gradio-auth <myauth> --hide-ui-dir-config

Additional information, context and logs

No response

enn-nafnlaus avatar Oct 23 '22 20:10 enn-nafnlaus

Having the same problem.

Traceback (most recent call last):
  File "E:\SD\a11\stable-diffusion-webui\modules\ui.py", line 223, in f
    res = list(func(*args, **kwargs))
  File "E:\SD\a11\stable-diffusion-webui\webui.py", line 63, in f
    res = func(*args, **kwargs)
  File "E:\SD\a11\stable-diffusion-webui\modules\hypernetworks\ui.py", line 47, in train_hypernetwork
    hypernetwork, filename = modules.hypernetworks.hypernetwork.train_hypernetwork(*args)
  File "E:\SD\a11\stable-diffusion-webui\modules\hypernetworks\hypernetwork.py", line 396, in train_hypernetwork
    assert steps_without_grad < 10, 'no gradient found for the trained weight after backward() for 10 steps in a row; this is a bug; training cannot continue'
AssertionError: no gradient found for the trained weight after backward() for 10 steps in a row; this is a bug; training cannot continue
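For context, the assertion that fires is a per-step gradient check. The snippet below is a paraphrase reconstructed from the error message and traceback above, not a copy of the repository code:

```python
def update_grad_counter(weight, steps_without_grad):
    """Paraphrase of the failing check; reconstructed from the traceback,
    not copied from modules/hypernetworks/hypernetwork.py."""
    if weight.grad is None:      # backward() left no gradient on the trained weight
        steps_without_grad += 1
    else:
        steps_without_grad = 0

    assert steps_without_grad < 10, (
        'no gradient found for the trained weight after backward() for 10 '
        'steps in a row; this is a bug; training cannot continue')
    return steps_without_grad
```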

cvar66 avatar Oct 23 '22 20:10 cvar66

I can confirm this also happened on my end.

512x512 images, learning rate 0.000005, HN layer structure 1, 2, 1.

Additionally, when I started retraining, I lost several thousand steps of progress (about 7000); it didn't start exactly where it left off, but several thousand steps prior.

Subarasheese avatar Oct 24 '22 00:10 Subarasheese

Everyone who's getting this, do you have dropout enabled? Would be nice to clarify whether that's the issue.

enn-nafnlaus avatar Oct 24 '22 00:10 enn-nafnlaus

I don't have dropout enabled. I have almost everything at default except the learning rate at 0.00005.

cvar66 avatar Oct 24 '22 00:10 cvar66

Everyone who's getting this, do you have dropout enabled? Would be nice to clarify whether that's the issue.

No, it is using the default settings (disabled). I am also using "relu" activation function, if that matters (also a default setting).

Subarasheese avatar Oct 24 '22 00:10 Subarasheese

I can confirm this also happened on my end.

512x512 images, learning rate 0.000005, HN layer structure 1, 2, 1.

Additionally, when I started retraining, I lost several thousand steps of progress (about 7000); it didn't start exactly where it left off, but several thousand steps prior.

I probably only avoided losing everything because I had interrupted the process earlier (which saved the merged hypernetwork).

My suggestion for the meantime, while the root issue isn't fixed, is to at least add exception handling that saves the current training step to the merged .pt and automatically retries the failing step while logging that an error occurred.
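A rough sketch of that suggestion (plain PyTorch, not webui code; `module` and `train_one_step` are stand-ins for the real hypernetwork object and training iteration):

```python
import torch

def guarded_step(module, train_one_step, pt_path, max_retries=3):
    """Save progress to the merged .pt and retry a failing step instead of
    aborting the whole run. Illustrative only; names are stand-ins."""
    for attempt in range(1, max_retries + 1):
        try:
            return train_one_step()
        except AssertionError as err:
            torch.save(module.state_dict(), pt_path)   # keep progress on disk
            print(f'training step failed ({err}); retrying {attempt}/{max_retries}')
    raise RuntimeError(f'training step still failing after {max_retries} retries')
```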

Subarasheese avatar Oct 24 '22 01:10 Subarasheese

OK, I seem to have got this working now by starting the webui with no flags.

cvar66 avatar Oct 24 '22 20:10 cvar66

I was having this issue after every log image export; I found that removing the --medvram flag solved it for me.

dekaikiwi avatar Oct 26 '22 15:10 dekaikiwi

This error happens when the --medvram flag is on and the webui tries to generate a preview image.
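If --medvram plus preview generation is the trigger, one hedged guess (not the webui's actual fix) is that the preview pass leaves the trained tensors in a state where backward() no longer reaches them. A generic PyTorch pattern for protecting the trainable flags across a preview pass, with `generate_preview` as a hypothetical callback, would look like this:

```python
import torch

def preview_without_breaking_training(weights, generate_preview):
    """Generic pattern, not webui code: run a preview under no_grad() and
    restore each weight's requires_grad flag afterwards."""
    saved_flags = [w.requires_grad for w in weights]
    try:
        with torch.no_grad():
            return generate_preview()          # hypothetical preview callback
    finally:
        for w, flag in zip(weights, saved_flags):
            w.requires_grad_(flag)             # make sure training state is restored
```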

aria1th avatar Oct 27 '22 12:10 aria1th

Is this definitive? I will try without it anyway and see if it works!

frosta95 avatar Oct 29 '22 19:10 frosta95

Got the same issue with COMMANDLINE_ARGS=--deepdanbooru --medvram, layer normalization and dropout enabled.

trojblue avatar Nov 05 '22 05:11 trojblue

I've got an RTX 2060 Super. Not enough to do Dreambooth, but almost enough to do Hypernetworks... I just didn't get good results at 0.00001 and 20K+ steps out of the gate. So I followed the guides for starting at 2000 steps and 0.00005.

I got repeated crashes as soon as it generated an image to save. It hadn't done that previously, so I was confused.

My knee-jerk reaction was to add --medvram... and that brought me here.

However, since my last Hypernetwork training run I'd added --deepdanbooru. I removed it along with --xformers, and poof, it runs just fine with no crashes. So far, so good.

DoughyInTheMiddle avatar Nov 23 '22 05:11 DoughyInTheMiddle