
[Bug]: "no gradient found for the trained weight after backward() for 10 steps in a row" while training a hypernetwork.

Open enn-nafnlaus opened this issue 3 years ago • 11 comments

Is there an existing issue for this?

  • [X] I have searched the existing issues and checked the recent builds/commits

What happened?

AssertionError: no gradient found for the trained weight after backward() for 10 steps in a row; this is a bug; training cannot continue

The error occurred somewhere after step 39000 while training a linear 1,2,4,2,1 network with dropout turned on. I suspect dropout is the culprit, as (A) it's a new feature, (B) I've never encountered this before without it, and (C) it happened on my very first time using dropout.

Steps to reproduce the problem

Train a 1,2,4,2,1 linear hypernetwork with dropout turned on for a long time.

I'll be trying another test, a 1,2,4,2,1 swish hypernetwork with dropout, to see if the same thing happens.

What should have happened?

The error should not have occurred. If an unrecoverable error occurs, training should automatically roll back to the last good hypernetwork and try to continue. If dropout is to blame, the error shouldn't recur at the same spot, since dropout is probabilistic and the dropped units are semi-random.
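As an illustration of that roll-back idea, here is a minimal, self-contained PyTorch sketch; nn.Linear stands in for the hypernetwork layers, and none of this is the webui's actual training loop:

```python
import copy

import torch
import torch.nn as nn

# Toy stand-in for the hypernetwork training loop: instead of asserting after
# 10 consecutive grad-less steps, roll back to the last good state dict.
model = nn.Linear(4, 1)                        # placeholder for the hypernetwork layers
opt = torch.optim.AdamW(model.parameters(), lr=5e-5)
last_good = copy.deepcopy(model.state_dict())
steps_without_grad = 0

for step in range(1000):
    x = torch.randn(8, 4)
    loss = model(x).pow(2).mean()              # dummy loss
    opt.zero_grad()
    loss.backward()

    if model.weight.grad is None:              # the condition the assertion checks for
        steps_without_grad += 1
    else:
        steps_without_grad = 0
        last_good = copy.deepcopy(model.state_dict())   # remember the last good weights

    if steps_without_grad >= 10:
        model.load_state_dict(last_good)       # roll back instead of raising
        steps_without_grad = 0
        continue

    opt.step()
```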

Commit where the problem happens

6bd6154a92eb05c80d66df661a38f8b70cc13729

What platforms do you use to access UI ?

Linux

What browsers do you use to access the UI ?

Google Chrome

Command Line Arguments

--listen --port <myport> --gradio-auth <myauth> --hide-ui-dir-config

Additional information, context and logs

No response

enn-nafnlaus avatar Oct 23 '22 20:10 enn-nafnlaus

Having the same problem.

Traceback (most recent call last):
  File "E:\SD\a11\stable-diffusion-webui\modules\ui.py", line 223, in f
    res = list(func(*args, **kwargs))
  File "E:\SD\a11\stable-diffusion-webui\webui.py", line 63, in f
    res = func(*args, **kwargs)
  File "E:\SD\a11\stable-diffusion-webui\modules\hypernetworks\ui.py", line 47, in train_hypernetwork
    hypernetwork, filename = modules.hypernetworks.hypernetwork.train_hypernetwork(*args)
  File "E:\SD\a11\stable-diffusion-webui\modules\hypernetworks\hypernetwork.py", line 396, in train_hypernetwork
    assert steps_without_grad < 10, 'no gradient found for the trained weight after backward() for 10 steps in a row; this is a bug; training cannot continue'
AssertionError: no gradient found for the trained weight after backward() for 10 steps in a row; this is a bug; training cannot continue
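For context, the assertion that fires is a per-step gradient check. The snippet below is a paraphrase reconstructed from the error message and traceback above, not a copy of the repository code:

```python
def update_grad_counter(weight, steps_without_grad):
    """Paraphrase of the failing check; reconstructed from the traceback,
    not copied from modules/hypernetworks/hypernetwork.py."""
    if weight.grad is None:      # backward() left no gradient on the trained weight
        steps_without_grad += 1
    else:
        steps_without_grad = 0

    assert steps_without_grad < 10, (
        'no gradient found for the trained weight after backward() for 10 '
        'steps in a row; this is a bug; training cannot continue')
    return steps_without_grad
```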

cvar66 avatar Oct 23 '22 20:10 cvar66

I can confirm this also happened on my end.

512x512 images, learning rate 0.000005, HN layer structure 1, 2, 1.

Additionally, when I started retraining, I lost several thousand steps of progress (about 7000); it didn't start exactly where it left off, but several thousand steps prior.

Subarasheese avatar Oct 24 '22 00:10 Subarasheese

Everyone who's getting this, do you have dropout enabled? Would be nice to clarify whether that's the issue.

enn-nafnlaus avatar Oct 24 '22 00:10 enn-nafnlaus

I don't have dropout enabled. I have almost everything at default except the learning rate at 0.00005.

cvar66 avatar Oct 24 '22 00:10 cvar66

Everyone who's getting this, do you have dropout enabled? Would be nice to clarify whether that's the issue.

No, it is using the default settings (disabled). I am also using "relu" activation function, if that matters (also a default setting).

Subarasheese avatar Oct 24 '22 00:10 Subarasheese

I can confirm this also happened on my end.

512x512 images, learning rate 0.000005, HN layer structure 1, 2, 1.

Additionally, when I started retraining, I lost several thousand steps of progress (about 7000); it didn't start exactly where it left off, but several thousand steps prior.

I probably only avoided losing everything because I had interrupted the process earlier (which saved the merged hypernetwork).

My suggestion for the meantime, while the root issue isn't fixed, is to at least add exception handling that saves the current training step to the merged .pt and automatically retries the failing step while logging that an error occurred.
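A rough sketch of that suggestion (plain PyTorch, not webui code; `module` and `train_one_step` are stand-ins for the real hypernetwork object and training iteration):

```python
import torch

def guarded_step(module, train_one_step, pt_path, max_retries=3):
    """Save progress to the merged .pt and retry a failing step instead of
    aborting the whole run. Illustrative only; names are stand-ins."""
    for attempt in range(1, max_retries + 1):
        try:
            return train_one_step()
        except AssertionError as err:
            torch.save(module.state_dict(), pt_path)   # keep progress on disk
            print(f'training step failed ({err}); retrying {attempt}/{max_retries}')
    raise RuntimeError(f'training step still failing after {max_retries} retries')
```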

Subarasheese avatar Oct 24 '22 01:10 Subarasheese

OK, I seem to have got this working now by starting the webui with no flags.

cvar66 avatar Oct 24 '22 20:10 cvar66

I was having this issue after every log image export; I found that removing the --medvram flag solved it for me.

dekaikiwi avatar Oct 26 '22 15:10 dekaikiwi

This error happens when the --medvram flag is on and the webui tries to generate a preview image.
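If --medvram plus preview generation is the trigger, one hedged guess (not the webui's actual fix) is that the preview pass leaves the trained tensors in a state where backward() no longer reaches them. A generic PyTorch pattern for protecting the trainable flags across a preview pass, with `generate_preview` as a hypothetical callback, would look like this:

```python
import torch

def preview_without_breaking_training(weights, generate_preview):
    """Generic pattern, not webui code: run a preview under no_grad() and
    restore each weight's requires_grad flag afterwards."""
    saved_flags = [w.requires_grad for w in weights]
    try:
        with torch.no_grad():
            return generate_preview()          # hypothetical preview callback
    finally:
        for w, flag in zip(weights, saved_flags):
            w.requires_grad_(flag)             # make sure training state is restored
```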

aria1th avatar Oct 27 '22 12:10 aria1th

Is this definitive? I will try without it anyway and see if it works!

frosta95 avatar Oct 29 '22 19:10 frosta95

Got the same issue with COMMANDLINE_ARGS=--deepdanbooru --medvram, layer normalization and dropout enabled.

trojblue avatar Nov 05 '22 05:11 trojblue

I've got an RTX 2060 Super. Not enough to do Dreambooth, but almost enough to do Hypernetworks... I just didn't get good results at 0.00001 and 20K+ steps out of the gate. So I followed the guides for starting at 2000 steps and 0.00005.

I got repeated crashes as soon as it generated an image to save. It hadn't done that previously, so I was confused.

My knee-jerk reaction was to add --medvram... and that brought me here.

However, since my last Hypernetwork training run I'd added --deepdanbooru. I removed it along with --xformers, and poof, it runs just fine with no crashes. So far, so good.

DoughyInTheMiddle avatar Nov 23 '22 05:11 DoughyInTheMiddle