stable-diffusion-webui
[Bug]: "no gradient found for the trained weight after backward() for 10 steps in a row" while training a hypernetwork.
Is there an existing issue for this?
- [X] I have searched the existing issues and checked the recent builds/commits
What happened?
AssertionError: no gradient found for the trained weight after backward() for 10 steps in a row; this is a bug; training cannot continue
Occurred somewhere after step 39000 while training a linear 1,2,4,2,1 network with dropout turned on. I suspect dropout is the culprit: (a) it's a new feature, (b) I've never encountered this error before without it, and (c) it happened the very first time I used dropout.
Steps to reproduce the problem
Train a 1,2,4,2,1 linear hypernetwork with dropout turned on for a long time.
I'll be trying another test, a 1,2,4,2,1 swish hypernetwork with dropout, to see if the same thing happens.
What should have happened?
The error should not have occurred. If an unrecoverable error does occur, training should automatically roll back to the last good hypernetwork and try to continue. If dropout is to blame, the error shouldn't recur at the same spot on a retry, since dropout is stochastic.
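For illustration, a minimal sketch of the roll-back-and-continue behaviour described above, assuming hypothetical `train_step`, `save_checkpoint`, and `load_checkpoint` helpers in place of the real webui training loop:

```python
# Sketch only: the helpers below are hypothetical stand-ins, not webui functions.
def train_with_rollback(hypernetwork, total_steps, save_every, max_consecutive_failures=3):
    last_good_step = 0
    failures = 0
    step = 1
    while step <= total_steps:
        try:
            train_step(hypernetwork, step)
            failures = 0
        except AssertionError as err:
            failures += 1
            if failures > max_consecutive_failures:
                raise  # give up if rolling back does not help
            print(f"step {step}: {err}; rolling back to step {last_good_step}")
            load_checkpoint(hypernetwork, last_good_step)
            # dropout is stochastic, so the same span may succeed on a second pass
            step = last_good_step + 1
            continue
        if step % save_every == 0:
            save_checkpoint(hypernetwork, step)
            last_good_step = step
        step += 1
```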
Commit where the problem happens
6bd6154a92eb05c80d66df661a38f8b70cc13729
What platforms do you use to access the UI?
Linux
What browsers do you use to access the UI?
Google Chrome
Command Line Arguments
--listen --port <myport> --gradio-auth <myauth> --hide-ui-dir-config
Additional information, context and logs
No response
Having the same problem.
Traceback (most recent call last):
  File "E:\SD\a11\stable-diffusion-webui\modules\ui.py", line 223, in f
    res = list(func(*args, **kwargs))
  File "E:\SD\a11\stable-diffusion-webui\webui.py", line 63, in f
    res = func(*args, **kwargs)
  File "E:\SD\a11\stable-diffusion-webui\modules\hypernetworks\ui.py", line 47, in train_hypernetwork
    hypernetwork, filename = modules.hypernetworks.hypernetwork.train_hypernetwork(*args)
  File "E:\SD\a11\stable-diffusion-webui\modules\hypernetworks\hypernetwork.py", line 396, in train_hypernetwork
    assert steps_without_grad < 10, 'no gradient found for the trained weight after backward() for 10 steps in a row; this is a bug; training cannot continue'
AssertionError: no gradient found for the trained weight after backward() for 10 steps in a row; this is a bug; training cannot continue
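For context, the assert in the traceback implies the training loop counts consecutive steps in which the trained weight comes out of backward() with no gradient. A paraphrased sketch of that check (not copied from the repo):

```python
def check_grad_flow(trained_weights, steps_without_grad):
    """Count consecutive steps where the monitored weight has no gradient
    after loss.backward(); ten in a row trips the assert seen above."""
    if trained_weights[0].grad is None:
        steps_without_grad += 1
    else:
        steps_without_grad = 0
    assert steps_without_grad < 10, (
        'no gradient found for the trained weight after backward() for '
        '10 steps in a row; this is a bug; training cannot continue'
    )
    return steps_without_grad
```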
I can confirm this also happened on my end.
512x512 images, learning rate 0.000005, HN layer structure 1, 2, 1.
Additionally, when I started to retrain it, I lost several steps (about 7000) of progress; it didn't resume exactly where it left off but several steps prior.
Everyone who's getting this, do you have dropout enabled? Would be nice to clarify whether that's the issue.
I don't have dropout enabled. I have almost everything at default except the learning rate, at 0.00005.
> Everyone who's getting this, do you have dropout enabled? Would be nice to clarify whether that's the issue.
No, it is using the default settings (disabled). I am also using "relu" activation function, if that matters (also a default setting).
> I can confirm this also happened on my end.
> 512x512 images, learning rate 0.000005, HN layer structure 1, 2, 1.
> Additionally, when I started to retrain it, I lost several steps (about 7000) of progress; it didn't resume exactly where it left off but several steps prior.
I probably only avoided losing everything because I had interrupted the process earlier (and it saved the merged hypernetwork).
My suggestion for the meantime, while the root issue isn't fixed, is to at least add exception handling that saves the current training step to the merged .pt, then automatically retries the failing step while logging the error.
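Roughly what that exception handling could look like; `train_step` is a stand-in for the real training step, and `hypernetwork.save()` is assumed to mirror how the webui writes .pt files:

```python
import os

def safe_train_step(hypernetwork, step, train_step, save_dir, retries=1):
    for attempt in range(retries + 1):
        try:
            return train_step(hypernetwork, step)
        except AssertionError as err:
            print(f"step {step} failed (attempt {attempt + 1}): {err}")
            # save current progress before retrying, so nothing is lost
            hypernetwork.save(os.path.join(save_dir, f"{hypernetwork.name}-{step}-recovery.pt"))
    raise RuntimeError(f"step {step} still failing after {retries + 1} attempts")
```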
OK, I seem to have gotten this working now by starting the webui with no flags.
I was having this issue after every log image export; I found that removing the --medvram flag solved it for me.
This error happens when the --medvram flag is on and it tries to generate a preview.
Is this definitive? I will try without it anyway and see if it works!
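For anyone who wants to test the --medvram/preview theory directly, here is a small diagnostic sketch that logs which trainable weights come back from backward() with no gradient; the `hypernetwork.layers` layout is an assumption about the webui's Hypernetwork class:

```python
def report_missing_grads(hypernetwork):
    # Walk the hypernetwork's modules (assumed layout) and collect every
    # trainable parameter that has no gradient after loss.backward().
    missing = []
    for layers in hypernetwork.layers.values():
        for layer in layers:
            for name, param in layer.named_parameters():
                if param.requires_grad and param.grad is None:
                    missing.append(name)
    if missing:
        print(f"{len(missing)} trained weights have no gradient: {missing}")
    return missing
```

If the list only fills up on the steps that generate a preview image, that would point at the preview/--medvram interaction.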
Got the same issue with COMMANDLINE_ARGS=--deepdanbooru --medvram, layer normalization and dropout enabled.
I've got an RTX 2060 Super. Not enough to do Dreambooth, but almost enough to do hypernetworks. I just didn't get good results at 0.00001 and 20K+ steps out of the gate, so I followed the guides and started at 2000 steps and 0.00005.
Repeated crashes as soon as it generates an image to save. It didn't do that previously, though, so I was confused.
My knee-jerk reaction was to add --medvram, and that brought me here.
However, since my last hypernetwork training run I had added --deepdanbooru. I removed it and --xformers, and poof, it runs just fine with no crashes. So far, so good.