Add hypernetwork training rate autolearning based on preview image differentials

Open enn-nafnlaus opened this issue 2 years ago • 48 comments

Problem: managing training rates with hypernetworks is a pain.

What humans do: look at the preview image(s) and if they seem to be changing too quickly, lower the learning rate (or vice versa)

What this patch does: automate what humans do.

Usage: Instead of specifying a single number as the learning rate, or a comma-separated list of learning rates and cycle numbers, the user can optionally specify the training rate as:

=Step0LearningRate/DesiredImageChangeRate/HalfLife

...where

Step0LearningRate is what it says on the tin - the learning rate it starts out with on step 0.

DesiredImageChangeRate is how much the user would like to see the preview images change with each generation, as a decimal fraction (for example, 0.08 = 8% image difference).

HalfLife is the number of steps over which DesiredImageChangeRate halves. So for example for =1e-6/0.08/30000, at step 0 the desired change rate would be 8%, at step 30k it would be 4%, at step 60k it'd be 2%, and so forth.

The latter two parameters are optional; the defaults are 8% and 30000 steps, respectively.
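
In other words, the desired change rate decays exponentially with a fixed half-life. A minimal sketch of that schedule (the function and parameter names are illustrative, not the actual webui code):

def target_change_rate(step: int,
                       desired_rate: float = 0.08,
                       half_life: int = 30000) -> float:
    """Desired per-preview image change rate at a given step.
    Halves every half_life steps: 8% at step 0, 4% at 30k, 2% at 60k."""
    return desired_rate * 0.5 ** (step / half_life)

# Example: reproduces the =1e-6/0.08/30000 schedule described above.
for s in (0, 30000, 60000):
    print(s, round(target_change_rate(s), 4))   # 0.08, 0.04, 0.02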

Features

Stability: While it does not guarantee no blow-ups, it seems to be more stable and less of a PITA than manual rate specification.

Caution: It is capable of ramping learning rates down quickly, by as much as a full order of magnitude, in response to rapid image changes. By contrast, ramping up cannot exceed 30% per preview image cycle, and 75% of the value of the new learning rate is based on the old learning rate. Aka, the NN transitioning from one plateau to the next isn't a problem.

Resumption: The user can resume at any point without changing the rate, and it will pick up where it left off. If there is an .optim file, it uses the last rate in the .optim file. If there is none, it makes a pessimistic guess at the rate; it then readjusts up to the desired image change rate over the coming preview cycles.

Annealing: Learning rates fluctuate up and down, usually twofold or so. This adds a small annealing impact to the learning process, which is generally seen as beneficial.
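
To make the Caution behavior concrete, here's a rough sketch of the described policy (illustrative only - this is not the actual code in learn_schedule.py, whose exact factors differ):

def adjust_learn_rate(old_lr: float,
                      observed_diff: float,
                      target_diff: float) -> float:
    """Sketch: rescale the learning rate toward the desired image change rate.
    Ramping down is fast (up to 10x per preview cycle); ramping up is capped
    and blended heavily toward the old rate, so stepping off a plateau
    doesn't trigger a runaway increase."""
    proposed = old_lr * target_diff / max(observed_diff, 1e-12)
    if proposed < old_lr:
        return max(proposed, 0.1 * old_lr)      # quick ramp down, at most 10x
    proposed = min(proposed, 1.3 * old_lr)      # slow ramp up, at most +30%
    return 0.75 * old_lr + 0.25 * proposed      # 75% of the new rate comes from the old

# Example: a 16% observed change vs. an 8% target halves the rate;
# a 1% observed change vs. an 8% target only nudges it upward.
print(adjust_learn_rate(1e-6, 0.16, 0.08))   # 5e-07
print(adjust_learn_rate(1e-6, 0.01, 0.08))   # ~1.08e-06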

Limitations

Not magic: While it helps resist blowups, it does not prevent them.

  1. If you specify too high of a step 0 learning rate, it can blow up before it even really gets going.

  2. If you generate previews too infrequently, you might go from "everything's running just fine" to "blown up" with no previews in-between. This isn't common in my experience, but if you try to push it too hard it might happen.

  3. If you only generate preview images for one seed, you might not get a good idea of how the model as a whole is changing. Pull request #4343, which allows one to generate multiple preview images as a grid, is useful here.

  4. Of course, if you generate images too frequently and too many seeds at once, you'll slow down your generation, so there's a balance to be struck.

  5. It's possible to get a "slow blowup", without any radical movements. This generally happens if you push your luck too far, like going with a half-life of say 80k steps or whatnot, aka trying to keep the model making large changes for very long periods of time. Basically, the autolearning system will block the model's attempts at quick blowups until it finally finds a way to pull off a slow blowup that sneaks through.

So to repeat: it helps, but it's not magic. Stick within reasonable bounds and it makes training a more pleasant experience. :)

Future possibilities

I wanted to also implement two auto-rollback systems:

Rollback to the last checkpoint and slow down if there are sudden radical changes in the image. Basically, step off plateaus more gently.

Rollback >= 10k steps and slow down if the loss rate gets too high. Basically, if it's clearly blown up and you're getting loss rates like 0.3 or whatnot, jump way back.

Unfortunately, I can't do this because of the memory leak; you can't restart training without using up VRAM and eventually crashing. That said: if someone finally fixes the memory leak, I'll implement this auto-rollback functionality.
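
For illustration only, here is roughly what such a rollback policy could look like. Nothing like this exists in the patch; the thresholds, step counts, and names below are all hypothetical:

from dataclasses import dataclass
from typing import Optional

@dataclass
class RollbackDecision:
    steps_back: int     # how far to rewind
    lr_scale: float     # how much to slow the learning rate after rewinding

def check_rollback(image_diff: float, target_diff: float,
                   loss: float) -> Optional[RollbackDecision]:
    """Hypothetical sketch: decide whether to roll back and slow down."""
    if loss > 0.3:                      # clearly blown up: jump way back
        return RollbackDecision(steps_back=10000, lr_scale=0.5)
    if image_diff > 4 * target_diff:    # sudden radical image change
        return RollbackDecision(steps_back=1000, lr_scale=0.7)   # ~last checkpoint
    return None

# Example: a loss of 0.35 would trigger the big (>= 10k step) rollback.
print(check_rollback(image_diff=0.05, target_diff=0.08, loss=0.35))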

enn-nafnlaus avatar Nov 09 '22 00:11 enn-nafnlaus

Question. I set 5e-5 for InitialLearningRate and started training. ( =5e-5/0.08/30000)

However, when I check the training log

Training at a rate of 5e-06

and it is running at a rate of 5e-06.

Is this the intended behavior?

tsukimiya avatar Nov 09 '22 19:11 tsukimiya

Question. I set 5e-5 for InitialLearningRate and started training. ( =5e-5/0.08/30000)

However, when I check the training log

Training at a rate of 5e-06

and it is running at a rate of 5e-06.

Is this the intended behavior?

Are you sure you're not resuming an earlier training run that was training at 5e-06 when it left off? The initial training rate you specified is only for step 0. When it resumes an older run, it always starts where that run left off (if there's an optim file), to avoid the risk of a blowup, as mentioned above (if there's no optim file, it makes a pessimistic guess). It's designed so that you don't have to change anything between resumptions - you can interrupt it and restart it at will. That couldn't happen if you had to re-specify the initial rate each time.

It's also not important that it's lower than you intended. After a number of preview cycles it'll figure out where it should be.

If you're sure it wasn't resumed and is typed correctly:

  • Could you post your full output, leading up to that line?
  • Consider adding more print statements, such as printing all the parameters to init in LearnRateScheduler.

I'm going to reword "Initial" to "Step 0" in the initial post, to help avoid confusion.

enn-nafnlaus avatar Nov 09 '22 20:11 enn-nafnlaus

I also put the initial rate at 1e-6 and the first rate the log showed was:

And after the first two images:

Auto-learning selected.
Training at a rate of 1e-07
Preparing dataset...
100%|████████████████████████████████████████████████████████████████████████████████| 780/780 [00:28<00:00, 27.85it/s]
Mean loss of 390 elements
100%|██████████████████████████████████████████████████████████████████████████████████| 20/20 [00:02<00:00,  8.61it/s]
100%|██████████████████████████████████████████████████████████████████████████████████| 20/20 [00:02<00:00,  8.58it/s]
Image differential=0.03755098581314087, target=0.0772802627879707, lr=1e-07->1.3703031129564737e-07
dataset loss:0.129±(0.008):   1%|▋                                               | 587/40000 [03:22<3:34:32,  3.06it/s]
Total progress:   0%|                                                           | 40/800000 [01:28<32:04:24,  6.93it/s]

Line 87 of learn_schedule.py, you are multiplying it by 0.1:

self.learn_rate = 0.1 * self.initial_learn_rate * (self.target_image_differential / self.max_image_differential) ** (1 + (cur_step / self.differential_halflife))  # Be very pessimistic, as we lack an optimizer, so training tends to explode.

Still doing a test run, at least it's increasing the learning rate steadily to match the target change rate.

Heathen avatar Nov 09 '22 21:11 Heathen

I also put the initial rate at 1e-6 and the first rate the log showed was:

And after the first two images:

Auto-learning selected.
Training at a rate of 1e-07
Preparing dataset...
100%|████████████████████████████████████████████████████████████████████████████████| 780/780 [00:28<00:00, 27.85it/s]
Mean loss of 390 elements
100%|██████████████████████████████████████████████████████████████████████████████████| 20/20 [00:02<00:00,  8.61it/s]
100%|██████████████████████████████████████████████████████████████████████████████████| 20/20 [00:02<00:00,  8.58it/s]
Image differential=0.03755098581314087, target=0.0772802627879707, lr=1e-07->1.3703031129564737e-07
dataset loss:0.129±(0.008):   1%|▋                                               | 587/40000 [03:22<3:34:32,  3.06it/s]
Total progress:   0%|                                                           | 40/800000 [01:28<32:04:24,  6.93it/s]

Line 87 of learn_schedule.py, you are multiplying it by 0.1:

self.learn_rate = 0.1 * self.initial_learn_rate * (self.target_image_differential / self.max_image_differential) ** (1 + (cur_step / self.differential_halflife))  # Be very pessimistic, as we lack an optimizer, so training tends to explode.

Still doing a test run, at least it's increasing the learning rate steadily to match the target change rate.

Oh dang, you're right - I haven't started a from-scratch training run since I added in that new functionality. That's the code that runs if there's no optim file, to pick a pessimistic rate to resume at. But obviously when you start a new run, there's no .optim file!

Will fix that and retest this evening. :)

enn-nafnlaus avatar Nov 09 '22 21:11 enn-nafnlaus

Also got this when I interrupted:

Traceback (most recent call last):
  File "\modules\ui.py", line 189, in f
    res = list(func(*args, **kwargs))
  File "\webui.py", line 54, in f
    res = func(*args, **kwargs)
  File "\modules\hypernetworks\ui.py", line 50, in train_hypernetwork
    hypernetwork, filename = modules.hypernetworks.hypernetwork.train_hypernetwork(*args)
  File "\modules\hypernetworks\hypernetwork.py", line 621, in train_hypernetwork
    hypernetwork.eval()
AttributeError: 'Hypernetwork' object has no attribute 'eval'

Heathen avatar Nov 09 '22 21:11 Heathen

Also got this when I interrupted:

Traceback (most recent call last):
  File "\modules\ui.py", line 189, in f
    res = list(func(*args, **kwargs))
  File "\webui.py", line 54, in f
    res = func(*args, **kwargs)
  File "\modules\hypernetworks\ui.py", line 50, in train_hypernetwork
    hypernetwork, filename = modules.hypernetworks.hypernetwork.train_hypernetwork(*args)
  File "\modules\hypernetworks\hypernetwork.py", line 621, in train_hypernetwork
    hypernetwork.eval()
AttributeError: 'Hypernetwork' object has no attribute 'eval'

Will check into it.

enn-nafnlaus avatar Nov 09 '22 21:11 enn-nafnlaus

Are you sure you're not resuming an earlier training run that was training at 5e-06 when it left off?

A new hypernetwork named "test-v1" is created. Settings: [screenshot]

Next I set the "Hypernetwork learning rate" to "=5e-5/0.08/30000" and started training. [screenshot]

Execution log.

[1.0, 1.5, 1.5, 1.5, 1.0]
Activation function is softsign
Weight initialization is Normal
Layer norm is set to False
Dropout usage is set to False
Activate last layer is set to False
Dropout structure is set to [0.0, 0.05, 0.15, 0.15, 0.0]
Optimizer name is AdamW
No saved optimizer exists in checkpoint

Auto-learning selected.                                                      

Total progress:   0%|               | 2240/1400000 [1:26:33<70:13:07,  5.53it/s]Training at a rate of 5e-06
Preparing dataset...
100%|█████████████████████████████████████████| 224/224 [00:03<00:00, 61.00it/s]

hypernetwork_loss.csv outputs this

step,epoch,epoch_step,loss,learn_rate
1,0,1,0.0000000,5e-06
2,0,2,0.1612930,5e-06
3,0,3,0.3753136,5e-06
4,0,4,0.2555241,5e-06
5,0,5,0.2524809,5e-06
6,0,6,0.2043646,5e-06
7,0,7,0.1715311,5e-06
8,0,8,0.1653065,5e-06
9,0,9,0.1565883,5e-06
20,0,20,0.1138077,5e-06
21,0,21,0.1084432,5e-06
22,0,22,0.1041450,5e-06
23,0,23,0.1019337,5e-06
24,0,24,0.1130633,5e-06
...

tsukimiya avatar Nov 09 '22 21:11 tsukimiya

Are you sure you're not resuming an earlier training run that was training at 5e-06 when it left off?

A new hypernetwork named "test-v1" is created. Settings: [screenshot]

Next I set the "Hypernetwork learning rate" to "=5e-5/0.08/30000" and started training. [screenshot]

Execution log.

[1.0, 1.5, 1.5, 1.5, 1.0]
Activation function is softsign
Weight initialization is Normal
Layer norm is set to False
Dropout usage is set to False
Activate last layer is set to False
Dropout structure is set to [0.0, 0.05, 0.15, 0.15, 0.0]
Optimizer name is AdamW
No saved optimizer exists in checkpoint

Auto-learning selected.                                                      

Total progress:   0%|               | 2240/1400000 [1:26:33<70:13:07,  5.53it/s]Training at a rate of 5e-06
Preparing dataset...
100%|█████████████████████████████████████████| 224/224 [00:03<00:00, 61.00it/s]

hypernetwork_loss.csv outputs this

step,epoch,epoch_step,loss,learn_rate
1,0,1,0.0000000,5e-06
2,0,2,0.1612930,5e-06
3,0,3,0.3753136,5e-06
4,0,4,0.2555241,5e-06
5,0,5,0.2524809,5e-06
6,0,6,0.2043646,5e-06
7,0,7,0.1715311,5e-06
8,0,8,0.1653065,5e-06
9,0,9,0.1565883,5e-06
20,0,20,0.1138077,5e-06
21,0,21,0.1084432,5e-06
22,0,22,0.1041450,5e-06
23,0,23,0.1019337,5e-06
24,0,24,0.1130633,5e-06
...

Heathen beat you to it :) I already pushed (though haven't yet tested) a fix for that.

https://github.com/AUTOMATIC1111/stable-diffusion-webui/pull/4509#issuecomment-1309379974

enn-nafnlaus avatar Nov 09 '22 21:11 enn-nafnlaus

Also got this when I interrupted:

Traceback (most recent call last):
  File "\modules\ui.py", line 189, in f
    res = list(func(*args, **kwargs))
  File "\webui.py", line 54, in f
    res = func(*args, **kwargs)
  File "\modules\hypernetworks\ui.py", line 50, in train_hypernetwork
    hypernetwork, filename = modules.hypernetworks.hypernetwork.train_hypernetwork(*args)
  File "\modules\hypernetworks\hypernetwork.py", line 621, in train_hypernetwork
    hypernetwork.eval()
AttributeError: 'Hypernetwork' object has no attribute 'eval'

Will check into it.

Seems to have been a merge error when creating this branch; it wasn't in my original code. I've pushed a fix (removed the stray eval() command), though I still need to test it.

enn-nafnlaus avatar Nov 09 '22 21:11 enn-nafnlaus

@enn-nafnlaus Thank you! I will try it later. I appreciate your work very much.

tsukimiya avatar Nov 09 '22 21:11 tsukimiya

@enn-nafnlaus Thank you! I will try it later. I appreciate your work very much.

Just trying to make training at least somewhat of a less painful experience! :)

enn-nafnlaus avatar Nov 09 '22 21:11 enn-nafnlaus

I have confirmed that it is working as intended :)

tsukimiya avatar Nov 09 '22 21:11 tsukimiya

Is there any way for it to check the image contrast or color saturation? That would be an indication of overtraining. Maybe the last five image-difference values could be taken into consideration, weighted from newest to oldest, to make a smooth gradient, kind of like a dynamic compressor. I also found it decreasing the learning rate a lot; after about 10k steps it's at 5e-28...
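
For illustration, that kind of newest-weighted smoothing over the last few preview differentials might look like this (purely hypothetical, not part of the patch):

def smoothed_differential(history: list[float]) -> float:
    """Weighted average of up to the last five differentials, newest weighted
    heaviest, so one noisy preview can't swing the learning rate on its own."""
    recent = history[-5:]
    weights = range(1, len(recent) + 1)   # oldest=1 ... newest=5
    return sum(w * d for w, d in zip(weights, recent)) / sum(weights)

# Example: a spike in the newest differential is damped by the older ones.
print(smoothed_differential([0.02, 0.03, 0.02, 0.05, 0.20]))   # ~0.089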

razord93 avatar Nov 10 '22 06:11 razord93

I found one problem. If the HalfLife option is set, it does not stop as per the Max steps option.

Example: learning rate: =1e-6/0.08/1000, Max steps: 2000

Log:

Image differential=0.021260927120844524, target=0.010006933874625805, lr=1.0315352612096342e-07->3.489711822977459e-08

dataset loss:0.154±(0.021): : 3043it [17:06,  2.96it/s]30:21<1:47:58,  6.11it/s]

For testing, I am running with a small number of steps, but the same problem occurred when I set LR: 5e-5/0.08/20000, Max steps: 40000.

There does not appear to be any other bugs other than this.

tsukimiya avatar Nov 10 '22 06:11 tsukimiya

I found one problem. If the HalfLife option is set, it does not stop as per the Max steps option.

Example: learning rate: =1e-6/0.08/1000, Max steps: 2000

Log:

Image differential=0.021260927120844524, target=0.010006933874625805, lr=1.0315352612096342e-07->3.489711822977459e-08

dataset loss:0.154±(0.021): : 3043it [17:06,  2.96it/s]30:21<1:47:58,  6.11it/s]

For testing, I am running with a small number of steps, but the same problem occurred when I set LR: 5e-5/0.08/20000, Max steps: 40000.

There does not appear to be any other bugs other than this.

Oops, that may well be - I'd been testing with very large max_steps, so that I could make sure that the step changes were logical as the steps added up.

Part of the problem with tests taking so long is that it limits the number of tests I can do, so thank you all for helping out with that!

Will check into it and fix it this evening when I'm home from work. :)

enn-nafnlaus avatar Nov 10 '22 10:11 enn-nafnlaus

Is there any way for it to check the image contrast or color saturation? That would be an indication of overtraining. Maybe the last five image-difference values could be taken into consideration, weighted from newest to oldest, to make a smooth gradient, kind of like a dynamic compressor. I also found it decreasing the learning rate a lot; after about 10k steps it's at 5e-28...

Do you mean changes (deltas) of saturation in HSV space (rather than the current measurement of changes in RGB space), or do you mean looking at the absolute saturation?

All of them are possible, and even more to the point, not difficult to implement (RGB->HSV transformations are simple enough). But I'm not sure how good of a measure of a blowup it would actually be. What do others think?
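
For reference, a minimal sketch of such a comparison (assuming Pillow and NumPy; the file names are placeholders, and this is illustrative only, not the measurement the patch actually uses):

import numpy as np
from PIL import Image

def mean_abs_diff(a: Image.Image, b: Image.Image, space: str = "RGB") -> float:
    """Mean absolute per-channel difference between two preview images,
    normalized to [0, 1], in either RGB or HSV space. (Hue is circular,
    which this simple version ignores.)"""
    x = np.asarray(a.convert(space), dtype=np.float32) / 255.0
    y = np.asarray(b.convert(space), dtype=np.float32) / 255.0
    return float(np.mean(np.abs(x - y)))

# Usage: compare consecutive preview images in both spaces.
# prev, cur = Image.open("preview_0100.png"), Image.open("preview_0200.png")
# print(mean_abs_diff(prev, cur, "RGB"), mean_abs_diff(prev, cur, "HSV"))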

enn-nafnlaus avatar Nov 10 '22 10:11 enn-nafnlaus

I've always noticed that when I train, say, someone with a red leather jacket, it gradually gets more specular detail and becomes an over-saturated, over-contrasted jacket; a lot more contrast gets added just before it blows up. Maybe something like taking multiple weighted measures, like saturation and luminance, and maybe also using the training data as a reference to guide it toward being more correct, rather than a mild smoothing filter. I found something about comparing histograms - is it possible to normalize the image, spit out how much the normalized image changed by, and return that?

razord93 avatar Nov 10 '22 11:11 razord93

Maybe I don't understand what close looks like in latent space, but it seems to me that this idea is nonsense.

Not only are pictures for a given prompt wildly different normally because of noise, but when a hypernetwork is learning it's changing semantic layers as well as output layers, so there should sometimes be a complete difference in content from one query to another.

differentprogramming avatar Nov 10 '22 12:11 differentprogramming

Maybe I don't understand what close looks like in latent space, but it seems to me that this idea is nonsense.

Not only are pictures for a given prompt wildly different normally because of noise, but when a hypernetwork is learning it's changing semantic layers as well as output layers,

Every part of your post is addressed in the first post. Please re-read it. :)

If you have anything new to add that's not addressed there, let me know! :) And FYI, I use this "nonsense" method (which simply automates what humans already do) in all my training now, with far less pain than manual specification. And you can still use both preexisting methods of learning rate specification, and indeed, they're still default.

Let me in particular highlight these sections in response to your post.

=========================

What humans do: look at the preview image(s) and if they seem to be changing too quickly, lower the learning rate (or vice versa)

What this patch does: automate what humans do.

=========================

=========================

Features

Stability: While it does not guarantee no blow-ups, it seems to be more stable and less of a PITA than manual rate specification.

Caution: It is capable of ramping learning rates down quickly, by as much as a full order of magnitude, in response to rapid image changes. By contrast, ramping up cannot exceed 30% per preview image cycle, and 75% of the value of the new learning rate is based on the old learning rate. Aka, the NN transitioning from one plateau to the next isn't a problem.

Resumption: The user can resume at any point without changing the rate, and it will pick up where it left off. If there is an .optim file, it uses the last rate in the .optim file. If there is none, it makes a pessimistic guess at the rate; it then readjusts up to the desired image change rate over the coming preview cycles.

Annealing: Learning rates fluctuate up and down, usually twofold or so. This adds a small annealing impact to the learning process, which is generally seen as beneficial.

=========================

=========================

  1. If you only generate preview images for one seed, you might not get a good idea of how the model as a whole is changing. Pull request #4343 ("Support for generating image grids as previews in hypernetwork training"), which allows one to generate multiple preview images as a grid, is useful here.

=========================

=========================

So to repeat: it helps, but it's not magic. Stick within reasonable bounds and it makes training a more pleasant experience. :)

=========================

enn-nafnlaus avatar Nov 10 '22 15:11 enn-nafnlaus

I found something about comparing histograms, is it possible to normalize the image and spit out how much the normalized imaged changed by and return that?

This evening I'll see about creating some deliberate blowups and printing out RGB and HSV to see if one set of parameters is better correlated :)

enn-nafnlaus avatar Nov 10 '22 15:11 enn-nafnlaus

"What humans do: look at the preview image(s) and if they seem to be changing too quickly, lower the learning rate (or vice versa)"

I'm also not aware of that being what humans do.

differentprogramming avatar Nov 10 '22 18:11 differentprogramming

"What humans do: look at the preview image(s) and if they seem to be changing too quickly, lower the learning rate (or vice versa)"

I'm also not aware of that being what humans do.

How do you do it, if not by looking at the preview images? Surely you're not looking at the preview latents, or even more extreme, the model weights, and making judgment calls by that. Can't do it from loss either (I tried automating that, doesn't work; loss is not a predictor of "will blow up soon", only "it's already blown up").

enn-nafnlaus avatar Nov 10 '22 18:11 enn-nafnlaus

I'm also not aware of that being what humans do.

I'm a human and I do it. [screenshot]

Heathen avatar Nov 10 '22 18:11 Heathen

Run it till it blows up and see if it gets where you want before it's injured. No? Go back to a checkpoint or start over from scratch with different settings, make up an annealing schedule, try again.

differentprogramming avatar Nov 10 '22 18:11 differentprogramming

I found one problem. If the HalfLife option is set, it does not stop as per the Max steps option.

Example: learning rate: =1e-6/0.08/1000, Max steps: 2000

Log:

Image differential=0.021260927120844524, target=0.010006933874625805, lr=1.0315352612096342e-07->3.489711822977459e-08

dataset loss:0.154±(0.021): : 3043it [17:06,  2.96it/s]30:21<1:47:58,  6.11it/s]

For testing, I am running with a small number of steps, but the same problem occurred when I set LR: 5e-5/0.08/20000, Max steps: 40000.

There does not appear to be any other bugs other than this.

Seems to be fixed - go ahead and check it out :)

Will do HSV vs. RGB testing for guidance in a bit by deliberately making a couple runs blow up. :)

enn-nafnlaus avatar Nov 10 '22 18:11 enn-nafnlaus

Run it till it blows up and see if it gets where you want before it's injured. No? Go back to a checkpoint or start over from scratch with different settings, make up an annealing schedule, try again.

  • How do you see if it's blown up, if not visually?
  • This does look to see if there are signs of it "blowing up". It just doesn't wait for it to go all the way.
  • If it's just stepping off a plateau, no harm done; it just ramps back up.
  • You can still do your approach of letting it fully explode with this.
  • You don't have to use this; it's not even the default.
  • I'd automate rolling back too if not for the memory leak.
  • This does add some annealing (inherently).

I just don't understand what your issue is with this feature existing. There are a million features in this app that I never use; I'm not mad about them existing if some people like them. I mean, if you like more frequent blowups and restarts, power to you.

enn-nafnlaus avatar Nov 10 '22 18:11 enn-nafnlaus

I don't think you can automate knowing that it's injured. The progression is:

  1. loses coherence,
  2. stops following the prompt,
  3. artifacts,
  4. totally messed up
  5. dead

I don't think you can automatically detect anything before step 4. You can fool yourself that you can, but you can't. And by then it's irretrievable.

differentprogramming avatar Nov 10 '22 18:11 differentprogramming

Actually it may be irretrievable at step 1!

differentprogramming avatar Nov 10 '22 18:11 differentprogramming

Maybe there's a technical way to know when it's dead? It generates NaNs?

differentprogramming avatar Nov 10 '22 18:11 differentprogramming

Seems to be fixed - go ahead and check it out :)

I have confirmed that it stops exactly at the Max steps I set. Thanks for the fix. :)

I have tested this feature many times and honestly I am not sure of its usefulness yet. There may be more optimal methods. Confirming that requires more time and computing resources. For this reason, I believe more people should be able to try this feature.

tsukimiya avatar Nov 10 '22 22:11 tsukimiya