stable-diffusion-webui
[Bug]: Hypernetwork training erratic after #3086 and #3199
Is there an existing issue for this?
- [x] I have searched the existing issues and checked the recent builds/commits
What happened?
After #3086 and #3199 (@discus0434) were merged, hypernetwork training appears to behave incorrectly. Tested with normalization on and off, and with 1,2,1 and 1,2,2,1 layer depths. Default relu initialization. Output images from training do not resemble the style of the dataset even after 25000-50000 steps.
Dataset: https://files.catbox.moe/nurv0l.7z
A 0.75 NAI:WD1.3 merge was used for all images. Same seed, of course.
Output with no hypernetwork:
With a hypernetwork trained on 103 images at a 5e-6 learning rate for 35000 steps, depth 1,2,2,1:
With a hypernetwork created before the update, trained on 53 images at a 5e-6 learning rate for 25000 steps:
Steps to reproduce the problem
- Create a hypernetwork with all modules checked, relu initialization, and normalization enabled.
- Train it on your dataset with default settings and a 5e-6 learning rate (I trained to 35k steps for the example image).
- Observe that there is little difference in generation when the hypernetwork is applied.
What should have happened?
Hypernetwork should have drastically altered generation results.
Commit where the problem happens
6bd6154a92eb05c80d66df661a38f8b70cc13729
What platforms do you use to access the UI?
Windows
What browsers do you use to access the UI?
Mozilla Firefox
Command Line Arguments
--deepdanbooru
Additional information, context and logs
No response
From my experience, you need to start with a higher lr than 5e-6 when using normalization and ReLU. I start with 1e-3, and lower the lr when overfitting seems to occur.
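For what it's worth, you shouldn't need to babysit the rate by hand: as far as I know the learning rate field accepts a schedule of `rate:step` pairs, so a "start high, then decay" run can be written as something like this (the numbers are purely illustrative, not a recommendation):

```
1e-3:3000, 5e-4:8000, 1e-4:15000, 5e-5
```

Each entry is used up to the given step and the final bare value applies for the rest of the run, at least as I understand it; double-check against the wiki before relying on it.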
I'll give it a shot on the current commit with a higher learning rate and see what happens. Either way, the UI or wiki needs to be updated, since the default settings produce almost no results.
Yes, with normalization and relu you need higher learning rates to notice the changes (try enabling dropout; it will help prevent overfitting when using high learning rates). You should still be able to achieve the previous behavior if you choose a simple linear activation without normalization, etc.
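To make that a bit more concrete, a module built from a layer structure like 1,2,2,1 is essentially a small MLP whose hidden widths are multiples of the width of the attention layer it wraps, with the activation, layer normalization, and dropout slotted in between the linear layers. The snippet below is only a simplified sketch of that idea (the function name, defaults, and exact ordering are mine, not the actual webui code):

```python
import torch.nn as nn

def build_module(dim, layer_structure=(1, 2, 1), activation="relu",
                 add_layer_norm=True, dropout_p=0.3):
    """Sketch of a hypernetwork module: widths are dim times the multipliers."""
    acts = {"relu": nn.ReLU, "swish": nn.SiLU, "linear": nn.Identity}
    layers = []
    for i in range(len(layer_structure) - 1):
        layers.append(nn.Linear(int(dim * layer_structure[i]),
                                int(dim * layer_structure[i + 1])))
        if i < len(layer_structure) - 2:  # nothing after the final linear layer
            layers.append(acts[activation]())
            if add_layer_norm:
                layers.append(nn.LayerNorm(int(dim * layer_structure[i + 1])))
            if dropout_p > 0:
                layers.append(nn.Dropout(dropout_p))
    return nn.Sequential(*layers)

# 1,2,2,1 at dim=768: 768 -> 1536 -> 1536 -> 768, with ReLU/LayerNorm/Dropout
# between the hidden layers
print(build_module(768, (1, 2, 2, 1)))
```

With `linear` activation and normalization/dropout turned off this collapses back to a plain stack of Linear layers, which is why the previous behavior is still reachable.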
Yeah, can people come up with default training settings that actually work for the hypernetwork the user creates, with the default creation settings tuned for either speed or ultimate quality? I've wasted the past week trying to find some way to actually train my model.
As suggested above, I tried training at 1e-3 (starting rate) with relu, normalization, and dropout enabled. The hypernetwork progress images show change, albeit less quickly than with linear activation, but they suddenly start deviating wildly after 3000 or so steps. The loss did not change dramatically around this point; it hovered around 0.07-0.11.
I'm working on 1,2,1 swish+norm+dropout. You do get progress if you start out at a really high rate, and you do indeed need to lower the rate with time.
My steps are around 40k now, and I've been in this frustrating situation for a while where I don't know if I'm training too quickly or too slowly. Without normalization it's really obvious when you're training too quickly, but with normalization, not so much. When I see weirdness, I don't know if I need to "power through it" with a faster rate, or whether my rate is causing the weirdness and I need to lower it to give it time to settle. Or whether it can't get better at all and I'd need more layers for that (though some people say the best results come from just one layer!). And each experiment with raising or lowering the rate takes hours, and the answer is rarely clear.
As it stands, my textures are beautiful. I adore them and would stop right here. But the geometry is really messed up.
In case it's useful for anyone, here's what my hypernetwork_loss file looks like. You'll see that at present I'm trying an "up the learning rate to power through weirdness" approach. But it might be a mistake. Oh, and to top it all off, I don't know whether it's still necessary to fully restart SD between training runs, or whether interrupt + resume is now okay - and I've been doing both :Þ
1,1,1,0.0000000,5e-05
501,9,5,0.1091951,5e-05
1001,17,9,0.1137487,5e-05
1501,25,13,0.1046003,5e-05
2001,33,17,0.1166555,5e-05
2501,41,21,0.1033135,5e-05
3001,49,25,0.1013494,5e-05
3501,57,29,0.1045805,5e-05
4001,65,33,0.0944537,5e-05
4501,73,37,0.0924331,5e-05
5001,81,41,0.0915412,5e-05
5501,89,45,0.1162111,5e-05
6001,97,49,0.1065957,5e-05
6501,105,53,0.0985079,5e-05
6501,105,53,0.0000000,0.0002
7001,113,57,0.1087367,0.0002
7501,121,61,0.0989701,0.0006
8001,130,3,0.1024918,0.0006
8501,138,7,0.0953659,0.0006
9001,146,11,0.1059795,0.0006
9501,154,15,0.1139467,0.0006
10001,162,19,0.1174795,0.0006
8501,138,7,0.0000000,0.0001
9001,146,11,0.1077805,0.0001
9501,154,15,0.1119469,0.0001
10001,162,19,0.1025546,0.0001
10501,170,23,0.1146523,0.0001
11001,178,27,0.1015814,0.0001
11501,186,31,0.0992250,0.0001
10501,170,23,0.0000000,2e-05
11001,178,27,0.1062064,2e-05
11501,186,31,0.1106089,2e-05
12001,194,35,0.1007186,6e-05
12501,202,39,0.1027095,6e-05
13001,210,43,0.0961622,6e-05
13501,218,47,0.0994174,6e-05
14001,226,51,0.0928437,6e-05
14501,234,55,0.1020120,6e-05
15001,242,59,0.1103033,6e-05
15501,251,1,0.1049142,6e-05
16001,259,5,0.1093346,6e-05
16501,267,9,0.1002746,6e-05
17001,275,13,0.1124498,6e-05
17501,283,17,0.1002279,6e-05
18001,291,21,0.0978556,6e-05
18501,299,25,0.1011246,6e-05
19001,307,29,0.0907604,6e-05
19501,315,33,0.0893499,6e-05
20001,323,37,0.0879084,6e-05
20501,331,41,0.1117262,6e-05
21001,339,45,0.1020423,6e-05
21501,347,49,0.0944305,6e-05
22001,355,53,0.1007732,0.0001
22501,363,57,0.0989204,0.0001
23001,371,61,0.1113758,0.0001
23501,380,3,0.1002904,0.0001
24001,388,7,0.1124042,0.0001
24501,396,11,0.1111237,0.0001
25001,404,15,0.0951370,0.0001
25501,412,19,0.0981784,0.0001
21501,347,49,0.0000000,1e-05
22001,355,53,0.1038509,1e-05
22501,363,57,0.1081542,1e-05
23001,371,61,0.0990229,1e-05
23501,380,3,0.1112651,1e-05
24001,388,7,0.0990889,1e-05
24501,396,11,0.0967618,1e-05
25001,404,15,0.0999852,1e-05
25501,412,19,0.0896725,1e-05
26001,420,23,0.0883653,1e-05
26501,428,27,0.0868568,1e-05
27001,436,31,0.1102840,1e-05
27501,444,35,0.1007227,1e-05
28001,452,39,0.0931008,1e-05
28501,460,43,0.1076010,1e-05
29001,468,47,0.1001878,1e-05
29501,476,51,0.0984166,1e-05
30001,484,55,0.1109740,1e-05
30501,492,59,0.0970760,2e-05
31001,501,1,0.1079077,2e-05
31501,509,5,0.1080621,2e-05
32001,517,9,0.1004227,2e-05
32501,525,13,0.1031502,2e-05
33001,533,17,0.0993214,2e-05
32001,517,9,0.0000000,8e-06
32501,525,13,0.1025606,8e-06
33001,533,17,0.1067520,8e-06
33501,541,21,0.0978547,8e-06
34001,549,25,0.1101596,8e-06
34501,557,29,0.0982115,8e-06
35001,565,33,0.0958937,8e-06
35501,573,37,0.0990914,8e-06
36001,581,41,0.0887798,8e-06
36501,589,45,0.0877198,8e-06
37001,597,49,0.0861023,8e-06
37501,605,53,0.1270405,3e-05
38001,613,57,0.0990799,3e-05
38501,621,61,0.1096765,3e-05
39001,630,3,0.1070177,3e-05
39501,638,7,0.1071489,3e-05
40001,646,11,0.0995700,3e-05
40501,654,15,0.1044578,5e-05
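If anyone wants to actually look at a file like this rather than squint at raw numbers, here's the throwaway script I use to plot it. It assumes the first column is the step, the fourth is the loss, and the last is the learning rate (which matches the file above); note that the step counter jumps backwards where I resumed from an earlier checkpoint, so the raw line doubles back in places:

```python
import csv
import matplotlib.pyplot as plt

steps, losses = [], []
with open("hypernetwork_loss.csv", newline="") as f:
    for row in csv.reader(f):
        if not row or not row[0].strip().isdigit():
            continue                    # skip a header line or blank rows
        step, loss = int(row[0]), float(row[3])
        if loss == 0.0:                 # the 0.0000000 rows are where I resumed
            continue
        steps.append(step)
        losses.append(loss)

# trailing moving average so the noisy per-interval losses show a trend
window = 10
smoothed = []
for i in range(len(losses)):
    chunk = losses[max(0, i - window + 1):i + 1]
    smoothed.append(sum(chunk) / len(chunk))

plt.plot(steps, losses, alpha=0.3, label="raw loss")
plt.plot(steps, smoothed, label=f"moving average (window={window})")
plt.xlabel("step")
plt.ylabel("loss")
plt.legend()
plt.show()
```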
I might be experiencing the same thing. I managed to create a hypernetwork fine on 20/10; that one worked and burned out, so I knew when to lower the rate, and it does what it was intended to do. But yesterday and today I've trained for 60,000 steps at different learning rates, tried a lot, and it's like nothing much is happening. The preview images I get every 100 steps show no signs of starting to match the dataset.
So I suspect I'm getting into overtraining, and dropout isn't solving the problem. That is to say, it's getting so good at matching the training images that it becomes bad at anything except the training set. I can't guarantee this, but it's my suspicion at this point (either that, or the net is just too small to hold enough detail and I need a bigger network).
Right now I'm working on dramatically expanding my dataset to see what impact that has. Of course, that's a lot of work, especially the labeling, but oh well.
I'm going to close this because it seems to be related to an issue with hypernetwork training not resuming properly. I believe that's already been reported, so there's no point keeping a duplicate, confusing issue open.