Disable unavailable or duplicate options for Activation functions
Task list
Pending tasks
- [x] Remove last layer activation and fix dropout silently https://github.com/AUTOMATIC1111/stable-diffusion-webui/pull/3698
- [x] Re-add the Linear option that disappeared by mistake. https://github.com/AUTOMATIC1111/stable-diffusion-webui/pull/3717
Done tasks
- [x] Disable unavailable or duplicate options. https://github.com/AUTOMATIC1111/stable-diffusion-webui/commit/462e6ba6675bd14c0f82e465423a0eedfff82372 MultiHeadAttention is not a standard activation function, and Swish is a duplicate key of Hardswish. But to keep supporting already-generated HNs, the dict itself won't be mutated; the entries are only hidden as choices (sketch below).
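A minimal sketch of that approach, assuming a plain activation dict; the dict name, keys, and the hidden-key set are illustrative, not the exact identifiers in the webui code:

```python
import torch.nn as nn

# Illustrative activation registry; not the exact dict from the webui source.
activation_dict = {
    "linear": nn.Identity,
    "relu": nn.ReLU,
    "mish": nn.Mish,
    "swish": nn.Hardswish,       # duplicate of "hardswish", kept so old HNs still load
    "hardswish": nn.Hardswish,
    "multiheadattention": nn.MultiheadAttention,  # not a plain activation; hide from the UI
}

# Keys that should not be offered as choices when creating a new hypernetwork.
_hidden = {"swish", "multiheadattention"}

# The dropdown is built from a filtered view; activation_dict itself is never mutated,
# so hypernetworks generated before the change can still resolve their keys.
ui_activation_choices = [k for k in activation_dict if k not in _hidden]
```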
Future jobs:
- [ ] Fix the Hypernetwork multiplier value while training. As far as I can tell from the code, the hypernetwork multiplier can be changed while training.
- [ ] Save and load the optimizer state dict. People complained about the optimizer not resuming properly; that is because we don't save the optimizer state dict (see the sketch after this list).
- [ ] Generalized way to save / load optimizers. This is for generalizing the optimizer resuming process. It does not necessarily mean more optimizer options will be offered immediately.
- [ ] Also offer an option to nuke the optimizer state dict. Sometimes you want to discard the optimizer state dict, which will very likely change the training direction.
- [ ] Add an option to specify standard deviation + scale multiplier for initialization + nonzero bias initialization. Related: https://github.com/AUTOMATIC1111/stable-diffusion-webui/issues/2740 Analyzed data: Colab. In short, Xavier and Kaiming initialization have too large a standard deviation compared to normal initialization. Rather than using magic numbers, the std should be parameterized; we can use Xavier as usual if we scale it (this is called gain in the PyTorch parameters). See the initialization sketch after this list.
- [ ] Add an option to fix weight initialization seeds. This is for reproducing results (also covered in the initialization sketch below).
- [ ] Add an option to specify the dropout structure. A few examples have shown that a 1, 2, 2[Dropout], 1 structure is promising. These are actually bug-generated networks, and the same structure cannot be built once the fix lands. Instead of removing the functionality entirely, we need to offer a detailed way to specify dropouts. Example: [0, 0.1, 0.15, 0] applies dropout at the second and third layers. The sequence should follow the layer structure, and the first and last values should always be 0 (see the sketch after this list).
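For the optimizer save/load and nuke items above, a minimal sketch of the intent, assuming a plain PyTorch optimizer; the function names and checkpoint layout are placeholders, not webui internals:

```python
import torch

def save_checkpoint(hypernetwork_state, optimizer, path):
    # Save the optimizer state dict next to the hypernetwork weights so that
    # Adam's moment estimates survive a restart.
    torch.save({
        "hypernetwork": hypernetwork_state,
        "optimizer": optimizer.state_dict(),
    }, path)

def load_checkpoint(optimizer, path, nuke_optimizer=False):
    checkpoint = torch.load(path, map_location="cpu")
    if not nuke_optimizer and "optimizer" in checkpoint:
        # Resuming with the saved state is what makes training continue properly.
        optimizer.load_state_dict(checkpoint["optimizer"])
    # With nuke_optimizer=True the saved state is discarded, which will very
    # likely change the training direction, as noted above.
    return checkpoint["hypernetwork"]
```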
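For the initialization and seed items, a sketch using standard torch.nn.init calls; the gain, std, and seed values are illustrative examples, not recommended settings:

```python
import torch
import torch.nn as nn

def init_linear(layer, seed=None, gain=0.5, bias_std=0.0):
    # Optional fixed seed so the initialization (and thus the run) is reproducible.
    if seed is not None:
        torch.manual_seed(seed)
    # Xavier normal, but scaled via `gain` (PyTorch's name for the multiplier)
    # instead of hard-coding a magic standard deviation.
    nn.init.xavier_normal_(layer.weight, gain=gain)
    if bias_std > 0:
        nn.init.normal_(layer.bias, mean=0.0, std=bias_std)  # optional nonzero bias init
    else:
        nn.init.zeros_(layer.bias)

layer = nn.Linear(768, 1536)
init_linear(layer, seed=1234, gain=0.5)
```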
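For the dropout-structure item, a sketch of how a per-layer probability list such as [0, 0.1, 0.15, 0] could be turned into modules; the builder name, base dimension, and activation default are assumptions for illustration:

```python
import torch.nn as nn

def build_layers(layer_structure, dropout_structure, dim=768, activation=nn.ReLU):
    # layer_structure is e.g. [1, 2, 2, 1] (width multipliers of the base dim);
    # dropout_structure is e.g. [0, 0.1, 0.15, 0] and must match it in length.
    assert len(layer_structure) == len(dropout_structure)
    assert dropout_structure[0] == 0 and dropout_structure[-1] == 0, \
        "first and last values should always be 0"
    layers = []
    for i in range(len(layer_structure) - 1):
        layers.append(nn.Linear(int(dim * layer_structure[i]),
                                int(dim * layer_structure[i + 1])))
        if i < len(layer_structure) - 2:  # no activation or dropout after the output layer
            layers.append(activation())
            if dropout_structure[i + 1] > 0:
                layers.append(nn.Dropout(p=dropout_structure[i + 1]))
    return nn.Sequential(*layers)

# [0, 0.1, 0.15, 0] applies dropout after the second and third layers only.
model = build_layers([1, 2, 2, 1], [0, 0.1, 0.15, 0])
```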
Optional
- [ ] Quick-start in page / Offering references of previously trained HNs
- [ ] Emphasize the importance of dataset quality
- [ ] Grouping activations by type
- [ ] Generalized ways to evaluate HNs properly
- [ ] Hyperparameter tuning pipeline
- [ ] Add ways to use multiple hypernetworks sequentially or in parallel
This is all great, but as of yesterday, tons of people (myself included) have been complaining about not being able to get any training to work, even on settings that used to work, and people who rolled back to earlier versions found that it did work - so shouldn't we be trying to fix whatever broke first before changing even more?
@enn-nafnlaus Can you specify which setups were broken? At least I can see that Linear is broken due to my mistake (it's urgent, sorry; there's an existing PR mentioned above), but the others were working: someone uploaded Mish-based HNs, and some existing examples suggest the new weight inits are not totally garbage, just too aggressive and slow. Still checking the effect of weight normalizations with Mish.
@aria1th Start reading the comments in this thread over the past 24 hours:
https://github.com/AUTOMATIC1111/stable-diffusion-webui/discussions/2670#discussioncomment-3969734
Since there are results from the Mish activation and Xavier normal initialization, I conclude that no functions are missing except Linear.
There are far more results proving that Swish at least is working, and there are some possibilities for Tanh too.
I also very much agree. The hypernetworks I trained after 10/17 still show some progress, but none of them are as good as before. I have tried Swish, ReLU, ELU, and Mish, and they all work very poorly for me. I changed at least 8 datasets, toggled dropout and the layer structure, and tried many learning rates, including a high rate of 1e-3:1000 and a low rate of 5e-6:2000, and still can't get results as excellent as before. Today I tried a 1, 1.5, 1.5, 1 structure at 8e-6 across 5 datasets and it's still bad. I'm not complaining, just wondering which code change led to this result. (Even linear 1, 2, 1 without LN and dropout enabled can't reproduce the previous results.)
@gzalwa201 The only change I caught related to linear is a weird 0.005 appearing from nowhere, which should be fixed by restoring the Linear option. And note that even with the same structure and the same LR, training might fail or succeed due to random weight initialization. (I'm planning to implement seed fixing there.)
It's commonly misunderstood that linear will give a result regardless. Actually no; I observe failures and successes just as with the other activation functions. Also, the appropriate learning rate is different in every case; there is no rule of thumb for it, because sometimes you can even get results from 0.005 and sometimes you don't...
Assume that all the guides could be missing something; someone even did 150k epochs of training and got an astonishing result, which means there is still no settled way to do this efficiently.