stable-diffusion-webui
Remove activation from final layer of Hypernetworks
From what I understand, a Hypernetwork learns how to nudge the context in Cross Attention mechanisms.
# modules/hypernetworks/hypernetwork.py#L89
def forward(self, x):
    return x + self.linear(x) * self.multiplier
I believe it'd make more sense if the output layer is not activated.
For example, if ReLU is chosen as the activation function, then the output from self.linear(x)
would only contain non-negative values, which is unnecessarily restrictive.
Having said that, I haven't compared the results with or without final activation, so I'd appreciate it if someone could test it out.
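To make the point concrete, here is a minimal, self-contained sketch (not the actual webui class; the dimensions, hidden multiplier and `TinyHypernetModule` name are assumptions for illustration) showing how a trailing ReLU restricts the learned residual nudge to non-negative values:

```python
# Minimal sketch of the issue, NOT the real HypernetworkModule from webui.
# Dimensions, hidden multiplier and class name are illustrative assumptions.
import torch
import torch.nn as nn

class TinyHypernetModule(nn.Module):
    def __init__(self, dim=768, mult=2.0, activate_output=False):
        super().__init__()
        hidden = int(dim * mult)
        layers = [nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim)]
        if activate_output:
            # A trailing ReLU clamps the learned delta to >= 0, so the module
            # can only push context values up, never down.
            layers.append(nn.ReLU())
        self.linear = nn.Sequential(*layers)
        self.multiplier = 1.0

    def forward(self, x):
        # Same residual form as hypernetwork.py#L89: nudge x by a learned delta.
        return x + self.linear(x) * self.multiplier

x = torch.randn(4, 77, 768)
delta_activated = TinyHypernetModule(activate_output=True).linear(x)
delta_linear_out = TinyHypernetModule(activate_output=False).linear(x)
print(delta_activated.min() >= 0)   # tensor(True): non-negative nudges only
print(delta_linear_out.min() < 0)   # tensor(True): nudges can go either way
```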
I had more success with a linear network than with an activated one, but that could be just my own test.
This would change existing trained hypernets that use activation functions, so it definitely can't be merged as it is.
Fixed. Also rolled back the dropout off-by-one fix, because it messes up the layer names of old nets.
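For context on why that rollback was needed, a small sketch (illustrative layer sizes, not the webui ones) of how inserting a Dropout into an `nn.Sequential` shifts the parameter names that old hypernetwork files were saved under:

```python
# Illustrative sketch: inserting a Dropout shifts nn.Sequential indices,
# so the parameter names in old hypernetwork files no longer match.
import torch.nn as nn

old = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 8))
new = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Dropout(0.3), nn.Linear(16, 8))

print(list(old.state_dict().keys()))  # ['0.weight', '0.bias', '2.weight', '2.bias']
print(list(new.state_dict().keys()))  # ['0.weight', '0.bias', '3.weight', '3.bias']
# A file saved against the old layout has no '3.weight' entry,
# so loading it into the new layout fails or mismatches.
```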
Well, code looks good, so let's wait for someone to test it out.
It's working well; I'm getting proper results with 2500-epoch tests. And yes, last-layer activation, especially ReLU, was blocking it - but sometimes it worked, which means it's doing some kind of emphasizing?
Another thing: I noticed Dropout was not being applied properly in the code (especially for a 1, 2, 1 layer structure)... I'll make a separate PR to add a detailed option for it.
Functionally it works for resuming old hypernetworks and training new ones. Completely subjectively, it seems to hit a likeness faster than the last run with this dataset, but y'know, luck of the draw.
Added missing dropout layer fix by @aria1th
This pull seems to be languishing.
This change introduces a bug in training hypernetworks - see attached model data pulled from Torch.
| key | hn_softreal | hn_softreal_prototype |
| --- | --- | --- |
| step | 25000 | 25000 |
| name | hn_softreal | hn_softreal_prototype |
| layer_structure | [1.0, 2.0, 1.0] | [1.0, 2.0, 1.0] |
| activation_func | relu | relu |
| is_layer_norm | False | False |
| weight_initialization | Normal | Normal |
| use_dropout | True | True |
| sd_checkpoint | aaaaaaaa | aaaaaaaa |
| sd_checkpoint_name | some_checkpoint | some_checkpoint |
| activate_output | False | (not present) |
| last_layer_dropout | True | (not present) |
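For reference, a sketch of how metadata like the above can be pulled out of a hypernetwork `.pt` file, assuming the usual dict layout webui writes (the filename is a placeholder; the keys are the ones visible in the dumps):

```python
# Sketch of inspecting hypernetwork metadata; the filename is a placeholder.
import torch

state = torch.load("hn_softreal.pt", map_location="cpu")
for key in ("step", "name", "layer_structure", "activation_func",
            "is_layer_norm", "weight_initialization", "use_dropout",
            "sd_checkpoint", "sd_checkpoint_name",
            "activate_output", "last_layer_dropout"):
    # Files written before this PR simply lack the last two keys.
    print(key, state.get(key, "<missing>"))
```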
I've been trying to figure out why training hypernetworks for style transfer has had zero effect and has been unreliable since this PR was merged. The prototype model, trained before this PR, had a failure rate of about 1 in 10 images with bad quality. The finalised model has roughly a 9 in 10 chance of absolutely ruining the image despite identical settings.
Both networks were trained on the same dataset with the same learning rate (5e-6) and an identical number of steps.
Easy-ish fix:
Restore classic hypernetwork training alongside the new training method, i.e. allow for total dropout across the model rather than last-layer activation. (Why was this hardcoded and not exposed as a setting in the Training tab?)
I've spent the last week trying to debug why training hypernetworks, even with linear activation, has been completely faulty and a waste of time.
Hmm, first negative report after the change. How many HNs have you trained? Are you sure it's not just an unlucky run?
This PR actually makes 3 changes:
- Remove output activation
- Fix missing last dropout layer
- Change bias of Normal initialization from std=0.005 to std=0
You mentioned linear HNs being broken. The only possible causes of that are 2 and 3, as linear nets don't have activation layers anyway. 2 is an objective bug and should be kept in. I think 3 was made because @aria1th had better results with it?
You can try locally reverting the changes by editing the code here and here. We need more tests to pinpoint the cause of the reported bug, ideally more than 1 run for each of the 8 combinations.
I didn't make checkboxes for the changes because I thought these changes shouldn't negatively impact the HNs and less clutter is better. If evidence does show that there's negative impact, I'll make a new PR with options.
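For anyone reverting change 3 locally, a rough sketch of what it amounts to (the weight std shown is illustrative, not the exact value webui uses; only the bias handling is the point):

```python
# Rough sketch of change 3: Normal initialization, bias std 0.005 -> 0.
# The weight std here is illustrative; only the bias handling is the point.
import torch.nn as nn

layer = nn.Linear(768, 1536)
nn.init.normal_(layer.weight, mean=0.0, std=0.01)

# roughly the pre-PR behaviour: small random bias
# nn.init.normal_(layer.bias, mean=0.0, std=0.005)

# post-PR behaviour: bias effectively zero (std=0)
nn.init.zeros_(layer.bias)
```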
I noticed my previously trained style hypernetwork doesn't mush faces or finer details, since it's from 03/11/22.
The easiest way to verify behaviour in current-day webui vs pre-PR webui would be to train a hypernetwork on the previous version, where I've found it to work, then render a specific prompt with a fixed seed. Then use the same trained hypernetwork, seed, checkpoint and prompt in the current version of webui and apply difference filtering to find any lurking changes. A fully black difference image means the render is 1:1 with the original.
This should be the fastest way to verify that older networks are indeed behaving properly.
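A quick sketch of that difference check using Pillow (the filenames are placeholders for the pre-PR and post-PR renders of the same seed, prompt and checkpoint):

```python
# Sketch of the "difference filtering" check; filenames are placeholders.
from PIL import Image, ImageChops

before = Image.open("render_pre_pr.png").convert("RGB")
after = Image.open("render_post_pr.png").convert("RGB")

diff = ImageChops.difference(before, after)
# getbbox() is None only when the difference image is entirely black,
# i.e. the two renders are pixel-identical.
print("identical" if diff.getbbox() is None else "differs")
```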
Just checked it. I created a HN with ReLU on 2cf3d2a and ran it with the same txt2img settings on 2cf3d2a and ac08562. The results are identical, as this PR shouldn't affect previously trained HNs.
You should probably find a way to reliably reproduce the described bug and open an issue.
I'll have to check whether having xformers enabled has any influence on training. I'll test with the current version of webui, then test again to see if it reaches the 1-in-9 failure rate the original did - meaning that for every eight images where the style transferred successfully, one looked very wrong.
If it isn't xformers, then it's something to do with the changes to dropout and last-layer output not playing well with nonlinear activations such as SiLU or ReLU.
Update: about 90 minutes in (RTX 2080S), at iteration 14000, xformers is not to blame during training.
A common thread I've observed with both -1-seed prompts and static-seed prompts (used to test hypernetwork performance with known variables) is that, with the changes to activation and layer dropout, the hypernetwork begins hyper-focusing on one specific part of the style it's trying to transfer rather than the whole of it.
In this case, it's been overfitting stripes onto objects that naturally don't contain them, such as hair, and even removing specified parts of the prompt (clothing, jewellery, etc.) even with 1.4x levels of emphasis. This did not occur with the previous method of training. While it appears to be overfitting with the new training algorithm, it isn't deep-frying or causing neural-net death earlier than usual.
Update after training completed: with random seeds, the model still shows signs of distortion and other errors such as misshapen objects, deformed perspective and other abnormalities that do not appear after unloading the test hypernetwork. I'd say it's cooked - but that's (logically) wrong because it isn't deep-frying. So somewhere during training, dropout isn't functioning correctly: it seems to over-activate on certain features without prompting, and disabling the output-layer activation also seems to contribute. It's like the checkpoint model and the hypernetwork don't mesh.
Will continue updating this comment as a way of collating thoughts before writing a proper bug report about the changes to Hypernetwork training.