
Equalized learning rate doesn't seem to be implemented

Open bob80333 opened this issue 4 years ago • 19 comments

It's mentioned briefly in Appendix B of the StyleGAN2 paper, and it is described in detail in Section 4.1 of Progressive Growing of GANs.

It's applied to both conv and linear layers; I found another StyleGAN2 implementation doing this here.

This might help explain part of the FID gap between this implementation and the official one. I'd be happy to do a test run on FFHQ thumbnails to see if it helps.

I have some test results comparing the latest version of this repo against the official code on FID, which is what prompted this issue. The training data was FFHQ thumbnails (128x128). I trained the official StyleGAN2 for 1800 kimg (~56k iterations at batch size 32) and got a final FID of ~11.8. I trained this repo for 50k iterations and got a final FID of ~43.0 (measured with torch-fidelity).
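For reference, FID numbers like these can be computed with torch-fidelity along the following lines (a sketch; the directory paths are placeholders, not the ones actually used in these runs):

import torch_fidelity

# Hypothetical paths; point these at a folder of generated samples and a folder of real 128x128 thumbnails.
metrics = torch_fidelity.calculate_metrics(
    input1='generated_samples/',
    input2='ffhq_thumbnails/',
    cuda=True,
    fid=True,
)
print(metrics['frechet_inception_distance'])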

bob80333 avatar Aug 16 '20 01:08 bob80333

@bob80333 I tried it, but it didn't really make much of a difference. However, I did notice I omitted a normalization at the very beginning of the mapping network, and that did seem to affect training. I've added it to the latest version. Do you want to try it again?

lucidrains avatar Aug 16 '20 18:08 lucidrains

Started training on the FFHQ thumbnails for 50k steps with the latest version (0.18.6); it should be done sometime tonight.

bob80333 avatar Aug 16 '20 19:08 bob80333

Unfortunately the FID is worse: a final FID of ~54.0, so it looks like this was a negative change.

bob80333 avatar Aug 17 '20 01:08 bob80333

how big of a batch size are you doing?

lucidrains avatar Aug 17 '20 04:08 lucidrains

Batch size 32 (batch 16 with gradient accumulation = 2). I quickly checked in Colab, and it seems that PyTorch's functional normalize produces different results than the pixelwise norm used in StyleGAN2; I got an average absolute error of ~0.76 on a random vector with shape (1, 512).

Colab

bob80333 avatar Aug 17 '20 14:08 bob80333

https://github.com/NVlabs/stylegan2/blob/master/training/networks_stylegan2.py#L285 I get the same value when I try it:

F.normalize(x, dim=-1)                                          # 1: L2-normalize along the last dim
x * torch.rsqrt((x ** 2).sum(dim=-1, keepdim=True) + 1e-8)      # 2: explicit normalization via rsqrt of the summed squares
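A minimal check of this claim (an illustrative snippet, not code from either repo):

import torch
import torch.nn.functional as F

x = torch.randn(1, 512)
a = F.normalize(x, dim=-1)                                       # 1
b = x * torch.rsqrt((x ** 2).sum(dim=-1, keepdim=True) + 1e-8)   # 2
print((a - b).abs().max())  # ~1e-7: the two expressions agree when the squares are summed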

lucidrains avatar Aug 17 '20 18:08 lucidrains

@bob80333 I'm not too sure what is going on. I know people in the past have trained with some great results on FFHQ, but perhaps I've tacked on so many additions that I introduced a bug somewhere. It would probably be helpful to rule out the equalized learning rate as the cause, so I've provided an interface for you to experiment with:

https://github.com/lucidrains/stylegan2-pytorch/releases/tag/0.18.7

just use

$ stylegan2_pytorch --data ./data --lr-mul 0.01

to replicate what the official repo has. I have a run with 0.1 ongoing, to see if I overlooked something important.

Thanks 'Bob'!

lucidrains avatar Aug 17 '20 19:08 lucidrains

@bob80333 Hey Bob, I retried it at 0.1 and it does seem to be doing something. I'll leave it to you to explore that hyperparameter; please let me know if you find the ideal value, or if anything other than 1 yields an improvement. Please upgrade to 0.19.1. I apologize for dismissing your earlier issue about the equalized learning rate!

lucidrains avatar Aug 18 '20 01:08 lucidrains

That's great to hear.

However, I do want to clear up some things that I think are getting confused together. The equalized learning rate from Progressive Growing of GANs is about the initialization and runtime scaling of the weights, which is different from my previous issue about scaling the learning rate of the StyleVectorizer.

It looks to me like the changes you are describing are the learning-rate scaling of the StyleVectorizer/mapping network. From what I can see, your earlier commit added part of the runtime weight scaling as well as the lr reduction for the StyleVectorizer. I think the runtime scaling/init is probably the more important part, which may be why the run that included it showed an improvement you didn't see from scaling the StyleVectorizer lr alone.

The equalized learning rate involves:

  1. Initializing all weights (linear and conv) from a plain normal distribution, with no fancy init.
  2. Scaling all weights at runtime by the per-layer normalization constant from Kaiming He initialization.

Here's the quote from the Progressive Growing of GANs paper:

4.1 EQUALIZED LEARNING RATE

We deviate from the current trend of careful weight initialization, and instead use a trivial N(0, 1) initialization and then explicitly scale the weights at runtime. To be precise, we set ŵ_i = w_i / c, where w_i are the weights and c is the per-layer normalization constant from He's initializer (He et al., 2015). The benefit of doing this dynamically instead of during initialization is somewhat subtle, and relates to the scale-invariance in commonly used adaptive stochastic gradient descent methods such as RMSProp (Tieleman & Hinton, 2012) and Adam (Kingma & Ba, 2015). These methods normalize a gradient update by its estimated standard deviation, thus making the update independent of the scale of the parameter. As a result, if some parameters have a larger dynamic range than others, they will take longer to adjust. This is a scenario modern initializers cause, and thus it is possible that a learning rate is both too large and too small at the same time. Our approach ensures that the dynamic range, and thus the learning speed, is the same for all weights. A similar reasoning was independently used by van Laarhoven (2017).

This is from before StyleGAN1, and is applied to all layers as part of their normalization in both the generator and the discriminator.

It is described in section 4 of this paper: link

Hopefully this makes the difference more obvious. I wanted to see if this equalized learning rate made a difference, not whether lowering the lr for the StyleVectorizer was effective.
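For concreteness, here is a minimal sketch of steps 1 and 2 for a linear layer (illustrative only, not the code of either repository; lr_mul stands in for the extra mapping-network learning-rate multiplier discussed above, and conv layers work the same way with fan_in = in_channels * kernel_h * kernel_w):

import math
import torch
from torch import nn
import torch.nn.functional as F

class EqualLinear(nn.Module):
    # Linear layer with equalized learning rate: plain N(0, 1) init,
    # with the He constant applied at runtime instead of at init time.
    def __init__(self, in_dim, out_dim, lr_mul=1.0):
        super().__init__()
        # step 1: trivial N(0, 1) init (divided by lr_mul so the runtime multiplier restores it)
        self.weight = nn.Parameter(torch.randn(out_dim, in_dim) / lr_mul)
        self.bias = nn.Parameter(torch.zeros(out_dim))
        # step 2: per-layer He constant, folded in at runtime
        self.scale = (1.0 / math.sqrt(in_dim)) * lr_mul
        self.lr_mul = lr_mul

    def forward(self, x):
        return F.linear(x, self.weight * self.scale, self.bias * self.lr_mul)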

bob80333 avatar Aug 18 '20 01:08 bob80333

@bob80333 Ohh gotcha, thank you for clearing that up. Indeed, what seems to have made a difference is scaling the learning rate of the mapping network, even though it wasn't heavily stressed in the paper. Could you try a run on your end? I'm seeing better results at --lr-mul 0.1

lucidrains avatar Aug 18 '20 03:08 lucidrains

Starting a training run on version 0.19.1 with --lr-mlp 0.01, since that's the value used in the paper. After this, I'd be interested in testing the equalized learning rate (normal distribution init + Kaiming layer-constant scaling of the weights at runtime).

bob80333 avatar Aug 18 '20 03:08 bob80333

@bob80333 Sounds good! I'm open to exploring initialization if the mapping network learning rate change doesn't fix things.

lucidrains avatar Aug 18 '20 04:08 lucidrains

Complete mode collapse: every image looks nearly identical only 10k steps in.

EMA sample from 10k steps: (image)

Also, about the pixel-wise normalization from earlier: the original TensorFlow code doesn't sum the squared values, it takes their mean.

x * torch.rsqrt((x ** 2).sum(dim=-1, keepdim=True) + 1e-8)   # 2, current

should be

x * torch.rsqrt((x ** 2).mean(dim=-1, keepdim=True) + 1e-8)  # 2, as in the TF code
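In other words (an illustrative check, not code from either repo), the mean-based pixel norm is just F.normalize scaled by a constant sqrt(dim), which is consistent with the ~0.76 average error reported earlier:

import math
import torch
import torch.nn.functional as F

x = torch.randn(1, 512)
pixel_norm = x * torch.rsqrt((x ** 2).mean(dim=-1, keepdim=True) + 1e-8)
scaled = F.normalize(x, dim=-1) * math.sqrt(x.shape[-1])
print((pixel_norm - scaled).abs().max())  # ~0: equal up to the epsilon terms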

bob80333 avatar Aug 18 '20 05:08 bob80333

@bob80333 oh my, yes, you are right about it being a mean and not a sum, making the change now

lucidrains avatar Aug 18 '20 05:08 lucidrains

https://github.com/lucidrains/stylegan2-pytorch/releases/tag/0.19.2

lucidrains avatar Aug 18 '20 05:08 lucidrains

@bob80333 I think fixing the normalization means that my recommended --lr-mul is now some other value. Let me do another experiment and get back to you; it did make a huge difference.

I think you should probably try updating and running it again at --lr-mul 0.01

lucidrains avatar Aug 18 '20 05:08 lucidrains

@bob80333 So I decided to revert the change for now, because I think I accidentally stumbled into a winning combination of hyperparameters. Could you upgrade to 0.19.3 and run it as is? I've defaulted --lr-mul to 0.1, which I've gotten really good results with. If you still cannot reproduce it, I'm OK with continuing to try to get the little details to match the official repo exactly.

lucidrains avatar Aug 18 '20 07:08 lucidrains

@lucidrains Run is complete. FID is slightly worse.

My 50k-step run (batch size 32, --lr-mlp 0.1) on v0.19.3 ended up with an FID score just above v0.18.5's:

v0.19.3: FID 43.7

v0.18.5: FID 43.0

This is probably within run-to-run variance.

bob80333 avatar Aug 19 '20 02:08 bob80333

I have created a StyleGAN2 implementation inspired by this repository. In my experience, equalized learning rate is incredibly important for high-resolution training (256 and above).

Erroler avatar May 23 '21 10:05 Erroler