
Attention Layer

jwb95 opened this issue 3 years ago · 2 comments

Hey l4rz, thanks for the extensive research on this topic.

Have you considered adding attention layer(s) instead of increasing the capacity to achieve higher quality? E.g. lucidrains (https://github.com/lucidrains/stylegan2-pytorch) claims this greatly improves results.

However, adding attention to the XXL model will probably yield OOMs. It would be interesting to see which benefits results more: more convolutional filters or attention. We also know that for harder tasks (more classes and poses), e.g. ImageNet, StyleGANs fail at modeling larger structures, while BigGAN, which employs attention at the 64^2 scale (correct me if I'm wrong), does a far better job here.

I would just like to know what your experiences and thoughts are here.

jwb95 · May 18 '21

Thank you!

I've experimented with adding SAGAN-style self-attention layers to SG2 at the lower-resolution blocks (16×16 to 64×64) of the D and G networks. While training on a concrete panel buildings dataset, I was able to get somewhat better results even with lower-capacity networks (https://twitter.com/l4rz/status/1343690951240392704) on a small (N=2K) dataset. Inter alia, the self-attention blocks seem to help SG2 draw straight lines.
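For reference, a minimal sketch of the kind of SAGAN-style block I mean, written against plain PyTorch (names and details are illustrative, not my actual training code):

```python
import torch
import torch.nn as nn

class SelfAttention2d(nn.Module):
    """SAGAN-style self-attention over spatial positions (illustrative sketch)."""
    def __init__(self, channels):
        super().__init__()
        self.query = nn.Conv2d(channels, channels // 8, kernel_size=1)
        self.key   = nn.Conv2d(channels, channels // 8, kernel_size=1)
        self.value = nn.Conv2d(channels, channels, kernel_size=1)
        self.gamma = nn.Parameter(torch.zeros(1))  # starts as identity; the net learns how much attn to use

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.query(x).flatten(2)                         # (b, c/8, h*w)
        k = self.key(x).flatten(2)                           # (b, c/8, h*w)
        v = self.value(x).flatten(2)                         # (b, c,   h*w)
        # the attention map is (h*w) x (h*w) -- at 64x64 that's already 4096x4096
        # per sample, which is exactly where the VRAM overhead comes from
        attn = torch.softmax(q.transpose(1, 2) @ k, dim=-1)  # (b, h*w, h*w)
        out = (v @ attn.transpose(1, 2)).view(b, c, h, w)
        return x + self.gamma * out
```

Something like this gets dropped in after the convs of the 16–64px blocks in both G and D.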

Frankly, I don't think it is worth the additional VRAM overhead. I think that the perceived problem of "SG2 struggles with ImageNet" is mainly due to the fact that ImageNet is a small dataset if we consider the number of images per class: ImageNet has ~1M images overall but only ~1K per class, and that's clearly not sufficient for GANs to work (even with bells and whistles like ADA).

The beauty of SG2 is that it gives high-quality results with relatively low compute spent on training (in comparison with BigGAN or VAEs). In addition, its G network is quite powerful once we start to operate in 𝑊+ space, being able to synthesize samples that are far away from the distribution present in the training set.
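By "operating in 𝑊+" I mean optimizing a separate w per synthesis layer against some target. Roughly something like this, assuming the SG2-ADA-PyTorch generator interface (G.mapping / G.synthesis); a real projector also adds a perceptual (LPIPS) term and noise regularization:

```python
import torch
import torch.nn.functional as F

def project_wplus(G, target, steps=500, lr=0.01, device='cuda'):
    """Rough sketch: fit a W+ latent (one w per layer) to a target image.
    `G` is an SG2-ADA-PyTorch generator; `target` is (1, 3, H, W) in [-1, 1]."""
    # start from the average w, replicated over all synthesis layers
    with torch.no_grad():
        z = torch.randn(1000, G.z_dim, device=device)
        w_avg = G.mapping(z, None).mean(dim=0, keepdim=True)  # (1, num_ws, w_dim)
    w_plus = w_avg.clone().requires_grad_(True)
    opt = torch.optim.Adam([w_plus], lr=lr)
    for _ in range(steps):
        img = G.synthesis(w_plus)                             # (1, 3, H, W)
        loss = F.mse_loss(img, target)                        # pixel loss only, for brevity
        opt.zero_grad()
        loss.backward()
        opt.step()
    return w_plus.detach()
```

Since every layer gets its own w, the optimizer can push G well outside the distribution it was trained on.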

In addition to my experiments, Aydao's success with TADNE (D and G are slightly larger than my XXL model; the latent dimensionality is also increased 2x) demonstrates that a scaled-up SG2 is indeed able to reproduce a variety of modes. Exploration of its latent space: https://twitter.com/l4rz/status/1376938909997924355

So I think it would be interesting to try to scale SG2 up even more, also in terms of depth (adding more conv layers), and train it on a really large, complex, multi-modal dataset like Danbooru anime figures. Maybe by wrapping SG2-ADA-PyTorch with DeepSpeed or something.

It could also be interesting if someone could shed some light on what's really going on in these skip connections of the SG2 G.

(tbh i don't think anyone knows what's going on with GANs at all)

l4rz · May 19 '21

Thanks for the insights and great thoughts!

"I think that the perceived problem of 'SG2 struggles with ImageNet' is mainly due to the fact that ImageNet is a small dataset" - great point.

The Aydao results look very impressive. While mid-to-high-frequency details look perfect, geometric failures in the larger structures are still noticeable. So there is still a need for even more capacity and/or more data.

Why would increasing depth be better than simply increasing the number of filters? The receptive fields of all the blocks (=5) should be sufficient, covering the entire image, so simply adding more filters should do the job (?).
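A very rough back-of-envelope for that receptive-field argument (taking 3x3 convs and a 1024^2 output as an example; the exact block count depends on the config):

```python
# How many output pixels (per side) a single 3x3 conv "sees" at each synthesis
# resolution, once its features are upsampled to the final image. The coarse
# blocks effectively cover the whole image already.
output_res = 1024
for res in [4, 8, 16, 32, 64, 128, 256, 512, 1024]:
    scale = output_res // res      # upsampling factor from this block to the output
    coverage = 3 * scale           # 3x3 kernel footprint, in output pixels
    print(f"{res:>4}x{res:<4} block: one 3x3 conv spans ~{coverage} output px per side")
```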

"It could be also interesting if someone could shed some light on what's really going on in these skip connections of SG2 G." I agree. I speculate this helps to somewhat greedily speed up convergence by being able to send learning signal to different resolutions, independently on what is going on in the other blocks. For example, if a detail is correctly rendered by a higher-res-block, but the overall composition (represented in low res-rgb-space) is not fooling D, then the respective G low-res-block gets a penalty. Still this doesn't tell G directly how to match up details with larger structures properly, which is why I guess it's "greedy".

"(tbh i don't think anyone knows what's going on with GANs at all)" - :D Well... I don't know either. How I imagine it is, that D memorizes features of the dataset (on all resolutions), while G learns to reproduce them based on the input-noise, such that some latentpoints are mapped to features arranged in such a way, that they make up a convincing image. And hopefully, this image is not part of the dataset. That's when it worked. However, most latentpoints will still be mapped to features arranged in such a way, that errors in the image occur - that's where the research is at. Since we can see that SG2s G is very expressive and general (possible to produce almost any image by optimising the 𝑊+ space-latent), I speculate that the problem at the core lies in the capacity of the Discriminator and maybe in the mapping network of G. Simply increasing filters as you stated should get us there.

By the way: The new papers about Diffusion Models look very promising. (https://arxiv.org/abs/2105.05233)

jwb95 · May 29 '21