
Questions regarding some design choices

Open hubert0527 opened this issue 3 years ago • 3 comments

Really impressive work and a high-quality code release! I found several intriguing design choices while digging into the codebase, and I'm looking for some clarification on them:

  1. Blur in the encoder
    The encoder architecture seems to partially borrow the StyleGAN2 design, with blur operations in the conv layers (I suppose for anti-aliasing). However, the blur operations also wipe out some of the high-frequency information, which should be crucial for detail reconstruction. Although high-frequency information is later reintroduced via randomized noise injection in the decoder, the result can never be a faithful reconstruction of the input. It seems to me that reconstruction should matter more than anti-aliasing here. Could you clarify this design choice a bit? (A sketch of the operation I mean follows this list.)

  2. Randomized noise in the decoder
    Similar to 1., the randomized noise injected in the decoder (also sketched after this list) carries no information from the input image, so it should negatively affect reconstruction quality. It seems a bit counter-intuitive to me from an image-reconstruction standpoint.

  3. I noticed you slightly tweaked the weight demodulation, which I suppose is for fusing styles along the spatial dimension. Have you ablated the performance differences (e.g., FID or reconstruction quality) caused by the modification? Spatial fusion was difficult under the original weight demodulation design (sketched after this list for reference), which forced us to derive a slower but equivalent fusion function in our InfinityGAN (Figure 20). It would therefore be great news if such a modification to weight demodulation does not degrade generator performance.
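For concreteness, here is a minimal sketch of the blur-before-subsample operation I mean in 1. (the [1, 2, 1] binomial kernel and all names are my own illustration, not the repo's exact code):

```python
import torch
import torch.nn.functional as F

def blur(x):
    """Depthwise low-pass filter with a separable binomial kernel."""
    k = torch.tensor([1.0, 2.0, 1.0], device=x.device)
    k = torch.outer(k, k)
    k = k / k.sum()                        # normalize so the mean is preserved
    c = x.shape[1]
    w = k.expand(c, 1, 3, 3).contiguous()  # one identical filter per channel
    return F.conv2d(x, w, padding=1, groups=c)

def antialiased_downsample(x):
    # Blur first (attenuates frequencies above the new Nyquist limit),
    # then subsample by 2; without the blur, those frequencies would
    # fold into the low-frequency band instead of being removed.
    return blur(x)[:, :, ::2, ::2]
```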
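And this is the standard StyleGAN2-style noise injection I mean in 2.: freshly sampled Gaussian noise with a learned per-layer scale, which by construction carries no information about the input image (again just a sketch):

```python
import torch
import torch.nn as nn

class NoiseInjection(nn.Module):
    """Adds per-pixel Gaussian noise with a single learned strength."""
    def __init__(self):
        super().__init__()
        self.weight = nn.Parameter(torch.zeros(1))  # learned noise scale

    def forward(self, x):
        b, _, h, w = x.shape
        # Resampled on every forward pass, independent of the input image,
        # hence my concern about faithful reconstruction.
        noise = torch.randn(b, 1, h, w, device=x.device, dtype=x.dtype)
        return x + self.weight * noise
```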
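For reference in 3., this is the original StyleGAN2 weight demodulation as I understand it: one style vector scales the input channels of the entire kernel, which is exactly what makes spatially-varying styles awkward (a sketch only; the tweaked version in this repo differs):

```python
import torch

def demodulated_weight(weight, style, eps=1e-8):
    """weight: (c_out, c_in, k, k); style: (batch, c_in) per-sample scales."""
    # Modulate: scale the kernel's input channels by the style vector.
    w = weight[None] * style[:, None, :, None, None]  # (b, c_out, c_in, k, k)
    # Demodulate: rescale each output channel back to unit L2 norm.
    d = torch.rsqrt(w.pow(2).sum(dim=(2, 3, 4), keepdim=True) + eps)
    return w * d  # applied as per-sample weights via a grouped convolution
```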

Sincerely sorry for the excessively long questions, and I'm looking forward to your answers!

hubert0527 avatar Jun 24 '21 00:06 hubert0527

Thanks for the question -- I'll speak to (1) a bit. The act of subsampling the feature map (not the blurring!) already commits you to either removing or misrepresenting high-frequency information. Without blurring, you are misrepresenting it: that is what aliasing is; high-frequency content gets entangled into the low frequencies. Blurring says you would rather not represent the information than actively misrepresent it.
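A toy 1-D illustration (a sketch, not the repo's code): take the highest frequency a discrete signal can carry and compare plain subsampling with blur-then-subsample.

```python
import numpy as np

n = np.arange(16)
x = np.cos(np.pi * n)  # Nyquist-rate signal: +1, -1, +1, -1, ...

# Subsampling without blurring folds the oscillation into a constant:
# the high frequency is misrepresented as low frequency (aliasing).
print(x[::2])          # [1. 1. 1. 1. 1. 1. 1. 1.]

# Blurring with a binomial [1, 2, 1]/4 low-pass first, then subsampling,
# removes the same content instead of misrepresenting it.
blurred = np.convolve(x, np.array([1.0, 2.0, 1.0]) / 4, mode="same")
print(blurred[::2])    # ~0 everywhere (up to boundary effects)
```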

richzhang avatar Jun 24 '21 16:06 richzhang

Thanks for the comments! Now I understand the intuition. However, regarding the subsampling part: the SwappingAutoencoder uses strided convolutions to reduce the spatial dimension in the encoder. Since a strided convolution is learnable and can represent a mix of high-pass and low-pass filters, I'm not sure it is safe to say that high-frequency information is guaranteed to be either removed or misrepresented [1] (a toy check follows the footnote). In contrast, a blur operation is a deterministic low-pass filter that guarantees some information is eliminated.

[1] Consider that the input/output images are discrete and finite (i.e., 0-255 in uint8, then normalized to [-1, 1] in float32), while the intermediate features are also discrete (but much finer-grained than image colors) in float32; the cardinality of the intermediate feature space is therefore much larger than that of the input/output images. It is hard to rule out the possibility that the encoder still preserves the high-frequency information in some encoded form, even though it is also empirically known that existing autoencoders are still far from perfect reconstruction.
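As a toy check of the "learnable filter" point (my own sketch): a stride-2 convolution is exactly a dense convolution followed by subsampling, and the learned kernel determines where it lands between pure aliasing and pure low-pass filtering.

```python
import torch
import torch.nn.functional as F

x = torch.cos(torch.pi * torch.arange(16.0)).view(1, 1, 1, 16)  # Nyquist signal

delta = torch.tensor([0.0, 1.0, 0.0]).view(1, 1, 1, 3)        # identity kernel
binom = torch.tensor([1.0, 2.0, 1.0]).view(1, 1, 1, 3) / 4.0  # low-pass kernel

# The same strided conv, two possible learned kernels, opposite outcomes:
print(F.conv2d(x, delta, stride=(1, 2), padding=(0, 1)))  # all ones: aliased
print(F.conv2d(x, binom, stride=(1, 2), padding=(0, 1)))  # ~zeros: removed
```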

hubert0527 avatar Jun 24 '21 21:06 hubert0527

Note the blur happens after the conv-relu feature extractor (which is free to learn high-/low-frequency filters), immediately before the subsampling (which is the step that would cause aliasing).

richzhang avatar Jun 25 '21 02:06 richzhang