
SinGAN: Learning a Generative Model from a Single Natural Image


Metadata

  • Authors: Tamar Rott Shaham, Tali Dekel, Tomer Michaeli
  • Organization: Technion & Google Research
  • Paper: https://arxiv.org/pdf/1905.01164.pdf
  • Video: https://www.youtube.com/watch?v=xk8bWLZk4DU&feature=youtu.be


Abstract

  • Learn an unconditional (i.e., generate from noise) generative model, SinGAN, that captures the internal statistics of a single training image.
  • SinGAN can generate new samples of arbitrary size and aspect ratio, yet maintain both the global structure and the fine textures of the training image.
  • SinGAN can be used to perform paint-to-image translation, image editing, image harmonization, super-resolution, and animation, without architectural changes or further tuning.

Model

  • Multi-scale generators {G_0, ..., G_N}, trained against an image pyramid of x: {x_0, ..., x_N}, where x_n is a downsampled version of x by a factor of r^n, for some r > 1.
  • Each generator G_n learns to fool an associated discriminator D_n.
  • Each discriminator D_n is a Markovian discriminator (PatchGAN), which classifies whether each patch of x_n is real or fake (i.e., maps x_n to a 2D array in which element (i, j) signifies whether patch (i, j) of x_n is real or fake).
  • The generation of an image sample starts at the coarsest scale (n = N), and the sample is sequentially fed through all generators up to the finest scale (n = 0), with spatial white Gaussian noise injected at every scale; only G_N generates its sample purely from noise.
  • Equation version:
    • Coarsest scale (n = N): x˜_N = G_N(z_N)
    • Finer scale (n < N): x˜_n = G_n(z_n, upsample(x˜_{n+1})) = upsample(x˜_{n+1}) + ψ_n(z_n + upsample(x˜_{n+1})) (residual learning).
    • ψ_n: FCN with 5 conv-blocks (Conv3x3-BatchNorm-LeakyReLU).
    • Start with 32 kernels per block at the coarsest scale and double that number every 4 scales.
  • D_n has the same architecture as G_n, so the patch size (receptive field) is 11 × 11 (five stacked 3×3 convs with stride 1: 1 + 5·2 = 11).
  • Can generate images of arbitrary size and aspect ratio at test time by changing the dimensions of the noise maps (see the sampling sketch below).
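
A minimal sketch of a single-scale generator ψ_n and the coarse-to-fine sampling loop, assuming PyTorch; the names (`conv_block`, `ScaleGenerator`, `sample`) and exact layer settings are illustrative assumptions, not the authors' reference implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_block(in_ch, out_ch):
    # One Conv3x3-BatchNorm-LeakyReLU block of ψ_n.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.LeakyReLU(0.2, inplace=True),
    )

class ScaleGenerator(nn.Module):
    """G_n: residual generator built around the FCN ψ_n."""
    def __init__(self, channels=32):
        super().__init__()
        self.psi = nn.Sequential(
            conv_block(3, channels),
            conv_block(channels, channels),
            conv_block(channels, channels),
            conv_block(channels, channels),
            nn.Conv2d(channels, 3, kernel_size=3, padding=1),  # 5th block: back to RGB
        )

    def forward(self, z, prev_up=None):
        if prev_up is None:
            return self.psi(z)  # coarsest scale: x~_N = G_N(z_N)
        # finer scales: x~_n = upsample(x~_{n+1}) + ψ_n(z_n + upsample(x~_{n+1}))
        return prev_up + self.psi(z + prev_up)

def sample(generators, noise_shapes, sigmas):
    """Coarse-to-fine sampling; generators[0] is G_N (coarsest scale)."""
    x = None
    for G, (h, w), sigma in zip(generators, noise_shapes, sigmas):
        if x is not None:
            # upsample the previous scale's output to the current resolution
            x = F.interpolate(x, size=(h, w), mode='bilinear', align_corners=False)
        z = sigma * torch.randn(1, 3, h, w)  # spatial white Gaussian noise
        x = G(z, x)
    return x
```

Because everything is fully convolutional, passing different (h, w) entries in `noise_shapes` at test time yields samples of arbitrary size and aspect ratio.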

Intuition

  • G_N generates the general layout of the image and the object's global structure.
  • G_n (n < N) adds details that were not generated by the previous scales.
  • Injecting noise z_n at every scale (not only the coarsest) ensures that the GAN does not disregard it, as often happens in conditional GANs that learn to ignore their noise input.
  • Because the network has a limited receptive field (smaller than the entire image; a quick check of the 11×11 figure follows this list), it can generate new combinations of patches that do not exist in the training image.
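
A quick sanity check of the 11 × 11 patch size, computing the receptive field of 5 stacked 3×3 convs with stride 1:

```python
def receptive_field(num_layers=5, kernel=3, stride=1):
    # Receptive field grows by (kernel - 1) * jump per layer; jump stays 1 at stride 1.
    rf, jump = 1, 1
    for _ in range(num_layers):
        rf += (kernel - 1) * jump
        jump *= stride
    return rf

print(receptive_field())  # -> 11
```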

Training

  • Train GANs sequentially from coarse to fine scale. Once each GAN is trained, it is kept fixed.
  • Losses: adversarial_loss(G_n, D_n) + α * reconstruction_loss(G_n). The reconstruction loss ensures G_n can produce x_n from a specific set of noise maps.
  • Adversarial loss: WGAN-GP loss (final discriminator loss = average over patch discrimination map).
  • Reconstruction loss (L_rec):
    • Coarsest scale (n = N): L_rec = ||G_N(z*) − x_N||²,
    • Finer scales (n < N): L_rec = ||G_n(z = 0, upsample(x˜rec_{n+1})) − x_n||²,
    • where z* is a fixed noise map (drawn once and kept fixed during training),
    • x˜rec_n is the image reconstructed by G_n; it is also used to determine the standard deviation σ_n of the noise z_n at each scale.
    • σ_n is proportional to RMSE(upsample(x˜rec_{n+1}), x_n), which indicates the amount of detail that needs to be added at that scale.
  • See the hyperparameter settings in their supplementary material. A hedged sketch of the per-scale losses follows.
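
A sketch of the per-scale training objective, assuming PyTorch and the residual generator interface from the earlier sketch: WGAN-GP adversarial loss averaged over the patch discrimination map, plus the α-weighted reconstruction loss. `D`, `G`, `alpha`, and `gp_weight` are illustrative names and defaults, not the paper's exact code:

```python
import torch
import torch.nn.functional as F

def gradient_penalty(D, real, fake, gp_weight=0.1):
    # Standard WGAN-GP term on samples interpolated between real and fake.
    eps = torch.rand(real.size(0), 1, 1, 1, device=real.device)
    mixed = (eps * real + (1 - eps) * fake.detach()).requires_grad_(True)
    grads, = torch.autograd.grad(D(mixed).sum(), mixed, create_graph=True)
    return gp_weight * ((grads.flatten(1).norm(2, dim=1) - 1) ** 2).mean()

def discriminator_loss(D, x_n, x_fake):
    # D outputs a patch map; the final loss averages over all patches.
    # x_fake is detached so only D is updated here.
    return D(x_fake.detach()).mean() - D(x_n).mean() + gradient_penalty(D, x_n, x_fake)

def generator_loss(D, G, x_n, x_fake, z_fixed=None, prev_rec_up=None, alpha=10.0):
    adv = -D(x_fake).mean()
    if prev_rec_up is None:
        x_rec = G(z_fixed)             # coarsest scale: L_rec = ||G_N(z*) - x_N||^2
    else:
        zeros = torch.zeros_like(prev_rec_up)
        x_rec = G(zeros, prev_rec_up)  # finer scales: zero noise, reconstructed input
    return adv + alpha * F.mse_loss(x_rec, x_n)

def noise_sigma(prev_rec_up, x_n):
    # σ_n ∝ RMSE(upsample(x~rec_{n+1}), x_n): how much detail scale n must add.
    return torch.sqrt(F.mse_loss(x_n, prev_rec_up))
```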

Related Work
