SinGAN
Problem in gradient penalty
It seems calc_gradient_penalty is based on this:
https://github.com/caogang/wgan-gp/blob/ae47a185ed2e938c39cf3eb2f06b32dc1b6a2064/gan_mnist.py#L129-L149
Both use the same expression for computing gradient_penalty.
gradient_penalty = ((gradients.norm(2, dim=1) - 1) ** 2).mean() * LAMBDA
But there is a difference in the shape of gradients.
In the other repository, gradients has shape [BATCH_SIZE, OUTPUT_DIM], so gradients.norm(2, dim=1) computes one norm over all features of each image.
But in this repository, gradients has shape [1, 3, HEIGHT, WIDTH], so gradients.norm(2, dim=1) computes pixel-wise gradient norms (over the 3 channels).
Therefore, the penalty drives each pixel-wise gradient norm toward 1.
I'm afraid this can break the 1-Lipschitz constraint of WGAN.
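For concreteness, here is a small standalone sketch of the shape difference (the tensor sizes and LAMBDA value are placeholders, not the actual training tensors):
import torch

LAMBDA = 0.1  # placeholder value

# "Regular" WGAN-GP: gradients per sample, shape [BATCH_SIZE, OUTPUT_DIM]
g_flat = torch.randn(64, 784)
norms = g_flat.norm(2, dim=1)            # [64]: one norm per image
penalty = ((norms - 1) ** 2).mean() * LAMBDA

# SinGAN: gradients w.r.t. the full image, shape [1, 3, HEIGHT, WIDTH]
g_img = torch.randn(1, 3, 25, 34)
norms = g_img.norm(2, dim=1)             # [1, 25, 34]: one norm per pixel
penalty = ((norms - 1) ** 2).mean() * LAMBDA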
Related issue: https://github.com/caogang/wgan-gp/issues/47
Thanks. We intentionally did it this way.
This is because we treat the patches of the image as our dataset, and we want the discriminator to be 1-Lipschitz per patch. This leads to the pixel-wise norm term, as the output of the discriminator is a map with a score per patch.
This is in contrast to a "regular" WGAN-GP trained on a dataset of images, which requires the discriminator to be 1-Lipschitz per input image. In that case you would indeed take the gradient norm over the full image.
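To illustrate the kind of output map being described, a minimal fully-convolutional sketch (not SinGAN's exact architecture; the channel widths and activation are arbitrary choices):
import torch
import torch.nn as nn

# Five unpadded 3x3 convolutions -> each output score has an 11x11 receptive field
netD = nn.Sequential(
    nn.Conv2d(3, 32, 3), nn.LeakyReLU(0.2),
    nn.Conv2d(32, 32, 3), nn.LeakyReLU(0.2),
    nn.Conv2d(32, 32, 3), nn.LeakyReLU(0.2),
    nn.Conv2d(32, 32, 3), nn.LeakyReLU(0.2),
    nn.Conv2d(32, 1, 3),
)
x = torch.randn(1, 3, 25, 34)
print(netD(x).shape)  # torch.Size([1, 1, 15, 24]) -- a map with one score per 11x11 patch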
Ah, it slipped my mind that the discriminator sees multiple patches. Thank you for your explanation.
and we want the discriminator to be 1-lipschitz per patch.
I'm afraid this is not actually achieved.
For example, assume ker_size=3 and num_layer=5.
If the input shape is [1, 3, 11, 11], the output shape will be [1, 1, 1, 1].
Since we then get a single output, just like in regular WGAN-GP, gradient_penalty has to be computed with the norm over the whole image (which is exactly one patch), not pixel-wise.
Therefore, what we need for the 1-Lipschitz constraint is not a pixel-wise penalty but a patch-wise penalty. (I have no idea how to compute it efficiently.)
What do you think?
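A quick check of that shape claim (assuming stride-1, unpadded 3x3 convolutions, which is how I read the ker_size=3, num_layer=5 setting):
ker_size, num_layer = 3, 5
receptive_field = 1 + num_layer * (ker_size - 1)   # 11 pixels per patch
out_size = 11 - num_layer * (ker_size - 1)         # spatial size of the output map
print(receptive_field, out_size)                   # 11 1 -> an 11x11 input gives a single score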
~I think this is the canonical way to compute gradient penalty.~ I found that since the discriminator has batch normalization layers, the code below is not correct...
import torch

# alpha, real_data, fake_data, netD, device and LAMBDA come from the
# surrounding training code, as in calc_gradient_penalty.

# Create interpolates as usual
interpolates = alpha * real_data + ((1 - alpha) * fake_data)  # [1, 3, HEIGHT, WIDTH]
# Extract patches (11 = receptive field of the discriminator)
patches = torch.nn.functional.unfold(interpolates, 11)  # [1, 3*11*11, NUM_PATCHES]
patches = patches.squeeze(0).transpose(0, 1)  # [NUM_PATCHES, 3*11*11]
patches = patches.reshape([-1, 3, 11, 11])  # Now we have patch images
patches = patches.to(device).detach().requires_grad_()
disc_patches = netD(patches)  # [NUM_PATCHES, 1, 1, 1]
# gradients has shape [NUM_PATCHES, 3, 11, 11]
gradients = torch.autograd.grad(outputs=disc_patches, inputs=patches,
                                grad_outputs=torch.ones(disc_patches.size()).to(device),
                                create_graph=True, retain_graph=True, only_inputs=True)[0]
# Compute the mean of the patch-wise gradient penalty
gradient_penalty = ((gradients.view([-1, 3*11*11]).norm(2, dim=1) - 1) ** 2).mean() * LAMBDA
@t-ae did you try your proposed patch? What were your results? @tamarott any thoughts on the previous comments re: pixel-wise vs patch-wise?
Hi, you can turn it into a loss over the whole image by flattening the data:
gradients = gradients[0].view(real_data.size(0), -1)
before calculating the penalty itself. The whole model may need additional hyperparameter adjustment.
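In context, that flattening would sit in the penalty computation roughly like this (a sketch loosely following the gradient-penalty code linked at the top of the thread; the signature and variable names are assumptions, not the exact repository code):
import torch

def calc_gradient_penalty(netD, real_data, fake_data, LAMBDA, device):
    alpha = torch.rand(1, device=device)                  # single interpolation weight
    interpolates = alpha * real_data + (1 - alpha) * fake_data
    interpolates = interpolates.detach().requires_grad_()
    disc_interpolates = netD(interpolates)                # score map
    gradients = torch.autograd.grad(outputs=disc_interpolates, inputs=interpolates,
                                    grad_outputs=torch.ones_like(disc_interpolates),
                                    create_graph=True, retain_graph=True, only_inputs=True)
    # Flatten so that a single norm is taken over the whole image
    gradients = gradients[0].view(real_data.size(0), -1)
    gradient_penalty = ((gradients.norm(2, dim=1) - 1) ** 2).mean() * LAMBDA
    return gradient_penalty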
Long comment on dead thread incoming, sorry!
I agree with @t-ae that the current gradient penalty code does not enforce a 1-Lipschitz discriminator over patches. In this case, the discriminator is a function f : R^d -> R^k, where k is the number of patches. Usually (for example, in image discrimination) it would be f : R^d -> R. As @tamarott said, the regular WGAN gradient penalty is not appropriate here. But I don't think that forcing the penalty over the whole image is any more correct.
Instead of penalizing the whole function, you want each output f_i to be 1-Lipschitz. Or equivalently, the norm of each row of the input-output Jacobian should be bounded by 1.
You could do this exactly with k backward passes by computing the Jacobian matrix directly. But you can probably do only slightly worse with a single (or a small number of) backward passes:
- You could approximate the Frobenius norm of the Jacobian via the Hutchinson trace estimator. To do this, sample noise v and compute the vector-Jacobian product v^T J. The dot product (v^T J)(v^T J)^T = v^T J J^T v gives an unbiased estimate of the squared Frobenius norm. This lets you impose the constraint on the sum of the squared gradient norms.
- You could bound (an approximation of) the spectral norm of the Jacobian (or some other induced matrix norm) via power iteration, like Appendix F of "Sorting out Lipschitz function approximation". This guarantees a bound on the row norms too.
- You can get an unbiased stochastic estimate for the gradient norms themselves, with a similar idea to the Hutchinson trace estimator. Sample noise v and compute v^T J with one pass, then (v^T J) J^T with another. Finally, (v^T J) J^T ⊙ v gives you an unbiased estimate of the squared norms of the rows of the Jacobian. (Thanks @rtqichen for this one.) A sketch of this estimator follows after the list.
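A minimal PyTorch sketch of that last estimator (assuming a discriminator netD whose output is a map of patch scores; the function names and Gaussian probe noise are my own choices, and the exact k-pass version is included only as a reference):
import torch

def patch_grad_norm_sq_estimate(netD, x):
    # Unbiased estimate of the squared gradient norm of each patch score
    # (i.e. the squared norm of each row of the input-output Jacobian),
    # using one vector-Jacobian product and one Jacobian-vector product.
    x = x.detach().requires_grad_()
    out = netD(x)                                    # score map: one value per patch
    v = torch.randn_like(out).requires_grad_()       # probe noise with E[v v^T] = I
    # First pass (VJP): u = J^T v, same shape as x
    u = torch.autograd.grad(out, x, grad_outputs=v, create_graph=True)[0]
    # Second pass (JVP via double backward): w = J u = J J^T v, same shape as out
    w = torch.autograd.grad(u, v, grad_outputs=u, create_graph=True)[0]
    return w * v                                     # E[w * v] = squared row norms of J

def patch_grad_norm_sq_exact(netD, x):
    # Exact version for comparison: one backward pass per patch score.
    x = x.detach().requires_grad_()
    out = netD(x).flatten()
    norms_sq = []
    for i in range(out.numel()):
        g = torch.autograd.grad(out[i], x, retain_graph=True)[0]
        norms_sq.append((g ** 2).sum())
    return torch.stack(norms_sq)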
Ultimately, I don't think these changes are necessary (the code works!). The current approach is sufficient to prevent the discriminator from allocating large gradient directions to "cheat" and so the WGAN training works fine.