amortized_svgd
Question about code
Hi, thank you for your code, which helped me understand some of the concepts from the paper better. I had a few remaining questions about the implementation; it would be great if you could clarify them.
```python
g_optim.zero_grad()
autograd.backward(
    -z_i,  # why minus, and why z_i?
    grad_tensors=svgd,
)
g_optim.step()
```
I was a little confused about why we take the gradient with respect to `-z_i` rather than `z_i` in the lines above, and also why we compute the kernel over two different batches of particles (`z_i` and `z_j`) rather than between the particles of a single batch. Is that to help with something like training stability, for example? Thanks!
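For context, here is a minimal sketch of how the SVGD direction is typically computed over a single batch of particles, assuming PyTorch and an RBF kernel; `rbf_kernel`, `svgd_phi`, `log_prob`, and the bandwidth `h` are illustrative names, not identifiers from this repo:

```python
import torch
from torch import autograd

def rbf_kernel(x, y, h=1.0):
    # k(x, y) = exp(-||x - y||^2 / h), pairwise over two batches of shape (n, d).
    return torch.exp(-torch.cdist(x, y) ** 2 / h)

def svgd_phi(z, log_prob, h=1.0):
    # phi(z_i) = (1/n) sum_j [ k(z_j, z_i) grad_{z_j} log p(z_j) + grad_{z_j} k(z_j, z_i) ]
    z = z.detach().requires_grad_(True)
    score = autograd.grad(log_prob(z).sum(), z)[0]  # (n, d) score function
    k = rbf_kernel(z, z.detach(), h)                # (n, n): one batch vs. itself
    # autograd differentiates the *first* kernel argument, which for a symmetric
    # RBF kernel is the negative of the repulsive term we want, hence the minus.
    grad_k = -autograd.grad(k.sum(), z)[0]
    return (k.detach() @ score + grad_k) / z.size(0)
```

Note that the kernel here is evaluated between the batch and a detached copy of itself, so nothing in the algorithm requires two independent batches.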
@gunshi
- You can compute the kernel among any number of particles. But since the particles are essentially random samples, it might make sense to approximate the posterior with more draws from the generator.
- I think the `-z_i` term is wrong (see equation (10) from https://arxiv.org/pdf/1707.06626.pdf).
- Still, the term `autograd.backward(z_i, grad_tensors=svgd)` is confusing. `svgd` is a tensor, so we need to compute the Jacobian-vector product to take gradients. Probably, we want to update `z_j` with respect to the SVGD loss (see the PyTorch docs); a sketch of this update follows this reply.
Empirically, the `-z_i` term didn't work for me, and it maximized the loss as expected. Flipping the sign works fine. I also found that computing `k(z_j, z_j)` and then updating w.r.t. `z_j` converges faster to the MAP estimate, which (for me) is not desirable.
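For concreteness, here is a minimal sketch of one reading of equation (10), i.e. the chain-rule ascent update η ← η + ε Σ_i ∂f(η; ξ_i)/∂η · Δz_i, implemented with a descent optimizer. It reuses `svgd_phi` from the sketch above; the generator, optimizer, and target below are stand-ins, not this repo's models:

```python
import torch
from torch import autograd, nn

g = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 2))
g_optim = torch.optim.Adam(g.parameters(), lr=1e-3)
target = torch.distributions.MultivariateNormal(torch.zeros(2), torch.eye(2))

for _ in range(1000):
    xi = torch.randn(128, 2)             # input noise
    z = g(xi)                            # particles f(eta; xi)
    phi = svgd_phi(z, target.log_prob)   # SVGD direction, a detached constant

    g_optim.zero_grad()
    # Accumulates -(dz/d eta)^T phi into .grad; the optimizer's descent step
    # then moves eta along +phi, i.e. the SVGD ascent direction.
    autograd.backward(z, grad_tensors=-phi)
    g_optim.step()
```

Because the Jacobian-vector product is linear in both arguments, `autograd.backward(z, grad_tensors=-phi)` and the repo's `autograd.backward(-z_i, grad_tensors=svgd)` place the same gradient in `.grad`; whether that sign is the right one for a given loss definition is exactly what is being debated above.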
@gunshi @neale
- I think he uses `-z_i` because it's gradient ascent rather than gradient descent; see equation (10).
- The problem for me is that he uses the X coordinate to predict the Y coordinate of the particle. I find that weird; how can we mix two different axes? (see `z_flow` and `data_energy` in `d_learn`)
- What should we do if we want to sample particles with both coordinates (X and Y) instead of just Y?
- He passes the observed data to the generator together with the noise. Why do we need to do that, especially since it's not relevant to the article?
- Something that confused me even more: in the article the discriminator has only one input (one particle), so why is he using two inputs (two particles)?
- Also, the way he updates the parameters, among many other things, differs from what they did in the article.
I tried to implement a version where everything matches the article exactly, but for now it is still not working, even though I followed the article as closely as I could. Here is the code if you want to take a look: https://github.com/mokeddembillel/Amortized-SVGD-GAN