
Why don't we use MSE as a reconstruction loss for a VAE?

Open ahmed-fau opened this issue 7 years ago • 7 comments

Hi,

I am wondering if there is a theoretical reason for using BCE as the reconstruction loss for variational auto-encoders? Can't we simply use MSE or a norm-based reconstruction loss instead?

Best Regards

ahmed-fau avatar Aug 07 '18 11:08 ahmed-fau

+1

nwesem avatar Aug 29 '18 21:08 nwesem

According to the original VAE paper [1], BCE is used because the decoder is implemented as an MLP with a sigmoid output, which can be viewed as parameterizing a Bernoulli distribution. You can use MSE if you implement a Gaussian decoder instead. Take the following pseudocode as an example:

mu = MLP(z)                           # one decoder head outputs the mean
sigma = MLP(z)                        # a second head outputs the (positive) standard deviation
reconstruction = Gaussian(mu, sigma)  # p(x|z) is a Normal rather than a Bernoulli
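
A more concrete, runnable version of that idea (a sketch only; the layer names and sizes below are illustrative assumptions, not code from this repo) could look like:

import torch
import torch.nn as nn

class GaussianDecoder(nn.Module):
    """Decoder that parameterizes a Normal distribution p(x|z)."""
    def __init__(self, z_dim=20, h_dim=400, x_dim=784):
        super().__init__()
        self.hidden = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU())
        self.mu_head = nn.Linear(h_dim, x_dim)      # mean of p(x|z)
        self.logvar_head = nn.Linear(h_dim, x_dim)  # log-variance of p(x|z)

    def forward(self, z):
        h = self.hidden(z)
        mu, log_var = self.mu_head(h), self.logvar_head(h)
        return torch.distributions.Normal(mu, torch.exp(0.5 * log_var))

# Reconstruction term = negative log-likelihood of x under that Normal:
#   dist = decoder(z); recon_loss = -dist.log_prob(x).sum()

If the variance is held fixed at 1, that negative log-likelihood reduces to the MSE up to constant terms, which is why MSE is the natural Gaussian counterpart of BCE.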

Ref

[1] https://arxiv.org/abs/1312.6114

kingofspace0wzz avatar Sep 17 '18 10:09 kingofspace0wzz

> because the decoder is implemented as an MLP with a sigmoid output, which can be viewed as parameterizing a Bernoulli distribution.

Does this mean that, in order to model a "continuous" distribution like a Gaussian, we should not use a sigmoid as the output layer and should instead replace it with a tanh, or even a flattened conv layer, for example?

ahmed-fau avatar Sep 17 '18 11:09 ahmed-fau

Yes, that is the idea. However, I don't think tanh would be an appropriate choice for the output layer in this case, since we are fitting values in [0, 1] rather than [-1, 1]. In fact, torch.nn.functional.binary_cross_entropy() will raise an error if it is fed a negative input.
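
As a small illustration (assumed toy values, not from the repo): a sigmoid output always lies in [0, 1] and is accepted, while a tanh output can be negative and is rejected:

import torch
import torch.nn.functional as F

target = torch.tensor([[0.0, 1.0], [1.0, 0.0]])            # pixel targets in [0, 1]
pre_activation = torch.tensor([[-1.0, 0.5], [2.0, -0.3]])

loss = F.binary_cross_entropy(torch.sigmoid(pre_activation), target)  # fine: values in [0, 1]

try:
    F.binary_cross_entropy(torch.tanh(pre_activation), target)  # contains negative values
except RuntimeError as err:
    print(err)  # recent PyTorch versions reject inputs outside [0, 1]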

kingofspace0wzz avatar Sep 17 '18 20:09 kingofspace0wzz

Then would using a conv layer and flattening its output be a possible way to model the output layer?

ahmed-fau avatar Sep 17 '18 20:09 ahmed-fau

> because the decoder is implemented as an MLP with a sigmoid output, which can be viewed as parameterizing a Bernoulli distribution.

> Does this mean that, in order to model a "continuous" distribution like a Gaussian, we should not use a sigmoid as the output layer and should instead replace it with a tanh, or even a flattened conv layer, for example?

If I change https://github.com/pytorch/examples/blob/master/vae/main.py#L60 from .sigmoid() to .tanh(), and https://github.com/pytorch/examples/blob/master/vae/main.py#L74 from BCE to MSE, will that make this VAE perform a Gaussian reconstruction and work for [-1, 1]? I would appreciate any input.

muammar avatar Oct 08 '19 17:10 muammar

@muammar To model a Gaussian output distribution, it usually works fine to use no activation function in the last layer and to interpret the output as the mean of a normal distribution. If we assume a constant variance, we naturally end up with MSE as the loss function. An alternative, proposed by An et al., is to duplicate the output layer of the decoder so that it models both the mean and the variance of the normal distribution, and then to optimize the negative log-likelihood.
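
A rough sketch of both options (the KL term mirrors the one already used in this repo's main.py; the function signatures and the 784-dimensional flattening are assumptions for illustration, not the author's exact code):

import torch
import torch.nn.functional as F

# Option 1: last layer has no activation and is read as the mean of a Gaussian
# with constant variance; up to constant factors the NLL is then just the MSE.
def loss_constant_variance(recon_mu, x, mu, logvar):
    recon = F.mse_loss(recon_mu, x.view(-1, 784), reduction='sum')
    kld = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kld

# Option 2 (in the spirit of An et al.): a second decoder head predicts the
# log-variance of p(x|z), and we minimize the Gaussian negative log-likelihood
# (constant terms dropped).
def loss_learned_variance(recon_mu, recon_logvar, x, mu, logvar):
    x = x.view(-1, 784)
    nll = 0.5 * torch.sum(recon_logvar + (x - recon_mu).pow(2) / recon_logvar.exp())
    kld = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return nll + kld

Newer PyTorch releases also provide torch.nn.GaussianNLLLoss, which implements essentially the same objective as option 2.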

GPla avatar Jul 22 '20 15:07 GPla