
Reconstruct Results / Implementation Details


Hi,

I have some questions regarding the implementation, and I can't reproduce the perplexities reported in the paper.

  1. I'd be interested in #5, as well.
  2. I can't reproduce the results mentioned in the paper, even with the configurations suggested in #3 and #7. My best model on PTB achieves a PPL of 110.
  3. Why don't you fix #4 in your code? I don't think the majority of users is still on PyTorch 0.3.1.
  4. Why are you computing the perplexity as exp(recon_loss + kl)? As far as I understand (Wikipedia, here and here), perplexity measures how well the model output matches the given data. It should treat the model as a black box, like exp(entropy) or exp(cross-entropy). In particular, models with a high value of kappa are penalized by the KL term, which for a vMF is constant (it depends only on kappa and the dimension, not on the input). E.g., kappa → ∞, which is equivalent to a (non-variational) autoencoder on the hypersphere, always yields a PPL of infinity. (See the sketch after this list.)
  5. Why do you sample multiple times in the latent space (nsample here, set to 3 by default) and then compute the mean? As far as I understand, this sampling-and-averaging produces samples that no longer follow the vMF distribution, and the resulting samples have reduced variance compared to the original ones. The same variance reduction could be achieved by increasing kappa, which would increase the PPL under the definition in point 4. The other linked implementation does not do this.
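
To make point 4 concrete, here is a minimal sketch (the loss values are made up for illustration, not numbers from this repo) of how the two definitions diverge:

import math

recon_loss = 4.5   # assumed per-token reconstruction cross-entropy (made up)
kl = 0.3           # assumed per-token KL contribution (made up; constant w.r.t. the input for a vMF)

ppl_elbo = math.exp(recon_loss + kl)   # exp(recon_loss + kl), the definition questioned above
ppl_blackbox = math.exp(recon_loss)    # exp(cross-entropy), treating the model as a black box

print(ppl_elbo, ppl_blackbox)          # ~121.5 vs ~90.0; as kl grows, only ppl_elbo diverges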

Thanks!

thequilo avatar Mar 09 '19 08:03 thequilo

I just noticed that the two "optimal" hyperparameter settings you mentioned in #3 don't match the KL values from the paper for PTB. For the Standard setting, your hyperparameters suggest lat_dim=50 with kappa=5 or kappa=35, which produce a KLD of 0.2 or 7.6, respectively. To obtain the 5.7 reported in the paper at the same dimension, kappa must be somewhere between 28 and 29. The same holds for Yelp: the configuration from #3 produces a KLD of 19.6, not the reported 18.6.

The KLDs listed above were calculated using your implementation of vMF from vmf_batch with the following code:

>>> from NVLL.distribution.vmf_batch import *
>>> vMF(hid_dim=1, lat_dim=50, kappa=5).kld
tensor([0.2372], device='cuda:0')
>>> vMF(hid_dim=1, lat_dim=50, kappa=35).kld
tensor([7.6284], device='cuda:0')
>>> vMF(hid_dim=1, lat_dim=50, kappa=80).kld
tensor([19.5847], device='cuda:0')
>>> vMF(hid_dim=1, lat_dim=50, kappa=28.6).kld
tensor([5.6961], device='cuda:0')
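
For what it's worth, the kappa matching a given target KLD can be found by bisection; a minimal sketch, assuming the same vMF interface as in the session above:

from NVLL.distribution.vmf_batch import vMF

def find_kappa(target_kld, lat_dim=50, lo=1.0, hi=100.0, tol=1e-3):
    # The KLD of a vMF grows monotonically with kappa, so bisection converges.
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if vMF(hid_dim=1, lat_dim=lat_dim, kappa=mid).kld.item() < target_kld:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

print(find_kappa(5.7))  # lands between 28 and 29, consistent with the session above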

Could you please provide the configurations that were actually used, or tell me whether I'm doing something wrong?

thequilo avatar Mar 19 '19 21:03 thequilo

  1. Testing and evaluation can be done with eval_nvdm.py and eval_nvrnn.py.
  2. When you say your best model, do you mean a vMF model, a Gaussian model, or a pure LSTM? I will take a look at this issue, though.
  3. PyTorch updates are quite frequent; I might not be able to keep up with them in a timely manner.
  4. recon_loss + kl is basically the NLL loss (the negative ELBO). In a VAE we refer to the ELBO, which is the black-box measure you mention here.
  5. Computing the expectation exactly is intractable, so we approximate it by sampling. Multiple samples do help reduce the variance (see the sketch below).
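
As a toy illustration of answer 5 (not code from this repo; random unit vectors stand in for vMF samples), the per-coordinate variance of the averaged latent shrinks roughly as 1/nsample:

import numpy as np

rng = np.random.default_rng(0)

def averaged_latent(nsample, dim=50):
    # Stand-in for latent samples: random unit vectors, averaged as with nsample=3.
    z = rng.normal(size=(nsample, dim))
    z /= np.linalg.norm(z, axis=1, keepdims=True)
    return z.mean(axis=0)

for n in (1, 3, 10):
    runs = np.stack([averaged_latent(n) for _ in range(2000)])
    print(n, runs.var(axis=0).mean())  # variance falls roughly as 1/n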

Sorry for the late response. I will take a look at the questions you raised and try to give a better answer.

jiacheng-xu avatar Apr 13 '19 04:04 jiacheng-xu