VAENAR-TTS

For model/prior.py _initial_sample, why is the prob calculated as from N(0, 1)?

Open seekerzz opened this issue 2 years ago • 16 comments

Hello, thanks for sharing the PyTorch-based code! However, I have a question about the _initial_sample function in model/prior.py. epsilon is sampled from $N(0, t)$ (where $t$ is the temperature), so how is its log-prob calculated? For a normal distribution, $p(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$, so after taking the log (with mean 0), $\log p(x) = -\frac{1}{2}\log(2\pi) - \log\sigma - \frac{x^2}{2\sigma^2}$. Can you explain why $\sigma$ is taken as 1 here instead of $t$?

seekerzz avatar Sep 22 '21 02:09 seekerzz

Hi @seekerzz, t is always 1 in our setting.

keonlee9420 avatar Sep 27 '21 03:09 keonlee9420
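
For reference, here is a minimal sketch of the sampling and log-probability computation being discussed; the function and argument names are illustrative, not the repo's exact code. With t = 1, N(0, t) coincides with N(0, 1), so using the standard-normal log-density is consistent.

```python
import math
import torch

def initial_sample_sketch(batch, length, dim, temperature=1.0):
    # Illustrative sketch, not the repo code.
    # epsilon ~ N(0, temperature^2): scale a standard-normal draw.
    eps = torch.randn(batch, length, dim) * temperature
    # Standard-normal log-density, log N(eps; 0, 1) = -0.5 * (log(2*pi) + eps^2),
    # summed over the latent dimension; exact only when temperature == 1.
    logprob = (-0.5 * (math.log(2.0 * math.pi) + eps ** 2)).sum(dim=-1)
    return eps, logprob
```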

Thanks for your reply! Have you tried the multi-speaker setting? I used the code for LibriTTS training, but the performance is bad and the KL is high (on the order of 10^3). I also added the initialization of mu and logvar from the flowseq repo (so that they start out around 0), but this did not help. I tried to first train the posterior (using only the mel loss) and then the prior (using only the KL), but it still does not converge. I also checked whether the posterior P(Z|X,Y) and the decoder P(Y|Z,X) simply discard the information of X (acting like an encoder-decoder of Y), but the decoder alignment shows that the information of X is used. This makes me wonder why the prior fails to learn from the posterior:

  • Will it be too hard for Glow to learn in the multi-speaker setting?
  • Or should I try maximum-likelihood training of Z instead of the KL?

seekerzz avatar Sep 27 '21 03:09 seekerzz
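
For context, a minimal sketch of the two-stage schedule described above: train the posterior and decoder with the mel loss first, then freeze them and train the prior with the KL term only. The function names and the L1 reconstruction loss are illustrative assumptions, not the repo's exact objective.

```python
import torch
import torch.nn.functional as F

def two_stage_loss(mel_pred, mel_gt, post_logprob, prior_logprob, stage):
    # Illustrative sketch, not the repo code.
    # Monte-Carlo KL estimate from posterior samples:
    # KL(q || p) ~= E_q[log q(z|x,y) - log p(z|x)].
    kl = (post_logprob - prior_logprob).mean()
    recon = F.l1_loss(mel_pred, mel_gt)
    if stage == 1:          # stage 1: train posterior + decoder only
        return recon
    return kl               # stage 2: posterior/decoder frozen, train the prior
```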

By the way, this is my training curve: [training-curve image]. I did not train the length predictor (I just used the ground-truth lengths).

seekerzz avatar Sep 27 '21 03:09 seekerzz

Can you share the synthesized samples? And where did you apply the speaker information, e.g., speaker embedding?

keonlee9420 avatar Sep 27 '21 13:09 keonlee9420

Thanks for the quick reply! 😁 I add the speaker embedding to the text embedding (since I think Z can be viewed as a style mapping from text X to mel Y, adding the speaker information to X seems more intuitive). However, the synthesized samples are still very bad after about 40 epochs on LibriTTS. For example, here are the predicted and the ground-truth mels: [images]. However, if I train only the posterior, the predicted mel is quite OK: [image].

I read another flow-based TTS, Glow-TTS, and found that they condition the speaker information on Z. Maybe I should try their conditioning method. 🤔

seekerzz avatar Sep 27 '21 14:09 seekerzz
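
A minimal sketch of the conditioning described above, assuming a learned speaker-embedding table whose output is added to the text-encoder output (class and argument names are illustrative):

```python
import torch
import torch.nn as nn

class SpeakerConditionedText(nn.Module):
    # Illustrative sketch, not the repo code.
    def __init__(self, text_encoder: nn.Module, n_speakers: int, d_model: int):
        super().__init__()
        self.text_encoder = text_encoder
        self.spk_emb = nn.Embedding(n_speakers, d_model)

    def forward(self, text_tokens, speaker_ids):
        x = self.text_encoder(text_tokens)           # (B, T_text, d_model)
        s = self.spk_emb(speaker_ids).unsqueeze(1)   # (B, 1, d_model)
        # Broadcast-add the speaker embedding so both the posterior and the
        # prior see speaker-dependent text representations X.
        return x + s
```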

Thanks for sharing. So if I understood correctly, you add the speaker embedding to the text embedding right after the text encoder, so that both the posterior and prior encoders take the speaker-dependent hidden representations X; am I right? If so, is it different from Glow-TTS' conditioning method as they explained?

To train multi-speaker Glow-TTS, we add the speaker embedding and increase the hidden dimension. The speaker embedding is applied in all affine coupling layers of the decoder as a global conditioning

I quoted it from section 4 of the Glow-TTS paper.

keonlee9420 avatar Sep 27 '21 14:09 keonlee9420
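
For comparison, a rough sketch of the Glow-TTS-style conditioning quoted above, where the speaker embedding enters every affine coupling layer of the decoder as a global condition. Layer sizes and names are assumptions for illustration, not Glow-TTS' actual implementation.

```python
import torch
import torch.nn as nn

class GloballyConditionedCoupling(nn.Module):
    # Illustrative sketch of global conditioning in an affine coupling layer.
    def __init__(self, channels: int, hidden: int, gin_channels: int):
        super().__init__()
        self.pre = nn.Conv1d(channels // 2, hidden, 1)
        self.cond = nn.Conv1d(gin_channels, hidden, 1)   # speaker (global) conditioning path
        self.post = nn.Conv1d(hidden, channels, 1)       # predicts log-scale and shift

    def forward(self, z, g):
        # z: (B, C, T) latent, g: (B, gin_channels, 1) speaker embedding.
        za, zb = z.chunk(2, dim=1)
        h = torch.relu(self.pre(za) + self.cond(g))      # cond(g) broadcasts over time
        log_s, t = self.post(h).chunk(2, dim=1)
        zb = zb * torch.exp(log_s) + t                   # affine transform on one half
        logdet = log_s.sum(dim=(1, 2))
        return torch.cat([za, zb], dim=1), logdet
```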

Yes! I am going to try their conditioning method. If it succeeds, I will share the result. 😊

seekerzz avatar Sep 27 '21 14:09 seekerzz

Ah, I see. I think it should work if you adopt the same approach. Looking forward to seeing it!

keonlee9420 avatar Sep 27 '21 14:09 keonlee9420

@seekerzz hey, have you made any progress?

keonlee9420 avatar Oct 14 '21 10:10 keonlee9420

Hello! I just found what might be a mistake in the code! In VAENAR.py: [image] But in posterior.py: [image] I'm trying to train the multi-speaker version again to see the results. 😁 (I'm curious why LJSpeech still works, haha)

seekerzz avatar Oct 14 '21 10:10 seekerzz

Great! Hope to get clear samples soon. That's intended, since we are not interested in the alignment from the posterior, so you should see no error from it when you use the same code in the multi-speaker setting.

keonlee9420 avatar Oct 14 '21 11:10 keonlee9420

Hello, I mean that the positions of mu and logvar are swapped.

seekerzz avatar Oct 14 '21 11:10 seekerzz

Ah, sorry for the misunderstanding. Yes, you're right, they should be switched. But the reason it still works is that they are the same layers, just wrongly named (reversed). In other words, mu_projection in the current implementation predicts logvar, and logvar_projection predicts mu. I will retrain the model with this fix when I have room for that. Thanks for the report!

keonlee9420 avatar Oct 14 '21 11:10 keonlee9420
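
To illustrate the point (a sketch, not the repo code): both projections are plain linear layers with no built-in meaning, so as long as training and inference unpack them in the same order, swapping the names only relabels which layer learns which quantity.

```python
import torch
import torch.nn as nn

class PosteriorHead(nn.Module):
    # Illustrative sketch; the dimensions are made up.
    def __init__(self, d_model: int = 256, d_latent: int = 128):
        super().__init__()
        self.mu_projection = nn.Linear(d_model, d_latent)
        self.logvar_projection = nn.Linear(d_model, d_latent)

    def forward(self, h):
        # If the caller consistently unpacks these in the opposite order,
        # mu_projection simply ends up learning logvar and vice versa;
        # the resulting model is equivalent, only the names mislead.
        return self.mu_projection(h), self.logvar_projection(h)

head = PosteriorHead()
h = torch.randn(2, 10, 256)
mu, logvar = head(h)
z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterization
```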

Thanks for your reply! I see now that the two can simply take each other's variable name!

My main problem with the multi-speaker training is that the prior cannot converge. The posterior and decoder can be trained easily within about 20 epochs: [image]. Although the decoder attention looks a little noisy, it is correct: [image].

So I decided to train the prior only (with the posterior and decoder frozen). However, the prior does not converge to the Z learned by the posterior, and the KL divergence stays at the 2*10^3 level. The logvar predicted by the posterior is very small compared to the single-speaker (LJSpeech) case, and the samples (I mean samples, eps = self.posterior.reparameterize(mu, logvar, self.n_sample)) are nearly equal to mu. Thus the log-probs are very high (they even become positive, whereas they are negative for LJSpeech): [image]. I don't know whether this can be a problem for the flow-based model. 🤔

seekerzz avatar Oct 15 '21 01:10 seekerzz
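
A small sketch of why a very small posterior variance pushes the log-probabilities positive (the numbers are purely illustrative): the Gaussian log-density contains a -0.5 * logvar term, so as logvar becomes very negative, the density evaluated at the near-mean samples becomes arbitrarily large.

```python
import math
import torch

def gaussian_log_prob(x, mu, logvar):
    # log N(x; mu, sigma^2) with logvar = log(sigma^2).
    return -0.5 * (math.log(2.0 * math.pi) + logvar + (x - mu) ** 2 / logvar.exp())

mu = torch.zeros(80)
logvar = torch.full((80,), -10.0)                        # sigma ~ 6.7e-3: near-deterministic posterior
eps = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # samples land almost exactly on mu
print(gaussian_log_prob(eps, mu, logvar).mean())         # positive (~ +4 per dimension here)
```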

@seekerzz Could you share any synthesized samples?

wizardk avatar Dec 30 '21 08:12 wizardk

Hi, I have met the same problem when I added a VQ encoder after the posterior and prior encoders. The KL was around 1e+4 and would not converge. Did you get it to work in the end?

whh07141 avatar Apr 06 '22 07:04 whh07141