
About the train_tacotron loss definition

Open MorganCZY opened this issue 5 years ago • 4 comments

In your implementation of Tacotron, the decoder's mel_outputs are fed into the postnet (a CBHG) and then a linear layer, producing a new output with the same dimensions as mel_outputs. You did this so the model can be connected to WaveRNN. However, the loss function is defined as:

```python
m1_hat, m2_hat, attention = model(x, m)
m1_loss = F.l1_loss(m1_hat, m)
m2_loss = F.l1_loss(m2_hat, m)
loss = m1_loss + m2_loss
```

Both m1_hat (which is mel_outputs) and m2_hat (the output after post-processing) are constrained toward the true mel by an L1 loss. If I understand correctly, does this postnet, together with the linear layer, still have any effect? I mean, the decoder outputs of a well-trained model should already be very close to the true mel. Why would you still add this postnet?

MorganCZY avatar Jul 29 '19 09:07 MorganCZY

@MorganCZY I'm just following the papers: they recommend the postnet, so I went with it without questioning it much. The Tacotron 1 model is very lightweight as far as parameter counts go, so I think it can't do any harm to help it a little.

fatchord avatar Aug 14 '19 16:08 fatchord

Hi @fatchord, I read the paper and your code, and if I'm not misunderstanding, m1_hat is the mel-scale output and m2_hat is the linear-scale output. Why do you compute the loss for both outputs against the mel-scale m only? Shouldn't m2_hat be compared against a separate linear-scale target m2?

ghost avatar Oct 17 '19 18:10 ghost

@noeruh I'm not predicting the linear-scale output, only mels, both before and after the postnet. The reason is so that I can condition the WaveRNN model on them later.
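In other words, the two-stage pipeline looks roughly like this (a sketch with hypothetical function names, not the repo's actual API):

```python
# Illustrative sketch of the two-stage TTS pipeline; the names `tacotron`,
# `wavernn`, and `synthesize` are hypothetical, not the actual fatchord/WaveRNN API.
def synthesize(text, tacotron, wavernn):
    # Tacotron predicts mels twice: before (m1_hat) and after (m2_hat) the postnet.
    m1_hat, m2_hat, attention = tacotron(text)
    # Because m2_hat is still mel-scale, it can be fed straight to the
    # mel-conditioned vocoder with no further conversion.
    return wavernn(m2_hat)

# Toy stand-ins just to show the data flow through the pipeline.
tacotron = lambda text: ("m1", "m2", "attn")
wavernn = lambda mel: f"audio from {mel}"
audio = synthesize("hello", tacotron, wavernn)
```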

fatchord avatar Oct 23 '19 09:10 fatchord

From 1703.10135

As mentioned above, the post-processing net’s task is to convert the seq2seq target to a target that can be synthesized into waveforms. Since we use Griffin-Lim as the synthesizer, the post-processing net learns to predict spectral magnitude sampled on a linear-frequency scale. ... We use a simple ℓ1 loss for both seq2seq decoder (mel-scale spectrogram) and post-processing net (linear-scale spectrogram). The two losses have equal weights.

I noticed the same issue when I found that the "linear" spectrograms predicted by your model were no different from the mels. Since your postnet output is still mel-scale, a comment in the code would have been much appreciated, especially since this is a deviation from the paper.
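For anyone else confused by this, the two formulations differ only in the target of the second loss term. A minimal sketch (the tensors here are random placeholders, and the shapes are illustrative assumptions: 80 mel bins, 513 linear-frequency bins):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical targets and predictions, shapes (batch, bins, frames).
mel = torch.randn(2, 80, 100)            # mel-scale target
linear = torch.randn(2, 513, 100)        # linear-scale target (paper only)
m1_hat = torch.randn(2, 80, 100)         # decoder output, mel-scale in both cases
m2_hat_paper = torch.randn(2, 513, 100)  # paper's postnet output: linear-scale
m2_hat_repo = torch.randn(2, 80, 100)    # this repo's postnet output: still mel-scale

# Tacotron paper (1703.10135): postnet predicts a linear spectrogram for Griffin-Lim,
# and the two L1 losses (mel + linear) get equal weight.
loss_paper = F.l1_loss(m1_hat, mel) + F.l1_loss(m2_hat_paper, linear)

# This repo: both terms use the mel target, so the postnet output can
# condition WaveRNN directly.
loss_repo = F.l1_loss(m1_hat, mel) + F.l1_loss(m2_hat_repo, mel)
```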

Still, the model performs very well and your implementation is the cleanest that I have seen to date. Thanks @fatchord.

ghost avatar Aug 07 '20 06:08 ghost