WaveRNN
About the train_tacotron loss definition
In your implementation of Tacotron, the decoder's `mel_outputs` are fed into the postnet (a CBHG) and then a linear layer, producing new outputs with the same dims as `mel_outputs`. You did this so that WaveRNN can be connected later. However, the loss function is defined as:

```python
m1_hat, m2_hat, attention = model(x, m)
m1_loss = F.l1_loss(m1_hat, m)
m2_loss = F.l1_loss(m2_hat, m)
loss = m1_loss + m2_loss
```
Both `m1_hat` (which is `mel_outputs`) and `m2_hat` (which is the output after post-processing) are constrained to the true mel by an L1 loss. If I understand correctly, does this postnet, together with the linear layer, still have any effect? I mean, the decoder outputs of a well-trained model should already be very close to the true mel, so why would you still add this postnet?
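For reference, here is a self-contained restatement of that loss with dummy tensors (the shapes and 80 mel bins are assumptions for illustration; the real code gets `m1_hat` and `m2_hat` from the Tacotron forward pass):

```python
import torch
import torch.nn.functional as F

# Dummy tensors with shape (batch, n_mels, frames); the sizes are
# illustrative assumptions, not the repo's actual hyperparameters.
m = torch.randn(4, 80, 200)       # ground-truth mel spectrogram
m1_hat = torch.randn(4, 80, 200)  # decoder output (mel-scale)
m2_hat = torch.randn(4, 80, 200)  # postnet output (also mel-scale here)

m1_loss = F.l1_loss(m1_hat, m)    # decoder output vs. true mel
m2_loss = F.l1_loss(m2_hat, m)    # postnet output vs. the same true mel
loss = m1_loss + m2_loss
```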
@MorganCZY I'm just following the papers - they recommend the postnet so I just went with it without questioning it much. The Tacotron 1 model is very lightweight as far as parameter counts go so I think it can't do any harm to help it a little.
Hi @fatchord, I read the paper and your code. If I'm not mistaken, `m1_hat` is the mel-scale output and `m2_hat` is the linear-scale output, so why do you calculate both outputs against the mel-scale `m` only? Shouldn't we calculate `m2_hat` against a separate linear-scale target `m2`?
@noeruh I'm not predicting the linear-scale output - only mels, both before and after the postnet. The reason is so that I can condition the wavernn model later.
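To make that concrete, here is a minimal sketch of the idea: the postnet maps mels to mels, so its output keeps the dimensions WaveRNN needs for conditioning. A plain `Conv1d` stands in for the repo's CBHG-plus-linear postnet; the interface here is illustrative, not the actual API.

```python
import torch
import torch.nn as nn

n_mels = 80  # assumed number of mel bins

# Stand-in for the postnet (the repo uses a CBHG followed by a linear
# projection); the point is that it maps mels to mels, not mels to linear.
postnet = nn.Conv1d(n_mels, n_mels, kernel_size=5, padding=2)

mel_pre = torch.randn(1, n_mels, 200)  # decoder output
mel_post = postnet(mel_pre)            # refined output, same shape

assert mel_post.shape == mel_pre.shape  # mel-scale in, mel-scale out
# mel_post is what would later condition the WaveRNN vocoder.
```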
From 1703.10135:

> As mentioned above, the post-processing net’s task is to convert the seq2seq target to a target that can be synthesized into waveforms. Since we use Griffin-Lim as the synthesizer, the post-processing net learns to predict spectral magnitude sampled on a linear-frequency scale. ... We use a simple ℓ1 loss for both seq2seq decoder (mel-scale spectrogram) and post-processing net (linear-scale spectrogram). The two losses have equal weights.
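For contrast, a sketch of the loss as the paper describes it, where the post-processing net's target is a linear-scale spectrogram (the 80 mel and 513 linear bins are typical values, assumed here for illustration):

```python
import torch
import torch.nn.functional as F

# Illustrative shapes: 80 mel bins vs. 513 linear-frequency bins
# (e.g. an FFT size of 1024 yields 513 bins).
mel_hat = torch.randn(1, 80, 100)   # seq2seq decoder output (mel-scale)
lin_hat = torch.randn(1, 513, 100)  # post-processing net output (linear-scale)
mel = torch.randn(1, 80, 100)       # ground-truth mel spectrogram
lin = torch.randn(1, 513, 100)      # ground-truth linear spectrogram

# "a simple l1 loss for both ... The two losses have equal weights."
loss = F.l1_loss(mel_hat, mel) + F.l1_loss(lin_hat, lin)
```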
I noticed this same issue when I found that the "linear" spectrograms predicted by your model were no different from the mels. Since your postnet output is still mel-scale, a comment in the code would have been very much appreciated, especially as it is a deviation from the paper.
Still, the model performs very well and your implementation is the cleanest that I have seen to date. Thanks @fatchord.