Difference Between Current Code and Original Paper
- **Learning rate decay.** In the original paper, the learning rate starts at 0.001 and is reduced to 0.0005, 0.0003, and 0.0001 after 500K, 1M, and 2M global steps, respectively, while the code uses a fixed learning rate of 0.001.
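  A minimal sketch (not taken from the repo) of the piecewise-constant schedule described in the paper:

  ```python
  def paper_learning_rate(global_step):
      """Learning-rate schedule from the paper: 0.001 -> 0.0005 -> 0.0003 -> 0.0001."""
      if global_step < 500_000:
          return 1e-3
      if global_step < 1_000_000:
          return 5e-4
      if global_step < 2_000_000:
          return 3e-4
      return 1e-4
  ```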
- No batch normalization for conv1d layers in the encoder (https://github.com/Kyubyong/tacotron/issues/12).
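  A sketch of what a conv1d step followed by batch normalization could look like; `tf.keras` layers are used here as an assumption, not the repo's own helpers:

  ```python
  import tensorflow as tf

  def conv1d_bn(inputs, filters, kernel_size, training):
      """Conv1D -> batch normalization -> ReLU, as the paper describes for the encoder."""
      x = tf.keras.layers.Conv1D(filters, kernel_size, padding="same")(inputs)
      x = tf.keras.layers.BatchNormalization()(x, training=training)
      return tf.keras.layers.Activation("relu")(x)
  ```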
- Wrong size of conv1d in the CBHG of the post-processing net (https://github.com/Kyubyong/tacotron/issues/13).
- The CBHG structure in the post-processing net does not use a **residual connection**. This may be a compromise, because the residuals are added only if the dimensions are the same. The original paper is unclear on this point.
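  A NumPy illustration (made-up shapes) of why the residual can only be added when the dimensions agree:

  ```python
  import numpy as np

  def maybe_residual(block_out, block_in):
      """Add a residual connection only when the shapes match."""
      if block_out.shape == block_in.shape:
          return block_out + block_in
      return block_out  # dimensions differ, so the residual is skipped

  x = np.random.rand(2, 100, 128)   # CBHG input,  [N, T, 128]
  y = np.random.rand(2, 100, 80)    # CBHG output, [N, T, 80]
  z = maybe_residual(y, x)          # shapes differ -> no residual added
  ```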
- The last layer of the decoder uses a fully connected layer to predict the mel spectrogram. The paper says that predicting `r` frames at each decoder step is an important trick. It is unclear whether `T = T'` or `T != T'` in the reshape `[N, T, C] -> [N, T', C * r]`. The code keeps `T = T'`, but it is also possible that `T' = T / r` with frame reduction.
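  A NumPy sketch of the two interpretations, with illustrative shapes only:

  ```python
  import numpy as np

  N, T, C, r = 2, 8, 80, 4                  # batch, mel frames, mel channels, reduction factor

  mel = np.random.rand(N, T, C)             # ground-truth mel spectrogram, [N, T, C]

  # Frame reduction (T' = T / r): every r consecutive frames become one decoder target,
  # so each decoder step predicts r frames and the time axis shrinks.
  reduced = mel.reshape(N, T // r, C * r)   # [N, T/r, C*r]

  # No reduction (T' = T, what the code keeps): the decoder still emits C * r values
  # per step, but the number of steps stays equal to T.
  kept = np.random.rand(N, T, C * r)        # [N, T, C*r]
  ```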
- **Decoder input problem.** The paper says that, at inference time, only the **last** frame of the `r` predictions is fed into the decoder (except for the last step). However, the code feeds **all** of the `r` frames. The same problem exists during training: every `r`-th ground-truth frame is fed into the decoder, rather than all of the `r` frames.
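  A small NumPy sketch of the two inference-time feeding strategies (illustrative shapes only):

  ```python
  import numpy as np

  N, C, r = 2, 80, 4
  prediction = np.random.rand(N, C * r)     # decoder output for one step: r mel frames

  # Paper: feed only the last of the r predicted frames into the next decoder step.
  next_input_paper = prediction[:, -C:]     # [N, C]

  # Code: feed all r predicted frames into the next decoder step.
  next_input_code = prediction              # [N, C*r]
  ```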
- What about the pre-emphasis of 0.97?
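  A sketch of pre-emphasis with coefficient 0.97 on a NumPy waveform (the repo may implement it differently):

  ```python
  import numpy as np

  def preemphasis(wav, coeff=0.97):
      """Apply the pre-emphasis filter y[t] = x[t] - coeff * x[t-1]."""
      return np.append(wav[0], wav[1:] - coeff * wav[:-1])
  ```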