
Difference Between Current Code and Original Paper

Open candlewill opened this issue 7 years ago • 1 comment

  1. Learning rate decay. In the original paper, the learning rate starts at 0.001 and is reduced to 0.0005, 0.0003, and 0.0001 after 500K, 1M, and 2M global steps respectively, whereas the code uses a fixed learning rate of 0.001 (see the schedule sketch after this list).

  2. No batch normalization for conv1d in the encoder (https://github.com/Kyubyong/tacotron/issues/12). The paper states that batch normalization is used for all convolutional layers (a sketch follows this list).

  3. Wrong conv1d sizes in the CBHG of the post-processing net (https://github.com/Kyubyong/tacotron/issues/13). The sizes the paper lists are noted in the sketch after this list.

  4. The CBHG in the post-processing net does not use a residual connection. This may be a compromise, since a residual can only be added when the input and output dimensions match; the paper is unclear on this point (see the projection sketch after this list).

  5. The last layer of the decoder uses a fully connected layer to predict the mel spectrogram. The paper says that predicting r frames at each decoder step is an important trick. It is unclear whether T = T' or T ≠ T' in the mapping [N, T, C] -> [N, T', C * r]. The code keeps T = T', but T' = T / r with frame reduction is also possible (see the reshape sketch after this list).

  6. Decoder input problem. The paper says that, in inference, only the last frame of the r predictions is fed into the next decoder step. The code, however, feeds all of the r frames. Training has the same problem: per the paper, every r-th ground-truth frame should be fed into the decoder, rather than all of the r frames (see the decoding sketch after this list).
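
On point 1, a minimal sketch of the paper's piecewise learning-rate schedule (the boundaries and rates are from the paper; the function name is made up for illustration):

```python
def paper_learning_rate(global_step):
    """Piecewise-constant schedule described in the paper: 0.001 until
    500K steps, then 0.0005 until 1M, 0.0003 until 2M, 0.0001 after."""
    if global_step < 500000:
        return 0.001
    elif global_step < 1000000:
        return 0.0005
    elif global_step < 2000000:
        return 0.0003
    return 0.0001
```

In TensorFlow 1.x the same schedule can be expressed with `tf.train.piecewise_constant`.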
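On points 2 and 3, a hedged sketch of a paper-faithful conv1d unit with batch normalization; `conv1d_bn` is a hypothetical helper, and the projection sizes in the trailing comment are the ones the paper's Table 1 lists:

```python
import tensorflow as tf

def conv1d_bn(inputs, filters, kernel_size, activation, is_training):
    # conv -> batch norm -> activation; the paper says batch
    # normalization is used for all convolutional layers, but does
    # not pin down the ordering (this ordering is an assumption).
    out = tf.layers.conv1d(inputs, filters=filters, kernel_size=kernel_size,
                           padding="same")
    out = tf.layers.batch_normalization(out, training=is_training)
    return activation(out) if activation is not None else out

# Conv1D projection sizes per the paper's Table 1:
#   encoder CBHG:          conv-3-128-ReLU -> conv-3-128-Linear
#   post-processing CBHG:  conv-3-256-ReLU -> conv-3-80-Linear
```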
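On point 4, one common workaround for the dimension mismatch is to project the input before the residual add; this is only a sketch of that idea, not something the paper specifies:

```python
import tensorflow as tf

def residual_add(x, fx):
    """Residual connection that linearly projects x when its last
    dimension differs from fx's (an assumption; the paper does not
    say how, or whether, to handle the mismatch)."""
    x_dim = x.get_shape().as_list()[-1]
    fx_dim = fx.get_shape().as_list()[-1]
    if x_dim != fx_dim:
        x = tf.layers.dense(x, units=fx_dim)  # linear projection
    return x + fx
```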
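On point 5, the two readings differ in whether the time axis shrinks; a NumPy sketch of the T' = T / r reading (frame reduction), which is an assumption about the paper's intent:

```python
import numpy as np

N, T, C, r = 2, 12, 80, 3   # batch, frames, mel bands, reduction factor
mel = np.zeros((N, T, C))

# Frame-reduction reading: group every r consecutive frames into one
# decoder step, so T' = T / r and the channel axis becomes C * r.
assert T % r == 0
grouped = mel.reshape(N, T // r, C * r)   # [N, T', C*r]

# The code's reading instead keeps T' = T and predicts C * r values
# per input step with a dense layer.
```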
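On point 6, a sketch of the feeding rule the paper describes, with hypothetical names; each decoder step emits r frames and only the last one is fed back:

```python
def greedy_decode(decoder_step, go_frame, num_steps):
    """Inference loop per the paper: each step predicts r frames and
    only the last of them is fed to the next step. `decoder_step` is
    a hypothetical callable mapping a [N, C] frame to [N, r, C]."""
    outputs, feed = [], go_frame            # feed: [N, C]
    for _ in range(num_steps):
        frames = decoder_step(feed)         # [N, r, C]
        outputs.append(frames)
        feed = frames[:, -1, :]             # last of the r predictions
    return outputs

# Training analogue: feed every r-th ground-truth frame,
# i.e. targets[:, r-1::r, :], rather than all frames.
```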

candlewill avatar Jun 08 '17 03:06 candlewill

What about the pre-emphasis 0.97?

onyedikilo avatar Jun 08 '17 09:06 onyedikilo
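
For context: the paper reports finding pre-emphasis (0.97) helpful, which would mean applying the filter before spectrogram analysis and undoing it after synthesis. A minimal SciPy sketch (the function names are illustrative):

```python
from scipy.signal import lfilter

def preemphasis(x, coef=0.97):
    """y[t] = x[t] - coef * x[t-1], applied before spectrogram analysis."""
    return lfilter([1.0, -coef], [1.0], x)

def deemphasis(y, coef=0.97):
    """Inverse filter, applied to the synthesized waveform."""
    return lfilter([1.0], [1.0, -coef], y)
```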