GST-Tacotron
Reference Encoder Padding
How do we ensure that the padding of the reference mel spectrogram is taken into account when the reference encoder is applied to a batch of mels?
Did you come to any conclusion? I ran into this problem too. Because the GST encoder sees the zero padding, the network can infer the duration of the audio. On my dataset this meant that short lines were pronounced slowly and long ones fast.
I tried using one-dimensional convolutions and masking the zeros before the GRU layer, but that made the tokens work worse.
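One thing that may be worth checking when masking before the GRU: if the reference encoder downsamples along time with strided convolutions, the mask has to be built from the downsampled lengths, not the original mel lengths. Below is a minimal sketch of that bookkeeping. The layer settings (6 layers, kernel 3, stride 2, padding 1) and the function names are assumptions for illustration, not taken from this repository's code.

```python
def conv_out_length(length, kernel_size=3, stride=2, padding=1):
    """Output length of one convolution along the time axis (dilation 1)."""
    return (length + 2 * padding - kernel_size) // stride + 1


def reference_encoder_lengths(mel_lengths, num_layers=6, **conv_kwargs):
    """Valid (unpadded) frame counts after a stack of identical strided convs.

    num_layers and the conv settings are assumed values; adjust them to match
    the actual reference encoder.
    """
    out = []
    for n in mel_lengths:
        for _ in range(num_layers):
            n = conv_out_length(n, **conv_kwargs)
        out.append(n)
    return out


def length_mask(lengths, max_len=None):
    """Boolean mask per sequence: True for real frames, False for padding."""
    max_len = max(lengths) if max_len is None else max_len
    return [[t < n for t in range(max_len)] for n in lengths]


# Two reference mels of 100 and 400 frames, padded to 400 in the batch.
valid = reference_encoder_lengths([100, 400])
mask = length_mask(valid)
```

In PyTorch, lengths like these could be passed to `torch.nn.utils.rnn.pack_padded_sequence` so that the GRU's final state comes from the last real frame rather than from padding. This only keeps the recurrent layer from running over padded steps; the zero padding inside the convolutions still affects frames near the boundary, and nothing in this thread establishes whether it fixes the speaking-rate effect described above.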