deep-voice-conversion
Did anyone finish sequence-to-sequence attention training?
I wrote this referencing https://github.com/keithito/tacotron, but it does not work. Feeding the ground-truth mel-spectrogram as the decoder input works, but feeding the predicted mel fails. Can anyone give me advice?
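For clarity, here is a minimal PyTorch sketch of the mismatch I mean. `decoder_cell` and `prenet` are hypothetical stand-ins for a Tacotron-style attention decoder, not the actual code from this repo; only the input-feeding logic matters here:

```python
import torch

def decode(decoder_cell, prenet, memory, ground_truth_mel=None, max_steps=400):
    """Run a Tacotron-style decoder; teacher-force when ground_truth_mel is given."""
    batch = memory.size(0)
    frame = memory.new_zeros(batch, 80)  # <GO> frame (80 mel bins assumed)
    state, outputs = None, []
    steps = ground_truth_mel.size(1) if ground_truth_mel is not None else max_steps
    for t in range(steps):
        frame, state = decoder_cell(prenet(frame), memory, state)
        outputs.append(frame)
        if ground_truth_mel is not None:
            # Training/validation: the next pre-net input is the ground-truth
            # frame, so errors never accumulate and the audio sounds fine.
            frame = ground_truth_mel[:, t]
        # Testing: `frame` stays as the *predicted* mel, so any error is fed
        # back in and compounds -- exactly the failure described above.
    return torch.stack(outputs, dim=1)
```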
I also have this issue. The audio from the validation process sounds great, but at test time the predicted mel spectrogram, rather than the ground truth, is fed into the next time step's pre-net, which leads to quite abnormal generated audio. I also found that the alignment images were not diagonal, which suggests the attention mechanism hasn't been learned well. However, I don't know how to adjust the model or the training strategy.
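One remedy for non-diagonal alignments is the guided attention loss from Tachibana et al. (2017, DC-TTS), which penalizes attention mass far from the diagonal. A minimal sketch, assuming the attention weights come out with shape `(batch, decoder_steps, encoder_steps)`:

```python
import torch

def guided_attention_loss(align, g=0.2):
    """Guided attention loss (Tachibana et al., 2017).
    align: attention weights of shape (batch, decoder_steps, encoder_steps)."""
    B, T, N = align.shape
    t = torch.arange(T, device=align.device).float().unsqueeze(1) / T
    n = torch.arange(N, device=align.device).float().unsqueeze(0) / N
    # Weight is ~0 near the diagonal and approaches 1 far from it.
    w = 1.0 - torch.exp(-((n - t) ** 2) / (2 * g * g))
    return (align * w.unsqueeze(0)).mean()
```

Adding this term to the training loss (with a small weight) is a common way to push the attention toward a diagonal alignment early in training; whether it fixes this particular model is untested.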
Yes, any clues on the Seq2Seq+Attention in this network would be great! Please update if anyone finds a solution. Thanks!