Comprehensive-Transformer-TTS
Are sent and word duration losses necessary for unsupervised alignment?
Are the sent and word duration losses necessary for unsupervised alignment to get a robust duration prediction?
Hi @xiaoyangnihao, I'm not sure about the robustness, but they do improve the correctness (accuracy) of pauses and hence the naturalness. The effect is maximized when your dataset has complex punctuation rules.
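For readers unfamiliar with how such losses are typically layered on top of the usual phoneme-level term, here is a minimal sketch, not the repo's exact code: predicted per-phoneme durations are summed over each word span and over the whole utterance, then compared against the same aggregations of the target durations. The `word_ids` input and the log-domain MSE formulation are assumptions for illustration; padding masks are omitted for brevity.

```python
import torch
import torch.nn.functional as F

def duration_losses(log_d_pred, d_target, word_ids):
    """Phoneme-, word-, and sentence-level duration losses (a sketch).

    log_d_pred: (B, T) predicted log-durations per phoneme
    d_target:   (B, T) target durations in frames per phoneme
    word_ids:   (B, T) index of the word each phoneme belongs to
                (hypothetical input; how word spans are stored
                varies by implementation)
    """
    # Phoneme-level: MSE in the log domain, as in FastSpeech-style models
    phon_loss = F.mse_loss(log_d_pred, torch.log(d_target.float() + 1.0))

    # Word-level: scatter-add phoneme durations into per-word totals
    d_pred = torch.exp(log_d_pred) - 1.0
    n_words = int(word_ids.max().item()) + 1
    w_pred = torch.zeros(d_pred.size(0), n_words, device=d_pred.device)
    w_tgt = torch.zeros_like(w_pred)
    w_pred.scatter_add_(1, word_ids, d_pred)
    w_tgt.scatter_add_(1, word_ids, d_target.float())
    word_loss = F.mse_loss(torch.log(w_pred + 1.0), torch.log(w_tgt + 1.0))

    # Sentence-level: total predicted length vs. total target length
    sent_loss = F.mse_loss(
        torch.log(d_pred.sum(dim=1) + 1.0),
        torch.log(d_target.float().sum(dim=1) + 1.0),
    )
    return phon_loss, word_loss, sent_loss
```

The word- and sentence-level terms penalize durations that are locally plausible but add up to wrong word lengths or a wrong overall pace, which is where pause placement around punctuation tends to go wrong.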
Thanks for your reply. By the way, in the paper "One TTS Alignment To Rule Them All" the alignment module uses the encoder outputs and the mel as inputs for alignment, whereas in your repo the alignment model uses the text embedding and the mel as inputs. Have you run an experiment comparing the two?
I just followed NeMo's implementation, and I guess there is no specific reason for that.
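To make the discussed difference concrete, here is a rough sketch of the aligner's soft-alignment computation in the style of the paper (conv projections of both modalities, pairwise L2 distance, softmax over the text axis). The class name and the `use_encoder_out` switch are hypothetical, added only to show where the two input choices diverge; this is not the repo's or NeMo's actual module.

```python
import torch
import torch.nn as nn

class AlignerSketch(nn.Module):
    """Soft text-to-mel alignment in the style of "One TTS Alignment
    To Rule Them All". `use_encoder_out` selects between raw text
    embeddings and encoder outputs as the text-side features; it is a
    hypothetical flag, not an option in the repo. Assumes both text
    feature variants share the same dimensionality."""

    def __init__(self, text_dim, mel_dim, attn_dim, use_encoder_out=False):
        super().__init__()
        self.use_encoder_out = use_encoder_out
        # 1D conv stacks project both modalities into a shared space
        self.key_proj = nn.Sequential(
            nn.Conv1d(text_dim, attn_dim, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(attn_dim, attn_dim, kernel_size=1),
        )
        self.query_proj = nn.Sequential(
            nn.Conv1d(mel_dim, attn_dim, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(attn_dim, attn_dim, kernel_size=1),
        )

    def forward(self, text_emb, encoder_out, mel):
        # text_emb / encoder_out: (B, T_text, text_dim); mel: (B, T_mel, mel_dim)
        text_feat = encoder_out if self.use_encoder_out else text_emb
        keys = self.key_proj(text_feat.transpose(1, 2))   # (B, D, T_text)
        queries = self.query_proj(mel.transpose(1, 2))    # (B, D, T_mel)
        # Pairwise negative squared L2 distance -> (B, T_mel, T_text)
        dist = (queries.transpose(1, 2).unsqueeze(2)
                - keys.transpose(1, 2).unsqueeze(1)).pow(2).sum(-1)
        attn_logprob = torch.log_softmax(-dist, dim=-1)
        return attn_logprob  # fed to the forward-sum (CTC-like) loss
```

Intuitively, raw text embeddings give the aligner a purely lexical view of each token, while encoder outputs carry contextual information; whether that changes alignment quality in practice is exactly the unanswered experimental question raised above.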