Comprehensive-Transformer-TTS

Are sentence and word duration losses necessary for unsupervised alignment?

Open xiaoyangnihao opened this issue 3 years ago • 3 comments

Are sentence and word duration losses necessary for unsupervised alignment to get a robust duration prediction?

xiaoyangnihao avatar Apr 27 '22 13:04 xiaoyangnihao

Hi @xiaoyangnihao, I'm not sure about the robustness, but they help with the correctness (accuracy) of pauses and hence naturalness. The effect is greatest when your dataset has complex punctuation rules.

keonlee9420 avatar May 01 '22 04:05 keonlee9420

> Hi @xiaoyangnihao, I'm not sure about the robustness, but they help with the correctness (accuracy) of pauses and hence naturalness. The effect is greatest when your dataset has complex punctuation rules.

Thanks for your reply. By the way, in the paper "One TTS Alignment To Rule Them All", the alignment module uses encoder outputs and mels as its inputs, whereas in your repo the alignment model uses text embeddings and mels as inputs. Have you run an experiment comparing this difference?

xiaoyangnihao avatar May 11 '22 14:05 xiaoyangnihao

I just followed NeMo's implementation, and I guess there is no specific reason for that.

keonlee9420 avatar May 27 '22 01:05 keonlee9420
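For context on the question above: the two variants differ only in which text-side features (raw text embeddings vs encoder outputs) are paired with mel frames inside the aligner. Below is a minimal NumPy sketch of the distance-based soft alignment used by aligners of this family ("One TTS Alignment To Rule Them All"-style); it is a simplified illustration, not the repo's actual implementation, which applies learned convolutional projections before the distance computation.

```python
import numpy as np

def soft_alignment(text_feats, mel_feats):
    """Soft text-to-mel alignment via a pairwise-distance energy map.

    text_feats: (T_text, D) -- either text embeddings or encoder outputs,
                already projected to the same dimension as the mel features
    mel_feats:  (T_mel, D)
    returns:    (T_mel, T_text) soft alignment; each row sums to 1
    """
    # Negative squared L2 distance serves as the attention energy:
    # mel frames attend to nearby text positions in feature space.
    diff = mel_feats[:, None, :] - text_feats[None, :, :]  # (T_mel, T_text, D)
    energy = -np.sum(diff ** 2, axis=-1)                   # (T_mel, T_text)
    # Softmax over the text axis, with max-subtraction for stability.
    energy -= energy.max(axis=-1, keepdims=True)
    attn = np.exp(energy)
    return attn / attn.sum(axis=-1, keepdims=True)

# Usage: 10 mel frames aligned against 4 text tokens in a 3-dim feature space.
rng = np.random.default_rng(0)
A = soft_alignment(rng.normal(size=(4, 3)), rng.normal(size=(10, 3)))
print(A.shape)  # (10, 4)
```

Whether the text side is a raw embedding or an encoder output only changes how much context each text position carries before the distance is computed; the alignment machinery itself is unchanged, which may be why NeMo's choice was adopted without a dedicated ablation.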