Comprehensive-Transformer-TTS

Are sentence and word duration losses necessary for unsupervised alignment?

Open xiaoyangnihao opened this issue 3 years ago • 3 comments

Are sentence and word duration losses necessary for unsupervised alignment to get a robust duration prediction?

xiaoyangnihao avatar Apr 27 '22 13:04 xiaoyangnihao

Hi @xiaoyangnihao, I'm not sure about the robustness, but they help with the correctness (accuracy) of pauses and hence naturalness. The effect is greatest when your dataset has complex punctuation rules.

keonlee9420 avatar May 01 '22 04:05 keonlee9420

> Hi @xiaoyangnihao, I'm not sure about the robustness, but they help with the correctness (accuracy) of pauses and hence naturalness. The effect is greatest when your dataset has complex punctuation rules.

Thanks for your reply. By the way, in the paper "One TTS Alignment To Rule Them All", the alignment module uses encoder outputs and mels as its inputs, whereas in your repo the alignment model uses text embeddings and mels as inputs. Have you run an experiment comparing this difference?

xiaoyangnihao avatar May 11 '22 14:05 xiaoyangnihao

I just followed NeMo's implementation, and I guess there is no specific reason for that.

keonlee9420 avatar May 27 '22 01:05 keonlee9420
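For context on the question above: the two variants differ only in which text-side features (raw text embeddings vs encoder outputs) are paired with mel frames inside the aligner. Below is a minimal NumPy sketch of the distance-based soft alignment used by aligners of this family ("One TTS Alignment To Rule Them All"-style); it is a simplified illustration, not the repo's actual implementation, which applies learned convolutional projections before the distance computation.

```python
import numpy as np

def soft_alignment(text_feats, mel_feats):
    """Soft text-to-mel alignment via a pairwise-distance energy map.

    text_feats: (T_text, D) -- either text embeddings or encoder outputs,
                already projected to the same dimension as the mel features
    mel_feats:  (T_mel, D)
    returns:    (T_mel, T_text) soft alignment; each row sums to 1
    """
    # Negative squared L2 distance serves as the attention energy:
    # mel frames attend to nearby text positions in feature space.
    diff = mel_feats[:, None, :] - text_feats[None, :, :]  # (T_mel, T_text, D)
    energy = -np.sum(diff ** 2, axis=-1)                   # (T_mel, T_text)
    # Softmax over the text axis, with max-subtraction for stability.
    energy -= energy.max(axis=-1, keepdims=True)
    attn = np.exp(energy)
    return attn / attn.sum(axis=-1, keepdims=True)

# Usage: 10 mel frames aligned against 4 text tokens in a 3-dim feature space.
rng = np.random.default_rng(0)
A = soft_alignment(rng.normal(size=(4, 3)), rng.normal(size=(10, 3)))
print(A.shape)  # (10, 4)
```

Whether the text side is a raw embedding or an encoder output only changes how much context each text position carries before the distance is computed; the alignment machinery itself is unchanged, which may be why NeMo's choice was adopted without a dedicated ablation.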