Questions regarding the pretrained text aligner
Hi,
Thanks for the great work! I've been working with the model as a backbone for Mandarin TTS.
Recently I noticed a couple of problems with the pretrained text aligner, and I wonder whether they could disturb the generation of the ground-truth (gt) durations and the training of the duration predictor (my synthesized speech has shown defects in phrase breaking):
- It seems AuxiliaryASR was trained with a different symbol dictionary, one that includes the "sos", "eos" and "unk" tokens. Normally this would not be a problem, but since the aligner is fine-tuned in the first stage using TMA, I suspect it could confuse ASRS2S decoding. At https://github.com/yl4579/StyleTTS/blob/main/Utils/ASR/models.py#L128 the text input is randomly masked with "unk" tokens, whose index is set to 3, yet in the TTS model a 3 in the text input maps to the comma (","). I guess this is not much of a problem overall, but it may point to a mismatch in text processing between the pretraining (as AuxiliaryASR) and fine-tuning (in the TTS model) stages.
- The gt durations derived from the text aligner seem problematic around phrase breaks (those with a corresponding blank in the text): the aligner tends to assign a long duration to the last phoneme before the break rather than to the break (the blank) itself. For example, in an utterance of "abc de" with a 10-frame pause between "abc" and "de", the derived gt durations would probably put about 10 frames on "c" and only 2 or so on " ", whereas we would expect the 10 pause frames to be assigned to " ". I take this to be inherent to the ASR-based alignment approach, since an ASR model is not meant to identify blanks and the CTC loss treats blanks in its own particular way. Still, correct gt durations for phrase breaks seem crucial for expanding phonemes into frames correctly.
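To make the two points concrete, here is a minimal Python sketch. The symbol tables and frame paths are made up for illustration (only "index 3 is unk in the ASR dictionary but a comma in the TTS one" and the "abc de" numbers come from my observations above), and the helper names `index_collisions`, `hard_alignment` and `durations_from_alignment` are hypothetical, not functions from the repo.

```python
import numpy as np

# Toy symbol tables (assumed for illustration, not copied from the repo).
ASR_SYMBOLS = ["<pad>", "<sos>", "<eos>", "<unk>", "a", "b", "c", "d", "e", " "]
TTS_SYMBOLS = ["$", ";", ":", ",", "a", "b", "c", "d", "e", " "]


def index_collisions(asr_symbols, tts_symbols):
    """List (index, asr_symbol, tts_symbol) where one index means two different symbols."""
    return [
        (i, a, t)
        for i, (a, t) in enumerate(zip(asr_symbols, tts_symbols))
        if a != t
    ]


def hard_alignment(frame_to_token, n_tokens):
    """Build a one-hot (tokens x frames) matrix from a frame -> token-index path."""
    path = np.asarray(frame_to_token)
    align = np.zeros((n_tokens, len(path)), dtype=int)
    align[path, np.arange(len(path))] = 1
    return align


def durations_from_alignment(alignment):
    """Duration of each token = number of frames aligned to it."""
    return alignment.sum(axis=1)


tokens = list("abc de")  # token 3 is the blank between "abc" and "de"

# What I would expect: 2 frames per phoneme, 10 pause frames on " ".
expected_path = [0] * 2 + [1] * 2 + [2] * 2 + [3] * 10 + [4] * 2 + [5] * 2
# What I seem to get: "c" absorbs most of the pause, " " keeps only 2 frames.
skewed_path = [0] * 2 + [1] * 2 + [2] * 10 + [3] * 2 + [4] * 2 + [5] * 2

expected = durations_from_alignment(hard_alignment(expected_path, len(tokens)))
skewed = durations_from_alignment(hard_alignment(skewed_path, len(tokens)))

print("index collisions:", index_collisions(ASR_SYMBOLS, TTS_SYMBOLS))
print("expected durations:", dict(zip(tokens, expected.tolist())))
print("skewed durations:  ", dict(zip(tokens, skewed.tolist())))
```

Both paths cover the same 20 frames, so the total length is preserved; only the split between "c" and " " differs, and that split is what the duration predictor would learn from.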
I hope I'm not getting anything wrong here. Is there a way to fix these potential problems?