Mispronounced words and 44.1 kHz audio

Open tuannvhust opened this issue 3 years ago • 2 comments

Some people claim that mispronunciation is one of the noticeable disadvantages of the VITS model. I have experienced the same problem. Does anybody know what causes the mispronunciation?
I used a 44.1 kHz dataset to train the model. Because of the higher resolution of the data, noise seems to be more noticeable in the synthesized speech. Can anybody give me some suggestions for this problem?

Sep 30 '22 10:09 tuannvhust

It could be a problem with the eSpeak phonemizer. You can edit the text preprocessing scripts so that they accept IPA phonemes directly, then change the phonemes as you need.
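A minimal sketch of that idea, not code from the VITS repository: text wrapped in curly braces is treated as ready-made IPA and passed through untouched, while everything else goes to the normal grapheme-to-phoneme step. The brace convention and the names `phonemize_with_overrides` and `g2p` are assumptions made up for this illustration; `g2p` stands in for whatever phonemizer call the preprocessing script already makes.

```python
import re
from typing import Callable

# Hypothetical convention: anything inside {...} is already IPA.
_IPA_SPAN = re.compile(r"\{([^{}]*)\}")


def phonemize_with_overrides(text: str, g2p: Callable[[str], str]) -> str:
    """Phonemize `text`, but keep {...} spans as hand-written IPA.

    `g2p` is any function mapping plain text to a phoneme string
    (for example, a wrapper around the existing eSpeak-based step).
    """
    parts = []
    pos = 0
    for match in _IPA_SPAN.finditer(text):
        # Plain text before the override goes through the phonemizer.
        plain = text[pos:match.start()].strip()
        if plain:
            parts.append(g2p(plain))
        # The override itself is copied through without changes.
        ipa = match.group(1).strip()
        if ipa:
            parts.append(ipa)
        pos = match.end()
    # Remaining plain text after the last override.
    tail = text[pos:].strip()
    if tail:
        parts.append(g2p(tail))
    return " ".join(parts)


if __name__ == "__main__":
    # Stand-in for a real phonemizer, used only to show the data flow.
    fake_g2p = lambda s: s.upper()
    print(phonemize_with_overrides("I like {təˈmɑːtəʊ} soup", fake_g2p))
```

With a scheme like this, a word that the phonemizer gets wrong can be corrected in the training or inference text by writing its IPA in braces. Every symbol used in an override would still have to exist in the model's symbol table, otherwise it would need to be added or filtered out.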

Jan 11 '23 05:01 nikich340

Hi, I also ran into the mispronunciation issue when using Chinese phonemes as input. Is there any update? It seems that the model trained on the LJSpeech dataset with IPA input does not suffer from mispronunciation, or is it just that English is not my mother tongue, so I cannot notice the mispronounced cases?

Apr 07 '23 01:04 weixsong