FastSpeech2
Pitch loss and energy loss are too large
Hello.
I am a big fan of your work. I tried to train a FastSpeech2 model for my language, but the results are not good. I used character-level inputs for the MFA aligner and a single-speaker dataset. My batch size is 64. After 100k steps, the duration loss and mel loss are quite good (~0.2), but the energy loss and pitch loss are very large (pitch ~22, energy ~15). Over the last 800k steps the losses stayed the same, and the synthesized audio does not sound natural.
Can you help me fix this problem? I believe my data is good, because it works well with a Tacotron model.
Thank you.
@cuongnguyengit Are the MFA boundaries accurate? And what do you think of the results? For example, does the pitch or prosody of the synthesized samples sound strange, or is the output noisy?
It is difficult to verify by eye that the character-level alignment is good, so there may be a problem I don't understand. Even if the durations from MFA are good and the silences are accurate during training, the predictions will not include those silences, so the result loses naturalness. In my case the audio is read too fast, without natural pauses. I also noticed that when letters are run together in succession it causes some confusion between words, which leads to wrong pronunciations.
My output is not noisy. The sounds are fairly normal, but the voice quality is low and not natural, with occasional mispronunciations.
I will share the pitch and prosody of the synthesized samples. What do you think about my problem?
Thanks
@ming024 I don't know whether there is anything special here.
@cuongnguyengit I am just guessing that maybe you forgot to normalize the pitch and energy features, so the pitch and energy losses are very large. Just turn on preprocessing.pitch.normalization and preprocessing.energy.normalization in preprocess.yaml.
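For reference, normalization here means z-scoring each feature with dataset-level statistics, which brings raw pitch values (hundreds of Hz) down to a unit-variance range and shrinks the MSE loss accordingly. A minimal sketch of the idea (the function name and toy values are illustrative, not the repository's exact code):

```python
import numpy as np

def normalize(values, mean, std):
    """Z-score a feature array using dataset-level statistics."""
    return (values - mean) / std

# Toy example: raw pitch in Hz has a large dynamic range, which
# inflates the pitch MSE loss if fed to the model unnormalized.
pitch = np.array([120.0, 180.0, 240.0])
mean, std = pitch.mean(), pitch.std()
normalized = normalize(pitch, mean, std)
print(normalized.mean(), normalized.std())  # ~0.0 and ~1.0
```

The same statistics must be saved and reused at synthesis time to de-normalize the predicted values.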
For the lack of natural pauses in the synthesized audio samples, you can consider replacing the punctuation marks in your transcriptions with the "sp" phoneme, which corresponds to a short-pause token.
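A minimal sketch of that substitution on raw transcription text (the punctuation set and the "sp" token name follow the suggestion above; adapt both to your language and preprocessing pipeline):

```python
import re

def replace_punct_with_sp(text):
    """Replace punctuation marks with the 'sp' (short pause) token."""
    # Replace each run of pause-like punctuation with a single ' sp '.
    text = re.sub(r"[,.!?;:]+", " sp ", text)
    # Collapse the extra whitespace introduced by the substitution.
    return re.sub(r"\s+", " ", text).strip()

print(replace_punct_with_sp("hello, world. how are you?"))
# -> "hello sp world sp how are you sp"
```

The "sp" token then receives its own duration, pitch, and energy targets during preprocessing, so the model can learn pause lengths instead of skipping them.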
How should the "sp" tokens be added? Should they follow fixed rules, or is there a smarter way to place them naturally?
Thank you.