FastSpeech2

Pitch loss and energy loss are very large

Open cuongnguyengit opened this issue 3 years ago • 6 comments

Hello.

I am a big fan of your work. I tried to train a FastSpeech2 model for my language, but the results are not very good. I used character-level units with the MFA aligner and a single-speaker dataset. My batch size is 64. After 100k steps, the duration loss and mel loss are quite good (~0.2), but the energy and pitch losses are very large (pitch ~22, energy ~15). During the last 800k steps, the losses stayed the same. The audio does not sound natural.

Can you help me fix this problem? I believe my data is good because it works well for training a Tacotron model.

Thank you.

cuongnguyengit avatar May 10 '21 02:05 cuongnguyengit

@cuongnguyengit Are the MFA boundaries accurate? And what do you think of the results? For example, does the pitch or prosody of the synthesized samples sound strange, or is the output noisy?

ming024 avatar May 11 '21 14:05 ming024

It is difficult to verify by eye that the character-level alignment is good, and maybe I just don't know how, but there is still a problem I can't understand. Even if the durations from MFA are good and the silences are accurate for training, it seems that the predictions will not include the silences, so the result loses its naturalness. In my case, the audio is read faster, without natural breaks. In addition, I noticed that when the letters are arranged in succession, it causes some confusion about word boundaries, which leads to misreadings.

My output is not noisy. The sound is fairly normal, but the voice quality is low and unnatural, and words are sometimes mispronounced.

I will show you the pitch and prosody of the synthesized samples. What do you think about my problem?

Thanks

cuongnguyengit avatar May 11 '21 21:05 cuongnguyengit

[Attached images: fastspeech2_2_993, fastspeech2_2_249]

cuongnguyengit avatar May 13 '21 09:05 cuongnguyengit

@ming024 I don't know whether there is anything unusual here.

cuongnguyengit avatar May 13 '21 09:05 cuongnguyengit

@cuongnguyengit I am just guessing, but maybe you forgot to normalize the pitch and energy features, which would explain why the pitch and energy losses are so large. Just turn on preprocessing.pitch.normalization and preprocessing.energy.normalization in preprocess.yaml. As for the lack of natural pauses in the synthesized audio, you can consider replacing the punctuation in your transcriptions with the "sp" phoneme, which corresponds to a short-pause token.
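
To illustrate why unnormalized features can inflate these losses, here is a minimal sketch, assuming the normalization is a plain z-score over dataset-level statistics and the loss is a mean squared error. The pitch values and the helper names `zscore_normalize` and `mse` are made up for illustration; this is not the repository's preprocessing code.

```python
from statistics import fmean, pstdev


def zscore_normalize(values):
    """Return (normalized values, mean, std) using statistics of `values`."""
    mean = fmean(values)
    std = pstdev(values)  # population standard deviation
    if std == 0:
        raise ValueError("cannot normalize a constant sequence")
    return [(v - mean) / std for v in values], mean, std


def mse(pred, target):
    """Mean squared error between two equal-length sequences."""
    return fmean([(p - t) ** 2 for p, t in zip(pred, target)])


# Hypothetical per-phoneme F0 values in Hz (illustrative numbers only).
raw_pitch = [110.0, 125.0, 140.0, 180.0, 210.0, 95.0]
normalized_pitch, pitch_mean, pitch_std = zscore_normalize(raw_pitch)

# A hypothetical predictor that is off by 10 Hz on every phoneme.
raw_pred = [v + 10.0 for v in raw_pitch]
norm_pred = [(v - pitch_mean) / pitch_std for v in raw_pred]

raw_loss = mse(raw_pred, raw_pitch)           # error measured in Hz^2
norm_loss = mse(norm_pred, normalized_pitch)  # same error divided by std^2

print(f"raw MSE: {raw_loss:.3f}, normalized MSE: {norm_loss:.3f}")
```

The same prediction error gives a loss that is smaller by a factor of the squared standard deviation once the targets are normalized, so a large pitch or energy loss may reflect the scale of the features rather than a worse model.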

ming024 avatar May 26 '21 08:05 ming024

How should "sp" be added? Does it have to follow fixed rules, or be done in some smarter way, to sound natural?

Thank you.
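
For reference, the simplest rule-based reading of that suggestion can be sketched as follows. This is only an assumption about how it might be done: the punctuation set, the `PAUSE_TOKEN` constant, and the function `insert_short_pauses` are hypothetical names for illustration, not part of the repository.

```python
import re

PAUSE_TOKEN = "sp"  # short-pause token mentioned in the thread

# Assumed punctuation set; adjust it for the target language.
PUNCTUATION = re.compile(r"[,.;:!?]+")


def insert_short_pauses(text):
    """Replace each run of punctuation with a standalone pause token.

    Returns a list of tokens (words and pause tokens) in reading order.
    """
    spaced = PUNCTUATION.sub(f" {PAUSE_TOKEN} ", text)
    return spaced.split()


print(insert_short_pauses("Hello, world. How are you?"))
```

A fixed rule like this treats every punctuation mark as the same pause. Whether that sounds natural enough depends on the data, and the transcriptions used for alignment and training would need to follow the same convention.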

cuongnguyengit avatar Jun 25 '21 13:06 cuongnguyengit