GradTTS Wave samples generated

Thank you for your great work and for sharing it. Could you please release some wave samples? Or could you briefly evaluate the quality of the synthesized wavs you got: are they as good as the original paper claims?

Jul 19 '21 03:07 Liujingxiu23

We have some preliminary results, but they are only average, so we are adjusting the training strategy and trying more datasets (including a Mandarin dataset). We will post an update when we have new results.

Jul 19 '21 04:07 WelkinYang

@WelkinYang Have you figured out why the samples were only average before?

Jul 28 '21 05:07 rishikksh20

We found that the vocoder was the cause of the poor audio quality. With the pre-trained WaveGlow (https://github.com/NVIDIA/waveglow) the generated audio is poor, whereas with the pre-trained HiFi-GAN model (https://github.com/jik876/hifi-gan) we can reach the quality reported in the original paper. We also checked the performance on the Mandarin dataset, where both pronunciation and duration are worse. We are continuing to experiment with replacing the structure of the text encoder (the performance of the transformer on the Mandarin dataset is questionable), and with removing MAS (monotonic alignment search) in favor of explicit duration modeling, as in FastSpeech.
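To illustrate what explicit duration modeling means here, below is a minimal sketch of a FastSpeech-style length regulator. It is not code from this repository: the function name `length_regulate` and the plain-list representation are assumptions made for illustration, and a real model would operate on tensors and take durations from a trained duration predictor.

```python
from typing import List, Sequence


def length_regulate(
    encoder_outputs: Sequence[Sequence[float]],
    durations: Sequence[int],
) -> List[List[float]]:
    """Expand per-token encoder vectors to frame level.

    Each token vector is repeated `durations[i]` times, so the output
    length equals sum(durations). Hypothetical helper for illustration.
    """
    if len(encoder_outputs) != len(durations):
        raise ValueError("need exactly one duration per encoder output")

    frames: List[List[float]] = []
    for vector, duration in zip(encoder_outputs, durations):
        if duration < 0:
            raise ValueError("durations must be non-negative")
        # Copy the vector so that output frames do not share list objects.
        frames.extend(list(vector) for _ in range(duration))
    return frames


if __name__ == "__main__":
    # Two tokens with 2-dimensional encodings, lasting 3 and 1 frames.
    tokens = [[0.1, 0.2], [0.3, 0.4]]
    print(length_regulate(tokens, [3, 1]))
```

With explicit durations the alignment between text tokens and mel frames is fixed by the predicted counts, instead of being searched for during training as MAS does.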

Jul 28 '21 05:07 WelkinYang

@WelkinYang Exactly. I am also working on replacing the whole text encoder with the FS2 encoder, like Diff-TTS.

Jul 28 '21 06:07 rishikksh20
