GradTTS
Wave samples generated
Thank you for your great work and for sharing it. Could you please release some wave samples? Or could you simply evaluate the quality of the synthesized wavs you got: are they as good as the original paper claimed?
We have some preliminary results, but they are only average, so we are adjusting the training strategy and trying more datasets (a Mandarin dataset). We will post an update when we have new results.
@WelkinYang Have you figured out why the samples were only average before?
We found that the vocoder was the cause. With the pre-trained WaveGlow model (https://github.com/NVIDIA/waveglow) the generated audio is poor, while with the pre-trained HiFi-GAN model (https://github.com/jik876/hifi-gan) we can reach the quality reported in the original paper. We also verified the performance on a Mandarin dataset, where both pronunciation and duration are worse. We are continuing to experiment with replacing the structure of the text encoder (the Transformer's performance on the Mandarin dataset is questionable) and with removing MAS in favor of explicit duration modeling, as in FastSpeech.
@WelkinYang Exactly. I am also working on replacing the whole text encoder with an FS2 encoder, as in Diff-TTS.
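For anyone following the thread who has not seen explicit duration modeling before, here is a minimal sketch of the FastSpeech-style length regulation step it relies on: each token's encoder output is repeated for a given number of frames instead of being aligned by MAS. This is not code from this repo; the function name `length_regulate` and the plain-list representation are assumptions made for illustration, and a real model would operate on tensors with durations coming from a trained duration predictor.

```python
from typing import List, Sequence


def length_regulate(
    encoder_outputs: Sequence[Sequence[float]],
    durations: Sequence[int],
) -> List[List[float]]:
    """Expand per-token vectors to per-frame vectors.

    encoder_outputs: one feature vector per input token (hypothetical layout).
    durations: number of output frames for each token (non-negative integers).
    """
    if len(encoder_outputs) != len(durations):
        raise ValueError("need exactly one duration per token")

    frames: List[List[float]] = []
    for vector, duration in zip(encoder_outputs, durations):
        if duration < 0:
            raise ValueError("durations must be non-negative")
        # Copy the vector once per frame so frames do not share storage.
        frames.extend(list(vector) for _ in range(duration))
    return frames


# Three tokens with 2-dimensional features; the second token gets zero frames.
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
expanded = length_regulate(tokens, [2, 0, 3])
print(len(expanded))  # total frames equals the sum of the durations
```

The output length is simply the sum of the durations, so the quality of duration prediction directly controls speaking rate and rhythm.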