GradTTS

Wave samples generated

Open Liujingxiu23 opened this issue 4 years ago • 4 comments

Thank you for your great work and for sharing it. Could you please release some wave samples? Or could you briefly evaluate the quality of the synthesized wavs you got: are they as good as the original paper claims?

Liujingxiu23, Jul 19 '21 03:07

We have some preliminary results, but they are average, so we are adjusting the training strategy and trying more datasets (a Mandarin dataset). We will post an update when we have new results.

WelkinYang, Jul 19 '21 04:07

@WelkinYang Have you figured out why the samples were average before?

rishikksh20, Jul 28 '21 05:07

We found that the vocoder was the cause: with the pre-trained WaveGlow (https://github.com/NVIDIA/waveglow) the generated audio is poor, while with the pre-trained HiFi-GAN model (https://github.com/jik876/hifi-gan) the performance of the original paper can be achieved. We also verified the performance on the Mandarin dataset, where it is poorer on both pronunciation and duration. We are continuing to experiment with replacing the structure of the text encoder (the performance of the transformer on the Mandarin dataset is questionable), and with removing MAS in favor of explicit duration modeling, as in FastSpeech.

WelkinYang, Jul 28 '21 05:07
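
For readers following the "explicit duration modeling" idea mentioned above: in FastSpeech-style models, a per-token duration (in mel frames) takes the place of the alignment that MAS would otherwise search for, and a length regulator repeats each text-encoder output that many times. The snippet below is only a minimal sketch of that expansion step in plain Python; the function name `length_regulate` and the list-of-lists representation are assumptions made for illustration and are not taken from this repository.

```python
from typing import List, Sequence


def length_regulate(
    frames: Sequence[Sequence[float]],
    durations: Sequence[int],
) -> List[List[float]]:
    """Repeat each encoder output vector according to its duration.

    frames:    one feature vector per input token (hypothetical encoder output)
    durations: number of mel frames assigned to each token
    """
    if len(frames) != len(durations):
        raise ValueError("frames and durations must have the same length")

    expanded: List[List[float]] = []
    for frame, duration in zip(frames, durations):
        if duration < 0:
            raise ValueError("durations must be non-negative")
        # Copy the vector once per output frame; a duration of 0 drops the token.
        expanded.extend(list(frame) for _ in range(duration))
    return expanded


if __name__ == "__main__":
    # Three tokens with durations 2, 0 and 3 give 5 output frames.
    encoder_out = [[0.1], [0.2], [0.3]]
    print(length_regulate(encoder_out, [2, 0, 3]))
```

In a real model the durations would come from a trained duration predictor and the expansion would operate on batched tensors; the sketch only shows the bookkeeping.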

@WelkinYang Exactly. I am also working on replacing the whole text encoder with an FS2 encoder, like Diff-TTS.

rishikksh20, Jul 28 '21 06:07