PPSpeech End-to-End model or training of Text to Speech


Open Liujingxiu23 opened this issue 4 years ago • 3 comments

@rishikksh20 I notice that TTS is one of your areas of interest. Have you paid special attention to end-to-end TTS models or training, meaning training and inference that go from text to wav directly? Do you have any suggestions on this subject?

Mar 02 '21 07:03 Liujingxiu23

@Liujingxiu23 https://deepmind.com/research/publications/End-to-End-Adversarial-Text-to-Speech does the same. We mostly use two separate models, one for text to mel and another for mel to wav (the vocoder), just to simplify things. End-to-end models are much more complex in nature.

Mar 03 '21 00:03 rishikksh20

Converting text to wav directly is a very costly task, so we need a better way to handle it. We generally use an intermediate feature, i.e. the mel spectrogram: first text to mels, then mels to wav.
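To illustrate why the mel spectrogram is a convenient intermediate target, here is a minimal NumPy sketch that computes a log-mel spectrogram from a waveform. The settings (22050 Hz sample rate, 1024-point FFT, hop of 256 samples, 80 mel bands) and the function names are assumptions chosen for illustration and do not come from this thread; real pipelines usually rely on an audio library and may use different conventions.

```python
import numpy as np

def hz_to_mel(f):
    # HTK-style Hz -> mel conversion.
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    # Inverse of hz_to_mel.
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr, n_fft, n_mels):
    # Triangular filters whose centers are evenly spaced on the mel scale.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    hz_pts = mel_to_hz(mel_pts)
    fft_freqs = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        lo, ctr, hi = hz_pts[i], hz_pts[i + 1], hz_pts[i + 2]
        rising = (fft_freqs - lo) / (ctr - lo)
        falling = (hi - fft_freqs) / (hi - ctr)
        fb[i] = np.maximum(0.0, np.minimum(rising, falling))
    return fb

def log_mel(wav, sr=22050, n_fft=1024, hop=256, n_mels=80):
    # Frame the signal, apply a Hann window, take the power spectrum,
    # project onto the mel filterbank, then take the log.
    window = np.hanning(n_fft)
    n_frames = 1 + (len(wav) - n_fft) // hop
    frames = np.stack(
        [wav[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    mel = power @ mel_filterbank(sr, n_fft, n_mels).T
    return np.log(np.maximum(mel, 1e-10))  # shape: (n_frames, n_mels)

# One second of noise stands in for a real recording.
wav = np.random.default_rng(0).standard_normal(22050)
mel = log_mel(wav)
```

Under these assumed settings, one second of audio (22050 samples) becomes a few dozen frames of 80 values each, so the text-to-mel model predicts a much shorter sequence than the raw waveform, and the vocoder handles the sample-level detail.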

Mar 03 '21 00:03 rishikksh20

@rishikksh20 Thank you for your reply. I know that the common TTS solution today is "text->mel" + "mel->wav", and the synthesized wavs are really very good. The only question is that sometimes the timbre (perhaps the speaker similarity) of the synthesized wavs differs a little from the original recording. So I think going "text->wav" directly may solve the problem, since when I look at the synthesized mels, I find that they differ from the original mels in the details. Do you have any suggestions about the "timbre" difference between the synthesized wavs and the recorded wavs?

Regarding the "EATS" paper, I have read it but have not written Python scripts to test this method.
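The mel comparison mentioned above could be made more concrete with a rough sketch like the following. It assumes the two log-mel arrays are already time-aligned frame by frame (for example, mels predicted with ground-truth durations), which is an assumption and not something stated in this thread; `mel_difference` is a hypothetical helper name.

```python
import numpy as np

def mel_difference(mel_ref, mel_syn):
    """Mean absolute log-mel difference, overall and per mel band.

    Both inputs have shape (n_frames, n_mels) and are assumed to be
    time-aligned; any extra trailing frames in the longer one are ignored.
    """
    n = min(len(mel_ref), len(mel_syn))
    diff = np.abs(mel_ref[:n] - mel_syn[:n])
    return diff.mean(), diff.mean(axis=0)

# Toy arrays standing in for real recorded and synthesized mels.
mel_ref = np.zeros((10, 80))
mel_syn = np.ones((12, 80))
overall, per_band = mel_difference(mel_ref, mel_syn)
```

The per-band values show which frequency regions deviate most from the recording, which may help narrow down where the synthesized mels lose detail. This is only a diagnostic and does not by itself explain a timbre mismatch.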

Mar 03 '21 01:03 Liujingxiu23
