PPSpeech
End-to-end model or training for text-to-speech
@rishikksh20 I notice that TTS is one of your areas of interest. Have you paid special attention to end-to-end TTS models or training, meaning training and inference that go from text to wav directly? Do you have any suggestions on this subject?
@Liujingxiu23 https://deepmind.com/research/publications/End-to-End-Adversarial-Text-to-Speech does the same. We mostly use two separate models, one for text to mel and another for mel to wav (the vocoder), just to simplify things. End-to-end models are much more complex in nature.
Converting text to wav directly is a very costly task, so we need a better way to handle it. We generally use an intermediate feature, the mel spectrogram, and split the problem into text to mel and mel to wav.
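To give a rough sense of the cost difference, here is a small sketch that compares how many time steps a model has to produce for a waveform versus a mel spectrogram. The sample rate (22050 Hz), hop length (256), and number of mel bins (80) are assumed values that are common in TTS setups, not settings taken from this thread, and `sequence_lengths` is a hypothetical helper name.

```python
def sequence_lengths(duration_s, sample_rate=22050, hop_length=256, n_mels=80):
    """Return (waveform samples, mel frames, total mel values) for one clip.

    All defaults are assumed, commonly used values; adjust them to your setup.
    """
    n_samples = int(duration_s * sample_rate)
    # A centered STFT typically yields one frame per hop, plus one.
    n_frames = n_samples // hop_length + 1
    return n_samples, n_frames, n_frames * n_mels


# A 5-second utterance: the waveform has hundreds of times more time steps
# than the mel spectrogram that the text-to-mel model has to predict.
n_samples, n_frames, n_values = sequence_lengths(5.0)
print("waveform samples:", n_samples)
print("mel frames:", n_frames)
print("mel values:", n_values)
print("time-step ratio:", round(n_samples / n_frames))
```

Under these assumptions the text-to-mel model only has to align text with a few hundred frames, and the vocoder handles the dense sample-level detail separately.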
@rishikksh20 Thank you for your reply. I know that the common TTS solution today is "text->mel" + "mel->wav". The synthesized wavs are really very good. The only question is that sometimes the timbre (perhaps the speaker similarity) of the synthesized wavs differs a little from the original recording. When I look at the synthesized mels, I find that they differ from the original mels in the details, so I think going "text->wav" directly might solve the problem. Do you have any suggestions about the timbre difference between the synthesized wavs and the recorded wavs?
Regarding the "EATS" paper, I have read it but have not written Python scripts to test the method.
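One way to make the "mels differ in the details" observation measurable is to compute the error per mel bin between the original and the synthesized spectrogram. The sketch below is only an illustration: `per_bin_mae` is a hypothetical helper, the spectrograms are plain lists of frames, and it assumes the two are already time-aligned (for example, synthesized with ground-truth durations). Without alignment, the numbers would mostly reflect timing differences.

```python
def per_bin_mae(mel_ref, mel_syn):
    """Mean absolute error per mel bin between two spectrograms.

    Each spectrogram is a list of frames, each frame a list of n_mels floats.
    Assumes the frames are time-aligned; extra trailing frames are ignored.
    """
    n_frames = min(len(mel_ref), len(mel_syn))
    if n_frames == 0:
        raise ValueError("both spectrograms need at least one frame")
    n_mels = len(mel_ref[0])
    totals = [0.0] * n_mels
    # zip stops at the shorter spectrogram, matching n_frames above.
    for ref_frame, syn_frame in zip(mel_ref, mel_syn):
        for b in range(n_mels):
            totals[b] += abs(ref_frame[b] - syn_frame[b])
    return [t / n_frames for t in totals]


# Toy example with 2 frames and 3 mel bins (made-up numbers).
ref = [[0.0, 1.0, 2.0], [0.0, 1.0, 2.0]]
syn = [[0.0, 1.0, 1.0], [0.0, 1.5, 1.0]]
errors = per_bin_mae(ref, syn)
print(errors)
```

If the error turns out to be concentrated in particular bins, that may help narrow down whether the text-to-mel model or the vocoder is the more likely source of the timbre difference.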