Corentin Jemine
My implementations of the synthesizer and of the vocoder aren't that great, and I've also trained on LibriSpeech when LibriTTS would have been preferable. I think fatchord's WaveRNN is very...
My voice works poorly with the model; others work nicely. I would not recommend using a 30-minute audio file. While technically it should work, the framework is meant to...
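If all you have is one long recording, a workaround more in line with that design is to slice it into short utterances and average their embeddings instead of embedding the whole file at once. A minimal sketch, assuming the repo's `encoder/inference.py` API (`load_model`, `preprocess_wav`, `embed_utterance`); the model path and `long_recording.wav` are placeholders:

```python
from pathlib import Path

import numpy as np
import librosa

from encoder import inference as encoder  # the repo's speaker encoder module

encoder.load_model(Path("encoder/saved_models/pretrained.pt"))  # hypothetical path

# Load the long recording at the encoder's expected sample rate (16 kHz).
wav, sr = librosa.load("long_recording.wav", sr=16000)

# Slice into ~5-second utterances, the scale the encoder was trained on.
chunk = 5 * sr
utterances = [wav[i:i + chunk] for i in range(0, len(wav), chunk)]

# Embed each short utterance, then average and re-normalize; this is a
# common way to get one speaker embedding from many utterances.
embeds = [encoder.embed_utterance(encoder.preprocess_wav(u))
          for u in utterances if len(u) > sr]  # skip fragments under 1 s
speaker_embed = np.mean(embeds, axis=0)
speaker_embed /= np.linalg.norm(speaker_embed)
```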
> The speaker has a very thick voice, but the cloned result sounds like a normal person.

Yes, the synthesizer is only trained to output a voice at all times...
You'll need to retrain with your own datasets to get another language running (and it's a lot of work). The speaker encoder is somewhat able to work on a few...
You'll need a good dataset (at least ~300 hours of high-quality audio with transcripts) in the language of your choice. Do you have that?
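As a sanity check before committing to a training run, it's worth tallying how many hours and speakers your corpus actually has. A rough sketch, assuming a hypothetical layout of `<root>/<speaker>/*.wav` with matching `.txt` transcripts:

```python
from pathlib import Path

import soundfile as sf

root = Path("datasets/my_language")  # hypothetical dataset root

total_seconds = 0.0
speakers = set()
for wav_path in root.glob("*/*.wav"):
    speakers.add(wav_path.parent.name)
    total_seconds += sf.info(str(wav_path)).duration   # header read only
    if not wav_path.with_suffix(".txt").exists():      # synthesizer needs transcripts
        print(f"missing transcript: {wav_path}")

print(f"{total_seconds / 3600:.1f} hours across {len(speakers)} speakers")
```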
From [here](https://matheo.uliege.be/bitstream/2268.2/6801/5/s123578Jemine2019.pdf#page=12): > A particularity of the SV2TTS framework is that all models can be trained separately and on distinct datasets. For the encoder, one seeks to have a model...
You'd want at least hundreds of speakers. In fact, LibriSpeech-clean makes for 460 speakers, and it's still not enough.
That's not nearly enough to learn the variation between speakers, especially not for a hard language such as Chinese.
You actually want the encoder dataset not to always be of good quality, because that makes the encoder robust. It's different for the synthesizer/vocoder, because the quality is the output...
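One way to get that robustness deliberately is to augment clean encoder utterances with noise at training time rather than hunting for naturally noisy data. A generic sketch of additive noise at a random SNR (this is an assumption about how you might do the augmentation, not how this repo's encoder preprocessing works):

```python
import numpy as np

def mix_at_snr(clean: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Add `noise` to `clean` at the given signal-to-noise ratio in dB."""
    # Tile or crop the noise so it covers the whole utterance.
    reps = int(np.ceil(len(clean) / len(noise)))
    noise = np.tile(noise, reps)[:len(clean)]
    # Scale noise power so that 10*log10(P_clean / P_noise) == snr_db.
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    noise = noise * np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    mixed = clean + noise
    return mixed / max(1.0, np.abs(mixed).max())  # avoid clipping

rng = np.random.default_rng(0)
clean = rng.standard_normal(16000).astype(np.float32)  # stand-in utterance
noise = rng.standard_normal(4000).astype(np.float32)   # stand-in noise clip
augmented = mix_at_snr(clean, noise, snr_db=rng.uniform(5, 20))
```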
You can do that, but I would then add the synthesizer dataset to the speaker encoder dataset. In SV2TTS, they use disjoint datasets between the encoder and the synthesizer, but...