Corentin Jemine

17 comments by Corentin Jemine

My implementations of the synthesizer and the vocoder aren't that great, and I've also trained on LibriSpeech when LibriTTS would have been preferable. I think fatchord's WaveRNN is very...

My voice works poorly with the model; others work well. I would not recommend using a 30-minute audio file. While technically it should work, the framework is meant to...
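
To give an idea of what a short reference clip looks like in practice, here is a minimal sketch using the repo's encoder interface; the weights path and the clip name are assumptions to adapt to your setup:

```python
from pathlib import Path
from encoder import inference as encoder

# Assumed location of the pretrained encoder weights; adjust to your setup
encoder.load_model(Path("encoder/saved_models/pretrained.pt"))

# A few seconds of clean speech is plenty; preprocess_wav resamples and trims silence
wav = encoder.preprocess_wav(Path("reference_5s.wav"))

# One fixed-size speaker embedding, regardless of the clip's duration
embed = encoder.embed_utterance(wav)
print(embed.shape)  # (256,) with the pretrained encoder
```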

> The speaker has a very thick voice, but the cloned result sounds like a normal person.

Yes, the synthesizer is only trained to output a voice at all times...

You'll need to retrain with your own datasets to get another language running (and it's a lot of work). The speaker encoder is somewhat able to work on a few...

You'll need a good dataset (at least ~300 hours of high-quality audio with transcripts) in the language of your choice. Do you have that?

From [here](https://matheo.uliege.be/bitstream/2268.2/6801/5/s123578Jemine2019.pdf#page=12):

> A particularity of the SV2TTS framework is that all models can be trained separately and on distinct datasets. For the encoder, one seeks to have a model...
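
One way to picture that decoupling: the synthesizer never touches the encoder's weights, it only consumes embeddings precomputed with a frozen, already-trained encoder. A rough sketch, with hypothetical dataset paths:

```python
from pathlib import Path
import numpy as np
from encoder import inference as encoder

# Frozen, already-trained encoder; the synthesizer dataset can be entirely different
encoder.load_model(Path("encoder/saved_models/pretrained.pt"))

synth_dataset = Path("datasets/synthesizer_audio")  # hypothetical root
for wav_path in sorted(synth_dataset.rglob("*.wav")):
    wav = encoder.preprocess_wav(wav_path)
    embed = encoder.embed_utterance(wav)
    # Save the embedding next to the audio; synthesizer training only reads these
    np.save(wav_path.with_suffix(".npy"), embed)
```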

You'd want hundreds of speakers at least. In fact, LibriSpeech-clean makes for 460 speakers and it's still not enough.

That's not nearly enough to learn the variation across speakers, especially not for a hard language such as Chinese.

You actually want the encoder dataset not to always be of good quality, because that makes the encoder robust. It's different for the synthesizer/vocoder, because the quality is the output...

You can do that, but I would then add the synthesizer dataset to the speaker encoder dataset. In SV2TTS, they use disjoint datasets between the encoder and the synthesizer, but...
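
As a rough sketch of what merging the two datasets means in practice, assuming both are laid out as one directory per speaker (the paths are hypothetical):

```python
from pathlib import Path

encoder_root = Path("datasets/encoder_only")  # e.g. noisier, untranscribed speech
synth_root = Path("datasets/synthesizer")     # e.g. clean, transcribed speech

# The encoder's training list simply draws speaker directories from both roots
speaker_dirs = [d for root in (encoder_root, synth_root)
                for d in sorted(root.iterdir()) if d.is_dir()]
print(f"{len(speaker_dirs)} speakers available for encoder training")
```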