Model fine tuning
Could you (or anyone else) please share how to do fine-tuning to achieve these amazing results? I'm fairly new to TensorFlow and deep learning in general, so I would greatly appreciate an explanation.
Thanks!
Hello, he is using multiple speakers in his training. Apparently he takes the model pre-trained on the LJ Speech Dataset and re-trains it with the new speaker's dataset, thus sharing the knowledge, since everyone's English pronunciation is similar. I don't know how to train DCTTS with multiple speakers myself; I would very much like to know...
I was looking at DCTTS, but I did not find anything about multi-speaker training.
How can we train the multi-speaker DCTTS?
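In the meantime, for anyone experimenting: the usual recipe people describe (a sketch of my understanding, not a confirmed description of @Kyubyong's exact procedure) is to prepare the new speaker's recordings in the same layout as LJ Speech, copy the pretrained LJ checkpoints into the log directory, and re-run training so it resumes from those weights. The directory names below are assumptions:

```python
import glob
import os
import shutil

PRETRAINED = "pretrained/LJ01-1"  # hypothetical: unpacked LJ checkpoint files
LOGDIR = "logdir/new_speaker-1"   # hypothetical: log dir the training script reads

# Seed the fine-tuning run by copying the pretrained checkpoint into the
# log directory, so training resumes from it instead of random initialization.
os.makedirs(LOGDIR, exist_ok=True)
for path in glob.glob(os.path.join(PRETRAINED, "*")):
    shutil.copy(path, LOGDIR)
```

After that, running "train.py 1" (Text2Mel) and "train.py 2" (SSRN) on the new speaker's data should, if I understand the scripts correctly, pick up from the copied checkpoint.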
Yes, I am very interested in this also. @Kyubyong, can you please elaborate on "...10 minutes of fine-tuning training"?
@Kyubyong Could you elaborate more? How did you train for multiple speakers? Can you choose the speaker when testing the samples in the "synthesis" phase? Thanks.
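While we wait for an answer: since the model itself appears to be single-speaker, my guess (an assumption, not something confirmed in this thread) is that each speaker gets its own fine-tuned checkpoint, and "choosing a speaker" at synthesis time just means restoring the matching checkpoint. A minimal sketch, with hypothetical directory names:

```python
import tensorflow as tf

# Hypothetical mapping from speaker name to that speaker's checkpoint directory.
SPEAKER_LOGDIRS = {
    "lj": "logdir/LJ01-1",
    "new_speaker": "logdir/new_speaker-1",
}

def restore_for_speaker(sess, saver, speaker):
    """Load the fine-tuned weights for the requested speaker before synthesis."""
    ckpt = tf.train.latest_checkpoint(SPEAKER_LOGDIRS[speaker])
    saver.restore(sess, ckpt)
```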
@Kyubyong Could you help us? We are very interested in this. Thanks.
@Kyubyong Did you only run "train.py 2" with samples of the new speaker? Did you use weights from the LJ pre-trained models for this?
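If it helps while we wait: in TensorFlow 1.x, starting from the LJ weights rather than random initialization boils down to restoring the latest checkpoint after building the graph. A minimal sketch (the toy variable stands in for the real graph, and the checkpoint path is an assumption):

```python
import tensorflow as tf

# Toy stand-in variable so the snippet runs on its own; in practice the full
# Text2Mel or SSRN graph from train.py would be built here instead.
w = tf.get_variable("w", shape=[2, 2])

saver = tf.train.Saver()
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    ckpt = tf.train.latest_checkpoint("logdir/LJ01-1")  # assumed LJ checkpoint dir
    if ckpt is not None:
        saver.restore(sess, ckpt)  # start from LJ weights instead of random init
    # ...the normal training loop on the new speaker's batches would go here...
```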
@Kyubyong I also have a few questions about how you do the fine-tuning:
- Do you fine-tune both Text2Mel and SSRN, or only one of them (I'd say Text2Mel in that case), or even only part of Text2Mel?
- Do you change any hyperparameters, such as the learning rate or the batch size?
Thanks
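While this is still unanswered, here is how one could fine-tune only Text2Mel and keep SSRN frozen in TensorFlow 1.x. The scope names are assumptions about how the graph is organized, and the toy variables are only there to make the snippet self-contained:

```python
import tensorflow as tf

# Toy graph standing in for the real networks; the scope names "Text2Mel"
# and "SSRN" are assumptions, not verified against the repo.
with tf.variable_scope("Text2Mel"):
    w = tf.get_variable("w", shape=[4, 4])
    loss = tf.reduce_sum(tf.square(w))
with tf.variable_scope("SSRN"):
    v = tf.get_variable("v", shape=[4, 4])

# Only Text2Mel's variables receive gradient updates; SSRN stays frozen.
text2mel_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope="Text2Mel")

# A learning rate well below the from-scratch value is a common fine-tuning
# choice; 1e-4 is a guess, not the author's setting.
train_op = tf.train.AdamOptimizer(learning_rate=1e-4).minimize(loss, var_list=text2mel_vars)
```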
I am so impressed, @Kyubyong !
And like the others here, I'm super eager to learn more from you.
Today is my first day using TensorFlow (and anything related to deep learning... it's all new to me), and it's so fun to see that after only 2,000 "steps" (and I don't even know what a "step" means), the audio files being generated by my computer are intelligible.
And at step 6,500 (which is the latest my GPU has processed so far), it's starting to sound impressive.
I'm very interested to learn how I can build on top of these Linda Johnson results and morph the voice using (far fewer) audio clips of a second speaker.
Thanks for any further guidance. You are amazing and inspiring. 💪
Also struggling to get this working. I'm trying with 30 audio files totaling about 1.5 minutes of audio, training from the pretrained LJ model (trained up to step 724,000) and setting the batch size to 30 (so all files are processed in every step). I've tried starting learning rates of both 0.01 and 0.001. Attached is the new data I'm trying to fine-tune the pretrained model on, and some results from training up to step 735,000 with the new data (whether the results come from 1,000 extra steps or, say, 10,000, they are terrible either way).
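One suggestion, purely as a guess: with only ~1.5 minutes of audio, a batch size that consumes the whole dataset every step combined with a fairly high learning rate can destabilize the pretrained weights quickly. Something gentler might be worth trying; the attribute names below follow what I believe hyperparams.py defines, and the values are guesses rather than a confirmed recipe:

```python
# Possible hyperparams.py overrides for fine-tuning on very little data.
data = "path/to/new_speaker"  # new speaker's dataset in LJ-style layout (assumed field)
batch_size = 8                # smaller batches than when training from scratch
lr = 1e-4                     # an order of magnitude below the 0.001 tried above
```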