
Training ForwardTacotron on a dataset comprised of multiple male voices as a single speaker dataset?

Open tomsabanov opened this issue 3 years ago • 10 comments

Hi,

I was wondering if it is possible to train on a dataset that has, say, 2-3 male voices, each with about 10 hours of data.

Will the end result of this be a good neutral male voice?

tomsabanov avatar Aug 02 '21 18:08 tomsabanov

Hi, the short answer is that the voice is going to be rubbish, as the model will average them. I will probably implement a multispeaker version soon. The idea is to condition each voice on a speaker embedding, e.g. from https://github.com/resemble-ai/Resemblyzer, and provide a reference embedding at inference time. I had some success with that approach in this repo previously, but that branch is outdated already (it was done before pitch and energy conditioning were implemented).
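
For reference, extracting such a reference embedding with Resemblyzer only takes a few lines; here is a minimal sketch (the wav paths are placeholders, and wiring the embedding into the model as a conditioning input is a separate step):

```python
import numpy as np
from pathlib import Path
from resemblyzer import VoiceEncoder, preprocess_wav

encoder = VoiceEncoder()

# One 256-dim d-vector per utterance; average them for a per-speaker reference embedding.
wav_paths = sorted(Path("datasets/speaker_a/wavs").glob("*.wav"))  # placeholder path
utt_embeds = [encoder.embed_utterance(preprocess_wav(p)) for p in wav_paths]
speaker_embed = np.mean(utt_embeds, axis=0)
speaker_embed = speaker_embed / np.linalg.norm(speaker_embed)  # keep it L2-normalised

np.save("speaker_a_embedding.npy", speaker_embed)
```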

cschaefer26 avatar Aug 03 '21 07:08 cschaefer26

I've read a master's thesis on Finnish TTS in which the author got good results with a "warm start" method: he first trained a base model on roughly 20 hours of data from multiple voices and then trained a single voice on top of that model.

Would this idea work with ForwardTacotron?

tomsabanov avatar Aug 03 '21 09:08 tomsabanov

I still think this makes much more sense if you have speaker conditioning. Do the authors share their model architecture? I suspect they are using some speaker embedding.

cschaefer26 avatar Aug 03 '21 09:08 cschaefer26

The author used Nvidia's implementation of Tacotron and didn't change anything in the code. The following is extracted from the thesis.

"Using a warm-starting training schema yielded better results. First, a general model was trained using all available data. The model had no information about the speaker, even though the targets consisted of mel-spectrograms created from multiple speakers’ voices. During training, the model had to generate utterances with many different voices for the same input. This prohibited the model from converging after a certain point. In the end, the general model produced very unnatural yet understandable speech. When creating an utterance, the model seemed to randomly "choose" a speaker from the training set and produce the rest of the utterance with that voice. Even though the speech sounded unnatural, it always clearly resembled a specific speaker’s voice from the training set. The weights of the general model were then used to initialize weights for an actual single speaker model. Experimenting with different ways of creating the initial model showed that using data from speakers of the same gender gave better results than having speech from both genders in the training data. In addition, letting the model train until the training error started to plateau worked better than stopping the training early."

tomsabanov avatar Aug 03 '21 10:08 tomsabanov

Ah, very interesting. That could well be tried with this repo then. If there is enough data for each speaker, it could work. Just try it out and throw everything in. Carefully watch the tacotron training to see whether the attention score jumps above 0.5 between 3k and 10k steps. If it's successful, you can wait until the alignments are extracted (after 40k tacotron training steps), then train your multispeaker forward tacotron until 50k steps or so and then start messing with the data (replace it with single-speaker data).
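
The warm-start step itself is standard PyTorch checkpoint re-use; a rough sketch is below. The checkpoint layout and the `build_forward_tacotron()` constructor are assumptions, so check how train_forward.py actually saves and builds its model:

```python
import torch

# Assumed checkpoint layout: a dict with a 'model' key holding the state_dict.
ckpt = torch.load("checkpoints/forward_multispeaker_50k.pt", map_location="cpu")
state_dict = ckpt.get("model", ckpt)

model = build_forward_tacotron()  # hypothetical constructor; use the repo's own model class
missing, unexpected = model.load_state_dict(state_dict, strict=False)
print("missing keys:", missing)
print("unexpected keys:", unexpected)

# From here, continue training on the single-speaker data only (ideally with a lower learning rate).
```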

cschaefer26 avatar Aug 03 '21 10:08 cschaefer26

I will report my findings.

Thank you for your help.

tomsabanov avatar Aug 03 '21 10:08 tomsabanov

Good luck, lmk how it goes!

cschaefer26 avatar Aug 03 '21 10:08 cschaefer26

Haven't tried it, but I found that speaker selection isn't random; it's usually driven by some similarity to the training-data sentences. Unfortunately, it often overrides the speaker embedding in my case: pick a sentence from one speaker's training data and the embedding vector of another speaker, and you usually still get output in the first speaker's voice, even if you slightly modify the sentence. For very long sentences it sometimes does switch mid-sentence. I tried reinforcing the speaker ID at multiple positions in the network, but that didn't really help.

m-toman avatar Aug 03 '21 14:08 m-toman

I have another question regarding the fine-tuning of an existing model. Do I have to save both resulting models from train_tacotron.py and train_forward.py and then load them when I want to fine-tune them in their respective scripts?

How would I go about this?

tomsabanov avatar Aug 03 '21 20:08 tomsabanov

The tacotron is only used to extract phoneme durations from the dataset. Once you have processed all voices at once, you can simply fine-tune from the latest forward model. You will probably need to manually filter the data according to the speaker.
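
A minimal sketch of that filtering step, assuming a metadata.csv where each line starts with a speaker-prefixed utterance id (the exact file layout depends on how you preprocessed the data):

```python
from pathlib import Path

speaker = "speaker_a"  # placeholder speaker id
lines = Path("data/metadata.csv").read_text(encoding="utf-8").splitlines()

# Keep only utterances whose id starts with the chosen speaker's prefix.
kept = [line for line in lines if line.split("|")[0].startswith(f"{speaker}/")]

Path(f"data/metadata_{speaker}.csv").write_text("\n".join(kept), encoding="utf-8")
print(f"Kept {len(kept)} of {len(lines)} utterances for {speaker}")
```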

cschaefer26 avatar Aug 04 '21 07:08 cschaefer26