TransformerTTS
Multiple speaker voices
It seems that the current implementation is designed for a single speaker voice, like the LJSpeech dataset you used; that dataset is about 24 hours of audio recordings from a single speaker.
I have a dataset of hundreds of speakers, each with less than an hour of audio recordings.
Can you train a multi-speaker model?
If not, is it possible to take your pretrained model and fine-tune it on my single speaker with less than 1 hour of audio recordings?
I have not experimented with this yet, but in general it should be hard but doable. The results will probably vary. You can also experiment with adding speaker embeddings (concatenating along the vertical axis of the input tokens, for instance).
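For concreteness, here is a minimal sketch of that idea in TensorFlow. One reading of "concatenating along the vertical axis" is appending a learned per-speaker vector to every token embedding along the feature axis; all names and dimensions below (`num_speakers`, `speaker_embed_dim`, etc.) are illustrative and not taken from the repo:

```python
import tensorflow as tf

# Illustrative dimensions; not taken from the repo's config.
num_speakers = 100       # hypothetical size of the speaker set
speaker_embed_dim = 64   # hypothetical speaker vector size

speaker_table = tf.keras.layers.Embedding(num_speakers, speaker_embed_dim)

def concat_speaker_embedding(token_embeddings, speaker_ids):
    """Append a learned speaker vector to each token embedding.

    token_embeddings: (batch, seq_len, token_dim)
    speaker_ids:      (batch,) integer speaker indices
    returns:          (batch, seq_len, token_dim + speaker_embed_dim)
    """
    spk = speaker_table(speaker_ids)                        # (batch, speaker_embed_dim)
    seq_len = tf.shape(token_embeddings)[1]
    spk = tf.tile(spk[:, tf.newaxis, :], [1, seq_len, 1])   # repeat along time axis
    return tf.concat([token_embeddings, spk], axis=-1)      # concat on feature axis
```

Note that the encoder's input projection would then need to accept the wider `token_dim + speaker_embed_dim` feature size.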
What about the other issue, fine-tuning the existing model? My single speaker has only 40 minutes of recordings.
Sorry, I'm not sure how this differs from the previous question. You can do the following:
- train the model from scratch and see what the results look like (very likely not to generalize well at all)
- take the pretrained model and retrain it on your corpus (very likely to get better results than the previous option, but with sub-optimal quality; I have not experimented much with this yet, see the sketch after this list)
- add speaker embeddings and train on the joint corpus. You might want to add more speakers to the dataset.
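A hedged sketch of the second option, using generic Keras calls; the stand-in network, the checkpoint path, and the dataset name are all placeholders, since the repo's actual training entry points are not shown here:

```python
import tensorflow as tf

# Stand-in network; the actual TransformerTTS model is far more elaborate.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(256, activation='relu'),
    tf.keras.layers.Dense(80),          # e.g. 80 mel bins per frame
])
model.build(input_shape=(None, 256))

# Restore pretrained weights; 'pretrained/' is an illustrative path.
ckpt = tf.train.Checkpoint(model=model)
ckpt.restore(tf.train.latest_checkpoint('pretrained/')).expect_partial()

# A much smaller learning rate than the original schedule helps the
# ~40 minutes of new audio adapt the voice without overwriting what
# the model learned on LJSpeech.
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5),
              loss='mae')
# model.fit(single_speaker_dataset, epochs=...)  # your 40-minute corpus
```

Freezing the early layers and fine-tuning only the later ones is a common further variant when data is this scarce.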
@arsalan993 - Did you try fine-tuning on your speaker with 40 minutes of audio? How did it go?