TransformerTTS
Multiple speaker voices
It seems that the current implementation is designed for a single speaker voice, like the LJSpeech dataset you used; that dataset is about 24 hours of audio recordings from a single speaker.
I have a dataset of hundreds of speakers, each with less than an hour of audio recordings.
Can you train a multi-speaker model?
If not, is it possible to take your pretrained model and fine-tune it on my single speaker with less than 1 hour of audio recordings?
I have not experimented with this yet, but in general it should be hard but doable. The results will probably vary. You can also experiment with adding speaker embeddings (concatenating along the vertical axis of the input tokens, for instance).
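For concreteness, here is a minimal sketch of that idea in TensorFlow. One reading of "concatenating along the vertical axis" is appending a learned per-speaker vector to every token embedding along the feature axis; all names and dimensions below (`num_speakers`, `speaker_embed_dim`, etc.) are illustrative and not taken from the repo:

```python
import tensorflow as tf

# Illustrative dimensions; not taken from the repo's config.
num_speakers = 100       # hypothetical size of the speaker set
speaker_embed_dim = 64   # hypothetical speaker vector size

speaker_table = tf.keras.layers.Embedding(num_speakers, speaker_embed_dim)

def concat_speaker_embedding(token_embeddings, speaker_ids):
    """Append a learned speaker vector to each token embedding.

    token_embeddings: (batch, seq_len, token_dim)
    speaker_ids:      (batch,) integer speaker indices
    returns:          (batch, seq_len, token_dim + speaker_embed_dim)
    """
    spk = speaker_table(speaker_ids)                        # (batch, speaker_embed_dim)
    seq_len = tf.shape(token_embeddings)[1]
    spk = tf.tile(spk[:, tf.newaxis, :], [1, seq_len, 1])   # repeat along time axis
    return tf.concat([token_embeddings, spk], axis=-1)      # concat on feature axis
```

Note that the encoder's input projection would then need to accept the wider `token_dim + speaker_embed_dim` feature size.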
What about the other issue, fine-tuning the existing model? My single speaker has only 40 minutes of recordings.
Sorry, I'm not sure how this differs from the previous question. You can do the following:
- train the model from scratch and see what the results look like (very likely not to generalize well at all)
- take the pretrained model and retrain it on your corpus (very likely to get better results than the previous option, but with sub-optimal quality; I have not experimented much with this yet, see the sketch after this list)
- add speaker embeddings and train on the joint corpus. You might want to add more speakers to the dataset.
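A hedged sketch of the second option, using generic Keras calls; the stand-in network, the checkpoint path, and the dataset name are all placeholders, since the repo's actual training entry points are not shown here:

```python
import tensorflow as tf

# Stand-in network; the actual TransformerTTS model is far more elaborate.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(256, activation='relu'),
    tf.keras.layers.Dense(80),          # e.g. 80 mel bins per frame
])
model.build(input_shape=(None, 256))

# Restore pretrained weights; 'pretrained/' is an illustrative path.
ckpt = tf.train.Checkpoint(model=model)
ckpt.restore(tf.train.latest_checkpoint('pretrained/')).expect_partial()

# A much smaller learning rate than the original schedule helps the
# ~40 minutes of new audio adapt the voice without overwriting what
# the model learned on LJSpeech.
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5),
              loss='mae')
# model.fit(single_speaker_dataset, epochs=...)  # your 40-minute corpus
```

Freezing the early layers and fine-tuning only the later ones is a common further variant when data is this scarce.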
@arsalan993 - Did you try fine-tuning on your speaker with 40 minutes of audio? How did it go?