SpeechT5 Getting TTS output voice close to the training data

Getting TTS output voice close to the training data - Finetuning on different language

Open Srija616 opened this issue 2 years ago • 2 comments

Hi! I have fine-tuned SpeechT5 on one of our Hindi datasets, transliterated to English. The pronunciation of words is quite good; however, the synthesized voice sounds a bit mechanical and doesn't match the training data (a studio-recorded dataset with male and female voices). From what I understand, the synthesized speech depends on the speaker embeddings passed as an argument to model.generate_speech, and according to the fine-tuning Colab tutorial, we can pass any speaker embeddings.

I would like to match the voice quality of the training dataset. I have trained the model for around 4000 steps with the same training hyperparameters as the official Colab fine-tuning tutorial for the Dutch language.

Can you suggest ways to get close to the training data voice?
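
For context, here is a rough sketch of the kind of thing this question is about: building one speaker embedding from the training speaker's own utterances and passing it to model.generate_speech, rather than an arbitrary embedding. It is not a confirmed fix. It assumes per-utterance x-vectors are already available from the same speaker encoder used during fine-tuning (the tutorial uses speechbrain/spkrec-xvect-voxceleb). Averaging and re-normalizing them is an assumption, and `mean_speaker_embedding`, `synthesize`, and `model_dir` are hypothetical names. `synthesize` also assumes the processor was saved alongside the fine-tuned checkpoint.

```python
import numpy as np


def mean_speaker_embedding(utterance_embeddings):
    """Average per-utterance x-vectors into one unit-length speaker embedding.

    Assumption: each item is a 1-D x-vector for the same training speaker.
    """
    stacked = np.stack(
        [np.asarray(e, dtype=np.float32) for e in utterance_embeddings]
    )
    # L2-normalize each utterance embedding so no single clip dominates.
    stacked = stacked / np.linalg.norm(stacked, axis=1, keepdims=True)
    mean = stacked.mean(axis=0)
    # Re-normalize the average to unit length.
    return mean / np.linalg.norm(mean)


def synthesize(text, speaker_embedding, model_dir):
    """Generate a waveform with the fine-tuned model and the given embedding.

    `model_dir` is a hypothetical path to the fine-tuned checkpoint.
    Heavy imports are kept local so the helper above works without them.
    """
    import torch
    from transformers import (
        SpeechT5ForTextToSpeech,
        SpeechT5HifiGan,
        SpeechT5Processor,
    )

    processor = SpeechT5Processor.from_pretrained(model_dir)
    model = SpeechT5ForTextToSpeech.from_pretrained(model_dir)
    vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")

    inputs = processor(text=text, return_tensors="pt")
    # generate_speech expects a batch dimension on the speaker embedding.
    emb = torch.tensor(speaker_embedding, dtype=torch.float32).unsqueeze(0)
    return model.generate_speech(inputs["input_ids"], emb, vocoder=vocoder)
```

With a male and a female voice in the dataset, it would probably make sense to build one embedding per speaker instead of mixing them.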

Jul 23 '23 18:07 Srija616

Any update on this?

Aug 03 '23 10:08 kdcyberdude

Hey, can you please share your code file with me? I am working on the same project and would like it for reference.

Mar 17 '24 15:03 Naman3007
