
To train a new voice for English, how many hours of audio do you recommend?

Open xiao1ongbao opened this issue 1 year ago • 2 comments

To train a new voice for English, how many hours of audio do you recommend? Does the training script train from scratch, or does it fine-tune the existing model? Thanks!

xiao1ongbao avatar Sep 27 '24 02:09 xiao1ongbao

If you take G_0.pth (the first checkpoint saved during training) and use it for inference, it speaks English with a young female voice that doesn't match the audio clips being trained on. So it seems that training fine-tunes from that starting point rather than starting from scratch.

As for the duration of audio, I have gotten reasonable results with only 5 minutes of audio and 1k epochs, using 48 kHz WAV files. Most people use 1+ hours, however.
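If it helps to check how much audio a dataset actually contains before training, here is a minimal sketch that sums the length of every WAV file in a folder. It is not part of MeloTTS; the function names are made up for illustration, and it assumes plain PCM WAV files, since Python's standard `wave` module cannot read other encodings such as 32-bit float.

```python
import wave
from pathlib import Path


def wav_duration_seconds(path):
    """Return the length of one PCM WAV file in seconds."""
    with wave.open(str(path), "rb") as wf:
        # Duration = number of frames / frames per second (sample rate).
        return wf.getnframes() / wf.getframerate()


def total_duration_seconds(folder):
    """Sum the durations of all .wav files under `folder`, recursively."""
    return sum(wav_duration_seconds(p) for p in sorted(Path(folder).rglob("*.wav")))
```

For example, `total_duration_seconds("my_clips") / 60` gives the total in minutes, where `my_clips` is a placeholder for wherever the training clips live.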

iv2985 avatar Oct 27 '24 09:10 iv2985

I have 30 minutes of Indian English audio. Can you share some resources to help me with fine-tuning? Thanks @iv2985

rushichavda avatar May 24 '25 18:05 rushichavda