To train a new voice for English, how many hours of audio do you recommend?

Open xiao1ongbao opened this issue 1 year ago • 2 comments

To train a new voice for English, how many hours of audio do you recommend? Does the training script train from scratch or fine-tune the existing model? Thanks!

Sep 27 '24 02:09 xiao1ongbao

If you take G_0.pth (the first checkpoint saved during training) and use it for inference, it speaks English in a young female voice that doesn't match the audio clips being trained on. So it seems the script is fine-tuning from that starting point rather than training from scratch.

As for the duration of audio, I have gotten reasonable results with only 5 minutes of 48 kHz WAV audio and 1k epochs. Most people use 1+ hours, however.
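
To check how much audio a dataset actually contains, something like the following may help. It is only a minimal sketch, assuming the clips are uncompressed PCM WAV files under one folder; `summarize_wavs` and its `expected_rate` parameter are hypothetical names for illustration and are not part of this repo's training scripts.

```python
import wave
from pathlib import Path


def summarize_wavs(folder, expected_rate=48000):
    """Return (total_seconds, mismatched) for all .wav files under folder.

    mismatched is a list of (file name, sample rate) pairs for clips whose
    sample rate differs from expected_rate (48 kHz is an assumption here).
    """
    total_seconds = 0.0
    mismatched = []
    for path in sorted(Path(folder).rglob("*.wav")):
        # The stdlib wave module only reads uncompressed PCM WAV files.
        with wave.open(str(path), "rb") as wav:
            rate = wav.getframerate()
            total_seconds += wav.getnframes() / rate
            if rate != expected_rate:
                mismatched.append((path.name, rate))
    return total_seconds, mismatched
```

Calling `summarize_wavs("my_dataset")` (a placeholder folder name) returns the total length in seconds, which you can divide by 60 to compare against the 5 minute and 1+ hour figures above, plus a list of clips that would need resampling.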

Oct 27 '24 09:10 iv2985

I have 30 minutes of Indian English audio. Can you share some resources to help me with fine-tuning? Thanks @iv2985

May 24 '25 18:05 rushichavda