metavoice-src
What is the required audio length for fine-tuning?
I split my audio into 5-10 second chunks. Is this normal for fine-tuning, or is there a specific range for audio chunks? I fine-tuned on my Uzbek-language audio (approximately 30 hours), and my loss is not decreasing.
Hey! 5-10 seconds should be enough, but note that during synthesis you'll struggle to generate more than 5-10 seconds at one time due to this...
It's hard to debug a loss that isn't decreasing without more info!
Hey @risqaliyevds, let us know if you have any more info, or we'll look to close this issue in the next few days.
I ran into a similar problem. Neither training loss nor validation loss is decreasing.
Could both of you provide more information about your fine-tuning configurations and the datasets you're using? As @vatsalaggarwal mentioned, 5-10s should be fine if that's appropriate at inference time. Is either of you able to get fine-tuning working with a non-custom dataset (e.g. LibriTTS, VCTK)?
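When sharing dataset details, a quick summary of clip durations can help confirm that the chunks really fall in the 5-10 second range. The following is a minimal sketch, not part of metavoice-src: it assumes the chunks are PCM `.wav` files under one directory, and the function names and the `low`/`high` thresholds are hypothetical choices for illustration.

```python
import wave
from pathlib import Path
from statistics import mean


def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def summarize_durations(durations, low=5.0, high=10.0):
    """Summarise clip lengths and count clips outside [low, high] seconds."""
    durations = list(durations)
    if not durations:
        raise ValueError("no clips found")
    return {
        "clips": len(durations),
        "total_hours": sum(durations) / 3600,
        "min_s": min(durations),
        "mean_s": mean(durations),
        "max_s": max(durations),
        # Clips shorter than `low` or longer than `high` (assumed thresholds).
        "outside_range": sum(1 for d in durations if not low <= d <= high),
    }


def summarize_dataset(root, low=5.0, high=10.0):
    """Walk `root` recursively and summarise every .wav file found."""
    paths = sorted(Path(root).rglob("*.wav"))
    return summarize_durations((wav_duration_seconds(p) for p in paths), low, high)
```

Calling `summarize_dataset("path/to/chunks")` (a placeholder path) returns a dictionary that can be pasted into the issue alongside the fine-tuning configuration.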