metavoice-src

What is the required audio length for fine-tuning?

Open risqaliyevds opened this issue 1 year ago • 4 comments

I split my audio into 5-10 second chunks. Is this normal for fine-tuning, or is there a specific range for audio chunks? I fine-tuned on my Uzbek-language audio (approximately 30 hours), but my loss is not decreasing.

risqaliyevds avatar Mar 28 '24 07:03 risqaliyevds

Hey! 5-10 seconds should be enough, but note that during synthesis you'll struggle to generate more than 5-10 seconds at one time due to this...

It's hard to debug a loss that isn't decreasing without more info!

vatsalaggarwal avatar Mar 30 '24 15:03 vatsalaggarwal

Hey @risqaliyevds, let us know if you have any more info, or we'll look to close this issue in the next few days.

lucapericlp avatar Apr 03 '24 09:04 lucapericlp

I ran into a similar problem. Neither the training loss nor the validation loss is decreasing.

eshoyuan avatar May 01 '24 22:05 eshoyuan

Could both of you provide more information about your finetuning configurations and the datasets you're using? As @vatsalaggarwal mentioned, 5-10s should be fine if that's appropriate at inference time. Are either of you able to get finetuning working with a non-custom dataset (e.g. LibriTTS, VCTK)?

lucapericlp avatar May 14 '24 21:05 lucapericlp
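
One piece of dataset information that is easy to gather for a report like this is the actual distribution of chunk lengths. The following is a minimal sketch, not part of metavoice-src, assuming the chunks are uncompressed PCM WAV files readable by Python's standard `wave` module. The function names and the 5-10 second default range are illustrative assumptions taken from the discussion above.

```python
import wave


def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(str(path), "rb") as wf:
        return wf.getnframes() / wf.getframerate()


def summarize_chunks(durations, min_s=5.0, max_s=10.0):
    """Summarize chunk durations against an assumed target range.

    min_s and max_s default to the 5-10 second range mentioned in
    this thread; they are not values required by any library.
    """
    durations = list(durations)
    in_range = [d for d in durations if min_s <= d <= max_s]
    return {
        "count": len(durations),
        "total_hours": sum(durations) / 3600.0,
        "shortest_s": min(durations) if durations else 0.0,
        "longest_s": max(durations) if durations else 0.0,
        "in_range": len(in_range),
    }
```

For example, `summarize_chunks(wav_duration_seconds(p) for p in paths)`, where `paths` is your own list of chunk files, gives the total hours and how many chunks fall outside the intended range, which can be pasted into the issue alongside the finetuning configuration.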