jeremy110
If you only look at the loss/g/total curve (in my fine-tuning case, at least), it seems quite normal. Could you please share your config.json and also mention what other modifications you have made...
I've looked at your config.json, and it seems you didn't add num_languages and num_tones there? Actually, my fine-tuning process was similar to yours: I trained a new language using IPA, and...
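For reference, here's a minimal sketch of how I'd patch those two fields into an existing config.json. The values, and the assumption that they sit under the "data" section, are placeholders; set them to match your own language and tone tables.

```python
import json

# Minimal sketch: add num_languages / num_tones to an existing MeloTTS config.json.
with open("config.json", "r", encoding="utf-8") as f:
    cfg = json.load(f)

# Assumption: in my setup these fields live under the "data" section; adjust if yours differs.
cfg.setdefault("data", {})
cfg["data"]["num_languages"] = 10  # placeholder: total number of language IDs you use
cfg["data"]["num_tones"] = 16      # placeholder: total number of tone IDs you use

with open("config.json", "w", encoding="utf-8") as f:
    json.dump(cfg, f, ensure_ascii=False, indent=2)
```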
We usually look at g/total, and from your graph, it seems to be decreasing pretty well. But I’m not sure if 2 hours of training data is enough; I initially...
@smlkdev Basically, this training run can be kept short since it's just fine-tuning; there's no need to make it too long. Here's my previous TensorBoard log for your reference (https://github.com/myshell-ai/MeloTTS/issues/120#issuecomment-2105728981). I...
Yes, that's correct. I tried both single-speaker and multi-speaker models, and the total duration is around 8-10 hours. If this is your first time getting into it, I recommend you...
In terms of quality, I think F5-TTS is quite good; you can try it out in the Hugging Face demo. The pauses within sentences mainly depend on your commas (","). The...
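To illustrate what I mean about commas (the `synthesize` call below is just a hypothetical stand-in for whatever F5-TTS inference entry point you use):

```python
# Illustration only: comma placement controls where pauses land in the synthesized audio.
text_few_pauses  = "I went to the store and bought some bread milk and eggs."
text_with_pauses = "I went to the store, and bought some bread, milk, and eggs."

# audio_a = synthesize(text_few_pauses)   # hypothetical call: few pauses inside the sentence
# audio_b = synthesize(text_with_pauses)  # hypothetical call: a short pause near each comma
```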
@manhcuong17072002 Hello~ In my training, some speakers had 1 or 2 hours of audio, while others had 30 minutes, and in the end, there were about 10 hours of total...
@manhcuong17072002 Yes, there are about 15 speakers. Of course, if you have enough speakers, you can keep increasing that number. After 10 hours, the voice quality is quite close,...
@manhcuong17072002 You're welcome; your conclusion is correct. Normally, long audio files are avoided during training to prevent GPU OOM (out-of-memory) issues. Therefore, during inference, punctuation marks are typically...
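Something along these lines is what I have in mind for the inference side (a rough sketch; the punctuation set and per-chunk budget are assumptions you'd tune for your language and GPU):

```python
import re

# Split long input text at punctuation before inference so no single chunk
# is long enough to run the GPU out of memory.
PUNCT_SPLIT = re.compile(r"(?<=[.!?;,。！？；，])\s*")
MAX_CHARS = 200  # assumed per-chunk budget; tune for your hardware

def chunk_text(text: str, max_chars: int = MAX_CHARS) -> list:
    pieces = [p.strip() for p in PUNCT_SPLIT.split(text) if p.strip()]
    chunks, current = [], ""
    for piece in pieces:
        if current and len(current) + len(piece) + 1 > max_chars:
            chunks.append(current)
            current = piece
        else:
            current = f"{current} {piece}".strip()
    if current:
        chunks.append(current)
    return chunks

# Each chunk is synthesized separately and the resulting waveforms are concatenated.
```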
@manhcuong17072002 If we consider 30 minutes of audio and assume each word takes about 0.3 seconds, that works out to roughly 5000–6000 words. These words would then be converted into phoneme format,...
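As a quick back-of-the-envelope check of that estimate (the 0.3 s per word is the same assumption as above):

```python
minutes = 30
seconds_per_word = 0.3                  # assumed average speaking rate
words = minutes * 60 / seconds_per_word
print(words)                            # 6000.0 -> roughly 5000-6000 words once pauses are accounted for
```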