MeloTTS

Training from Scratch Yielding Unusable Results

Open BankNatchapol opened this issue 1 year ago • 5 comments

Hello. I've been working on training a model from scratch using approximately 300 hours of 22 kHz audio data. However, I've encountered some problems. In my language, the phonemizer isn't stable, so I've modified the training script to make it character-based instead. Despite these adjustments, the results of my training have been disappointing; the model only seems to produce random noise.

Below are the losses. If you've got any ideas or tips on how to rescue my poor model from its noisy fate, I'd be incredibly grateful. [four images of loss curves]

BankNatchapol avatar Mar 26 '24 12:03 BankNatchapol
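For readers unfamiliar with the character-based setup described above, here is a minimal sketch of the idea: the symbol inventory is built from the characters in the transcripts instead of from phonemizer output. This is not the poster's code or MeloTTS code; the function names `build_char_symbols` and `text_to_ids` and the `"_"` padding symbol are hypothetical and only illustrate the approach.

```python
def build_char_symbols(transcripts, pad="_"):
    """Collect every distinct character in the transcripts into a symbol list.

    The pad symbol (an assumption for this sketch) is placed at index 0,
    followed by the remaining characters in sorted order.
    """
    chars = sorted({ch for line in transcripts for ch in line})
    return [pad] + [c for c in chars if c != pad]


def text_to_ids(text, symbols):
    """Map each character of `text` to its index in `symbols`.

    Raises KeyError if the text contains a character that is not in the
    symbol list, which surfaces coverage gaps early.
    """
    symbol_to_id = {s: i for i, s in enumerate(symbols)}
    return [symbol_to_id[ch] for ch in text]


# Hypothetical toy transcripts, only for illustration.
symbols = build_char_symbols(["ab", "bc"])
ids = text_to_ids("cab", symbols)
```

Because the IDs are just positions in the list, the same list in the same order has to be used everywhere the model encodes text.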

If you only look at the loss/g/total curve, it seems quite normal (going by my fine-tuning case). Could you please provide your config.json and also mention what other modifications you have made to the program?

jeremy110 avatar Mar 29 '24 01:03 jeremy110

Thanks for replying. Here's my config.json. I modified only the text parts (g2p, symbols, bert). (attachment: config_2.json)

BankNatchapol avatar Mar 30 '24 09:03 BankNatchapol

I have looked at the config.json file; it seems you didn't add num_languages and num_tones there? My fine-tuning process was actually similar to yours. I trained a new language using IPA, and some of its symbols were not present in the original config.json, so, like you, I replaced some of the symbols with my own. I also made a mistake at first: I changed the symbols directly in config.json, but training reads symbols.py while inference reads config.json, so the two were inconsistent. Consequently, the model couldn't understand the sounds properly.

jeremy110 avatar Mar 31 '24 03:03 jeremy110
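The mismatch described in the comment above (training reading symbols.py while inference reads config.json) can be checked before training. The following is a rough sketch, not part of MeloTTS: `compare_symbols` is a hypothetical helper, and it assumes the config is a dict with a top-level `symbols` list plus the `num_languages` and `num_tones` keys mentioned in the comment. Adjust the key names if your config is laid out differently.

```python
def compare_symbols(train_symbols, config):
    """Report differences between the training symbol list and the config.

    `train_symbols` is the list used during training; `config` is the parsed
    config as a dict. The key names used here are assumptions for this sketch.
    """
    train_symbols = list(train_symbols)
    config_symbols = list(config.get("symbols", []))

    # Symbol IDs are list positions, so order matters as much as membership.
    first_diff = next(
        (i for i, (a, b) in enumerate(zip(train_symbols, config_symbols)) if a != b),
        None,
    )
    if first_diff is None and len(train_symbols) != len(config_symbols):
        first_diff = min(len(train_symbols), len(config_symbols))

    return {
        "only_in_training": sorted(set(train_symbols) - set(config_symbols)),
        "only_in_config": sorted(set(config_symbols) - set(train_symbols)),
        "first_index_mismatch": first_diff,
        "missing_keys": [k for k in ("num_languages", "num_tones") if k not in config],
    }


# Hypothetical toy data: same symbols, different order, one key missing.
report = compare_symbols(
    ["_", "a", "b"],
    {"symbols": ["_", "b", "a"], "num_tones": 1},
)
```

In practice the config dict would come from `json.load` on your config.json, and the training list from whatever module your training script actually imports. An empty report (no differing symbols, `first_index_mismatch` of `None`, no missing keys) suggests the two sides agree.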

Might I ask why 22 kHz and not 44.1 kHz (44100 Hz)? Is there a particular reason?

ZeaMays14142 avatar Feb 21 '25 18:02 ZeaMays14142

What modifications are needed to train the model from scratch, and how many hours of data are required?

Fariq-22 avatar Apr 17 '25 10:04 Fariq-22