IMS-Toucan
Overfitting - how to detect and stop training?
I have an issue with overfitting on the data, which seems to degrade the Cherokee portion of the output.
At later iterations, the Cherokee output starts dropping trailing syllables that begin with an 'h', even though these are rendered fine at earlier iterations.
I've been training for more iterations to get better voice matching between the samples and the model for dataset-specific voices.
Is there a way to get the loss on a per-language basis?
I'm currently retraining the aligner with the tape-sourced Cherokee audio removed. I will then train the TTS again to see if that helps.
I suppose this question also applies to the aligner.
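To illustrate what I mean, here is a minimal sketch of the kind of per-language loss tracking I'm after; the model, batches, and language tags are toy placeholders, not the repo's actual API:

```python
from collections import defaultdict

import torch

# Minimal sketch of per-language loss tracking. The model, the data, and
# the language tag on each batch are toy placeholders; as far as I can
# tell, nothing like this is exposed out of the box.
model = torch.nn.Linear(8, 8)
criterion = torch.nn.MSELoss()

# Pretend each batch is (inputs, targets, language_tag).
batches = [
    (torch.randn(4, 8), torch.randn(4, 8), "chr"),
    (torch.randn(4, 8), torch.randn(4, 8), "en"),
    (torch.randn(4, 8), torch.randn(4, 8), "chr"),
]

loss_sums = defaultdict(float)
loss_counts = defaultdict(int)

for inputs, targets, lang in batches:
    loss = criterion(model(inputs), targets)
    loss_sums[lang] += loss.item()
    loss_counts[lang] += 1

for lang, total in loss_sums.items():
    print(f"{lang}: mean loss {total / loss_counts[lang]:.4f}")
```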
I recommend not training from scratch on your low-resource data, and instead using it only for finetuning. The idea of the Meta model is to train on a lot of data from resource-rich languages, then take this trained model and finetune it on the low-resource language. The best results in the low-resource language are achieved if you finetune only on that language, but then the model loses its multilinguality and the other languages degrade. To prevent that, you can do the joint training you are doing now, with high- and low-resource data mixed, but then you get slightly worse performance in the low-resource language.
In any case, you should pretrain for a longer time with only the high-resource data and then finetune for around 30k steps on the low-resource data (depending on how many minutes of low-resource data you have; the more data, the more steps you can train). And once the pretrained model is ready, the idea is that anyone who wants to apply this to a low-resource language can skip the first step, download it, and finetune directly.
For the aligner it's OK to overfit the ASR objective a little, so the number of steps doesn't need to be chosen as carefully there. Finetuning the pretrained aligner for ~5 epochs on the low-resource data should be fine either way.
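As a rough illustration of that workflow, here is a minimal PyTorch sketch; the model, checkpoint path, learning rate, and step counts are illustrative stand-ins, not the actual training pipeline code from the repo:

```python
import itertools

import torch

# Sketch of the pretrain-then-finetune workflow. The model, data, checkpoint
# path, and hyperparameters are illustrative stand-ins only.
model = torch.nn.Linear(8, 8)
criterion = torch.nn.MSELoss()

# Step 1: load the weights from the long high-resource pretraining run.
# (Path is hypothetical; in practice you'd point at the released checkpoint.)
# checkpoint = torch.load("Models/pretrained.pt")
# model.load_state_dict(checkpoint["model"])

# Step 2: finetune on the low-resource data only, with a limited step budget
# (around 30k in the advice above; scaled down here so the sketch runs fast).
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
finetune_steps = 300  # stand-in for ~30k

low_resource_batches = itertools.cycle(
    [(torch.randn(4, 8), torch.randn(4, 8)) for _ in range(10)]
)

for step, (inputs, targets) in enumerate(low_resource_batches):
    if step >= finetune_steps:
        break
    optimizer.zero_grad()
    loss = criterion(model(inputs), targets)
    loss.backward()
    optimizer.step()
```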
I "augmented" my data with the following script: https://github.com/CherokeeLanguage/cherokee-audio-data/blob/main/create_augmented.py
Adding the longer combined sequences has greatly improved the quality of the output, and so far the dropped syllables and doubled 's' sounds have not recurred.
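The core of the augmentation is stitching shorter clips into longer combined utterances. Here is a rough, self-contained sketch of the idea (not the linked script itself; the file names and the silence gap are arbitrary, and the clips are assumed to exist on disk):

```python
import random

import numpy as np
import soundfile as sf

# Rough sketch of the augmentation idea: stitch random pairs of short
# clips into longer combined utterances. This is an illustration of the
# approach, not the linked script; the 300 ms gap is arbitrary.
clips = [
    ("clip_a.wav", "transcript a"),
    ("clip_b.wav", "transcript b"),
    ("clip_c.wav", "transcript c"),
]

for i in range(3):  # number of augmented samples to create
    (wav_1, text_1), (wav_2, text_2) = random.sample(clips, 2)
    audio_1, sr = sf.read(wav_1)
    audio_2, _ = sf.read(wav_2)
    gap = np.zeros(int(0.3 * sr))  # 300 ms of silence between clips
    combined = np.concatenate([audio_1, gap, audio_2])
    sf.write(f"augmented_{i}.wav", combined, sr)
    print(f"augmented_{i}.wav: {text_1} {text_2}")
```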
New pretrained models are available now; maybe the overfitting issues can be fixed with those? If you finetune a pretrained model on your data for just a few epochs, it should reach peak performance before overfitting starts to set in.
Sorry I haven't responded in a while. Family and health issues.
I tracked down part of the "overfitting" to the aligner: it seems the aligner needed retraining on the exact data I was working with.
I haven't had time yet to write the code needed to finetune the new model and see how the two compare.
However, I do have a working model now, with several acceptable voices.
Sorry to hear that, glad you're doing better now!
The aligner is finetuned automatically whenever you create a new FastSpeech dataset and there is no existing aligner model in the corresponding cache directory. If the following argument to the corpus preparation function is true (which it is by default), it will finetune the aligner for 5 epochs if a general aligner model is available, or train one from scratch for 50 epochs if there is no general aligner checkpoint to finetune from.
https://github.com/DigitalPhonetics/IMS-Toucan/blob/702a8d98c5d2e1b28d3eb9ec0649ca7647f12bcc/Utility/corpus_preparation.py#L19
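Paraphrased in code, the decision logic looks roughly like this (names are simplified and hypothetical; the linked source is authoritative):

```python
# Paraphrase of the corpus preparation behavior described above; names are
# simplified and hypothetical, see the linked source for the real logic.
def aligner_action(train_aligner: bool, cache_has_aligner: bool,
                   general_checkpoint_exists: bool) -> str:
    if cache_has_aligner:
        return "reuse the aligner already in the cache directory"
    if not train_aligner:
        return "skip aligner training entirely"
    if general_checkpoint_exists:
        return "finetune the general aligner for 5 epochs"
    return "train an aligner from scratch for 50 epochs"


print(aligner_action(train_aligner=True,
                     cache_has_aligner=False,
                     general_checkpoint_exists=True))
# -> finetune the general aligner for 5 epochs
```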
There are also some plots saved for debugging in the corresponding Corpora directory. If you re-extracted your features since the most recent release (which broke backward compatibility with old caches and checkpoints), the finetuning should already have happened. Maybe it needs to run for longer on your data? There's no argument for that; for now you'd need to change it directly in the code:
https://github.com/DigitalPhonetics/IMS-Toucan/blob/702a8d98c5d2e1b28d3eb9ec0649ca7647f12bcc/Utility/corpus_preparation.py#L39