ProDiff Model Training

Hi, thanks for the repo. I wanted to check in about the training metrics. I am currently training the ProDiff Teacher model and getting the validation results below, which report several metrics.

Valid results: {'total_loss': 1.4424, 'ssim': 0.199, 'l1': 0.1656, 'pdur': 0.0264, 'wdur': 0.058, 'sdur': 0.0241, 'uv': 0.685, 'f0': 0.2621, 'e': 0.0222}

Valid results: {'total_loss': 1.7038, 'ssim': 0.1944, 'l1': 0.1393, 'pdur': 0.026, 'wdur': 0.0551, 'sdur': 0.017, 'uv': 1.0061, 'f0': 0.2418, 'e': 0.024}

Valid results: {'total_loss': 1.8448, 'ssim': 0.1891, 'l1': 0.1387, 'pdur': 0.0265, 'wdur': 0.0515, 'sdur': 0.0091, 'uv': 1.1739, 'f0': 0.2328, 'e': 0.0233}
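For anyone comparing logs like these, here is a minimal sketch for tabulating how each metric moved between the first and last validation line. It assumes only the `Valid results: {...}` format quoted above; the helper names `parse_valid_results` and `metric_deltas` are hypothetical and not part of the ProDiff repo.

```python
import ast

# The three "Valid results" lines quoted above, copied verbatim.
LOG_LINES = [
    "Valid results: {'total_loss': 1.4424, 'ssim': 0.199, 'l1': 0.1656, 'pdur': 0.0264, 'wdur': 0.058, 'sdur': 0.0241, 'uv': 0.685, 'f0': 0.2621, 'e': 0.0222}",
    "Valid results: {'total_loss': 1.7038, 'ssim': 0.1944, 'l1': 0.1393, 'pdur': 0.026, 'wdur': 0.0551, 'sdur': 0.017, 'uv': 1.0061, 'f0': 0.2418, 'e': 0.024}",
    "Valid results: {'total_loss': 1.8448, 'ssim': 0.1891, 'l1': 0.1387, 'pdur': 0.0265, 'wdur': 0.0515, 'sdur': 0.0091, 'uv': 1.1739, 'f0': 0.2328, 'e': 0.0233}",
]


def parse_valid_results(line):
    """Return the metrics dict from a 'Valid results: {...}' log line."""
    _, _, payload = line.partition("Valid results:")
    # literal_eval only accepts Python literals, so it is safe on log text.
    return ast.literal_eval(payload.strip())


def metric_deltas(lines):
    """Map each metric name to (first value, last value, last - first)."""
    runs = [parse_valid_results(line) for line in lines]
    first, last = runs[0], runs[-1]
    return {
        name: (first[name], last[name], round(last[name] - first[name], 4))
        for name in first
    }


if __name__ == "__main__":
    for name, (start, end, delta) in metric_deltas(LOG_LINES).items():
        trend = "up" if delta > 0 else "down" if delta < 0 else "flat"
        print(f"{name:>10}: {start:.4f} -> {end:.4f} ({trend} {abs(delta):.4f})")
```

On these three lines, the individual terms add up to `total_loss` within rounding, and the increase in `total_loss` comes almost entirely from `uv`, while `l1` and `f0` both decrease.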

Which metrics should I watch to tell whether the model is improving over time? It seems that the total_loss is increasing.
What is a reasonable loss for fine-tuning? I am currently starting from the pre-trained checkpoints you provided.
I also noted that after training the ProDiff Teacher model, I need to run training with the ProDiff yaml. Do we need to train the FastDiff model too?

In case it matters, I am training on a dataset of about 20 minutes of audio.

Thanks a lot!

Sep 25 '22 04:09 keelezibel

Hi, the l1 loss reported during training is a useful indicator of learning or fine-tuning progress; please do not rely on the metrics reported during validation. Secondly, it would be fine to fine-tune the FastDiff model too, which will improve the audio quality.

Oct 06 '22 09:10 Rongjiehuang

Hi @Rongjiehuang, I am back with results that are better than the last run.

[Audio samples: Reference Audio | Synthesized Audio - 26k epochs | Synthesized Audio - 100k epochs]

I have trained this model up to 26k steps by fine-tuning from your released ProDiff-Teacher model. The current l1 training loss is 0.0107. I noticed there are still artefacts in the synthesized voice compared to the reference audio. I am training on a much smaller dataset: around 20 minutes, compared to ~20 hours for LJSpeech.

I am thinking there could be several options here:

Increase the number of training epochs
Increase dataset size
Fine-tune the ProDiff model as well
Look into how to fine-tune the FastDiff model, as you mentioned

Also, how does the audio quality of the ProDiff Teacher compare to the ProDiff output?

Thanks a lot for releasing this repo.

Oct 17 '22 08:10 keelezibel

@keelezibel I got total_loss = 0! Did you also get a training loss of 0?

Jul 12 '23 08:07 AIFahim

@AIFahim No, I didn't get zero for the training loss. You can refer to my first screenshot. I eventually settled on PaddleSpeech TTS, as I couldn't figure out how to improve the voice quality further. Also, there were too many dependencies on other existing projects.

Jul 16 '23 06:07 keelezibel
