ProDiff
Model Training
Hi, thanks for the repo. I wanted to check in about the training metrics. I am currently training the ProDiff Teacher model and getting the following validation results, which report several metrics.
Valid results: {'total_loss': 1.4424, 'ssim': 0.199, 'l1': 0.1656, 'pdur': 0.0264, 'wdur': 0.058, 'sdur': 0.0241, 'uv': 0.685, 'f0': 0.2621, 'e': 0.0222}
Valid results: {'total_loss': 1.7038, 'ssim': 0.1944, 'l1': 0.1393, 'pdur': 0.026, 'wdur': 0.0551, 'sdur': 0.017, 'uv': 1.0061, 'f0': 0.2418, 'e': 0.024}
Valid results: {'total_loss': 1.8448, 'ssim': 0.1891, 'l1': 0.1387, 'pdur': 0.0265, 'wdur': 0.0515, 'sdur': 0.0091, 'uv': 1.1739, 'f0': 0.2328, 'e': 0.0233}
- Which metrics should I watch to tell whether the model is improving over time? It seems that the total_loss is increasing.
- What is a reasonable loss for fine-tuning? I am currently starting from the pre-trained checkpoints you provided.
- Also, I noted that after training the ProDiff Teacher model, I need to run training with the ProDiff yaml. Do we need to train the FastDiff model too?
Also, I am training on a dataset of 20 minutes, if it matters.
Thanks a lot!
Hi, you may find the l1 loss in the training schedule useful for judging learning or fine-tuning progress; please do not rely on the metrics from the validation schedule. Secondly, it would be fine to tune the FastDiff model too, which will improve the audio quality.
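As a rough illustration of tracking a single metric such as l1 over time, here is a minimal sketch. It is not part of the ProDiff repo: the helper names (`parse_metrics`, `metric_history`, `is_improving`) are hypothetical, and it assumes the training log prints a Python-style dict per line, like the `Valid results:` lines quoted earlier in this thread (which are reused here only as sample input).

```python
import ast
import re

# Sample lines copied from this thread. A training log is ASSUMED to use
# the same "<prefix>: {dict}" layout; adjust the parsing if yours differs.
LOG_LINES = [
    "Valid results: {'total_loss': 1.4424, 'ssim': 0.199, 'l1': 0.1656, 'pdur': 0.0264, 'wdur': 0.058, 'sdur': 0.0241, 'uv': 0.685, 'f0': 0.2621, 'e': 0.0222}",
    "Valid results: {'total_loss': 1.7038, 'ssim': 0.1944, 'l1': 0.1393, 'pdur': 0.026, 'wdur': 0.0551, 'sdur': 0.017, 'uv': 1.0061, 'f0': 0.2418, 'e': 0.024}",
    "Valid results: {'total_loss': 1.8448, 'ssim': 0.1891, 'l1': 0.1387, 'pdur': 0.0265, 'wdur': 0.0515, 'sdur': 0.0091, 'uv': 1.1739, 'f0': 0.2328, 'e': 0.0233}",
]


def parse_metrics(line):
    """Return the metrics dict embedded in a log line, or None if absent."""
    match = re.search(r"\{.*\}", line)
    if match is None:
        return None
    # literal_eval only accepts Python literals, so it is safe on log text.
    return ast.literal_eval(match.group(0))


def metric_history(lines, key):
    """Collect the values of one metric, in log order."""
    history = []
    for line in lines:
        metrics = parse_metrics(line)
        if metrics is not None and key in metrics:
            history.append(metrics[key])
    return history


def is_improving(history):
    """Crude check: the latest value is lower than the first one."""
    return len(history) >= 2 and history[-1] < history[0]


l1_history = metric_history(LOG_LINES, "l1")
total_history = metric_history(LOG_LINES, "total_loss")
print("l1:", l1_history, "improving:", is_improving(l1_history))
print("total_loss:", total_history, "improving:", is_improving(total_history))
```

On the sample lines, l1 decreases while total_loss increases, which is why looking at one component at a time can be more informative than the summed loss.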
Hi @Rongjiehuang, I am back with some results better than the last run.
[Audio samples: Reference Audio | Synthesized Audio - 26k epochs | Synthesized Audio - 100k epochs]
I have trained this model up to 26k steps by fine-tuning your released ProDiff-Teacher model. The current l1 training loss is 0.0107. I noticed there are still artefacts in the synthesized voice compared to the reference audio. Currently, I am training on a relatively small dataset of around 20 minutes, compared to ~20 hours for LJSpeech.
I am thinking there could be several options here:
- Increase the number of training epochs
- Increase dataset size
- Fine-tune ProDiff as well
- Look into how to fine-tune the FastDiff model, as you mentioned
Also, how does the audio quality of the ProDiff Teacher compare to the ProDiff output?
Thanks a lot for releasing this repo.
@keelezibel I got total_loss = 0! Did you also get a training loss of 0?
@AIFahim No, I didn't get zero for the training loss. You can refer to my first screenshot. I eventually settled on PaddleSpeech TTS, as I couldn't figure out how to improve the quality of the voice further. Also, there were too many dependencies on other existing projects.