
Option to generate audio files to hear how the training evolves

Ca-ressemble-a-du-fake opened this issue 2 years ago • 2 comments

Hi,

I haven't found an option to generate audio files every now and then to check how the training is progressing and to catch overfitting.

On the Weights & Biases website, or on disk, only the mel spectrograms are available. It would be great if it were also possible to have audio files of the test sentences.

If it slows down the training too much, then audio generation could be gated behind an option.

I know that I can work around this by merging the last checkpoints and then loading the resulting checkpoint to infer the test sentences, but I find this process cumbersome, and it sometimes causes the training to stop (maybe because of an out-of-memory error).
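For reference, a minimal sketch of that averaging workaround, assuming plain PyTorch checkpoints with a "model" key; the file names and the key are placeholders, not necessarily how IMS-Toucan actually stores its checkpoints:

```python
import torch

# Placeholder checkpoint names; the real files and their layout may differ.
checkpoint_paths = ["checkpoint_190000.pt", "checkpoint_195000.pt", "checkpoint_200000.pt"]

# Load the saved model weights and average them parameter by parameter.
state_dicts = [torch.load(path, map_location="cpu")["model"] for path in checkpoint_paths]
averaged = {key: sum(sd[key].float() for sd in state_dicts) / len(state_dicts)
            for key in state_dicts[0]}

# The averaged weights are saved and later loaded into the TTS model
# to synthesize the test sentences with a separate inference step.
torch.save({"model": averaged}, "averaged_checkpoint.pt")
```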

If needed I can help implement this feature!

Ca-ressemble-a-du-fake avatar Dec 14 '22 09:12 Ca-ressemble-a-du-fake

Hi! The problem with this is that creating the actual audio requires an already trained vocoder model. When the toolkit started out there was no such pretrained vocoder, so I did not want to assume that one exists when training the TTS. Now there is a vocoder model in the releases that could be used for this.

However, using different versions of the vocoder would lead to differences and make audios across runs misleading, because changes might come from the vocoder rather than the underlying TTS. Different runs could only be compared if the same vocoder checkpoint were always used. And for the progress within a run, I believe the spectrogram is sufficient to see whether there are any problems. TTS models don't really overfit in the classical sense, so an audio might actually be misleading.

Because of all of those considerations I decided against creating audios during training at some point in the past. I have put it on the list of possible features for a future version. If it's optional and turned off by default, it would probably be fine, so I might do it for the next release or the one after.
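If it is added as an opt-in feature, it could look roughly like the sketch below: one fixed vocoder checkpoint is loaded once and applied to the evaluation spectrograms, so audios stay comparable across steps. The `soundfile` output, the 16 kHz sample rate, and the shape expected by `vocoder` are assumptions, not the toolkit's actual API.

```python
import torch
import soundfile as sf

def log_eval_audio(vocoder, mel, out_path, sample_rate=16000):
    """Turn an evaluation mel spectrogram into a wav using one fixed,
    pretrained vocoder.

    `vocoder` is assumed to be an nn.Module mapping a (1, mel_bins, frames)
    spectrogram to a waveform tensor; the sample rate is a placeholder.
    """
    vocoder.eval()
    with torch.inference_mode():
        wave = vocoder(mel.unsqueeze(0)).squeeze().cpu().numpy()
    sf.write(out_path, wave, sample_rate)
```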

Flux9665 avatar Dec 20 '22 21:12 Flux9665

Thanks for your reply! An alternative could be to create a Python script that takes as input a mel spectrogram, or a directory of mel spectrograms (i.e. the directory where the PNGs are saved), and outputs the corresponding WAV files.

If the PNGs are not detailed enough, the mel spectrogram data points could be saved as text files alongside the PNGs and loaded later by the script described above. The creation of these text files could be an option (e.g. --save_mel_spect_as_txt, False by default).

This approach has the benefit of keeping the vocoder code separate from the TTS code. A rough sketch of what such a script could look like is below.
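A minimal sketch of such a standalone script, assuming the spectrograms were dumped as plain-text or .npy matrices and that the vocoder checkpoint deserializes to a callable module (both of which are assumptions about the on-disk format):

```python
import argparse

import numpy as np
import soundfile as sf
import torch


def main():
    parser = argparse.ArgumentParser(description="Vocode a saved mel spectrogram into a wav.")
    parser.add_argument("mel_path", help="Mel spectrogram saved as a .txt or .npy matrix (mel_bins x frames)")
    parser.add_argument("vocoder_path", help="Pretrained vocoder checkpoint")
    parser.add_argument("out_path", help="Where to write the resulting wav")
    parser.add_argument("--sample_rate", type=int, default=16000)
    args = parser.parse_args()

    # Load the spectrogram and add a batch dimension.
    mel = np.loadtxt(args.mel_path) if args.mel_path.endswith(".txt") else np.load(args.mel_path)
    mel = torch.tensor(mel, dtype=torch.float32).unsqueeze(0)

    # Assumption: the checkpoint deserializes to a callable vocoder module.
    vocoder = torch.load(args.vocoder_path, map_location="cpu")
    vocoder.eval()
    with torch.inference_mode():
        wave = vocoder(mel).squeeze().cpu().numpy()

    sf.write(args.out_path, wave, args.sample_rate)


if __name__ == "__main__":
    main()
```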

Ca-ressemble-a-du-fake avatar Dec 21 '22 04:12 Ca-ressemble-a-du-fake