
Training help

Open pneumoman opened this issue 4 years ago • 7 comments

I've created a training set based on LibriSpeech with two additional speakers. I've pruned out speakers that have a small amount of data, as well as samples that are very short or very long, similar to the description of your filelists. I've used librosa to trim the silences at the beginning and end, and I've converted the byte tensors to bool as required by later versions of PyTorch. I also added an automated learning-rate annealer that currently halves the learning rate every 30K iterations. I'm warm-starting from your LJS model.
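The annealing schedule described above can be sketched as a simple step function (a minimal illustration only; the base rate and the 30K interval are placeholders, not values from the Flowtron code):

```python
def annealed_lr(base_lr, iteration, interval=30_000, factor=0.5):
    """Halve the learning rate every `interval` iterations."""
    return base_lr * factor ** (iteration // interval)

# Starting from 1e-4, the rate halves at 30K, 60K, ...
print(annealed_lr(1e-4, 0))       # 0.0001
print(annealed_lr(1e-4, 30_000))  # 5e-05
print(annealed_lr(1e-4, 65_000))  # 2.5e-05
```

In PyTorch this is essentially what `torch.optim.lr_scheduler.StepLR(optimizer, step_size=30_000, gamma=0.5)` does when stepped per iteration.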

I've done a number of runs and they all seem to diverge rapidly at about 60K iterations, regardless of learning rate. The training loss continues to drop, albeit very slowly; the validation loss rises fairly rapidly and plateaus at about -2. The attention looks pretty much like TV static.

Unfortunately I don't have a way at this time to show you my plots.

I have had some luck with another dataset based on LibriTTS.

Any thoughts would be appreciated.

pneumoman avatar Oct 05 '20 19:10 pneumoman

Have you trimmed silences from beginning and end of the audio files?

rafaelvalle avatar Oct 05 '20 19:10 rafaelvalle

Yes, I added librosa.effects.trim inside the load_wav_to_torch call; I had done this earlier for Mellotron. I may need to do some testing of the threshold, as I think I've knocked it out of adjustment in my frustration. In the other run I mentioned, I've left it running for a while, and the validation loss has continued to rise, but the attention graphs have started to gain definition. Also, I can show those to you 8^)
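For reference, the real call is `librosa.effects.trim(audio, top_db=...)`; the idea can be sketched in plain NumPy like this (a toy stand-in that uses a linear amplitude threshold instead of decibels, and the threshold value is an arbitrary placeholder):

```python
import numpy as np

def trim_silence(audio, threshold=0.01):
    """Drop leading/trailing samples whose absolute amplitude falls
    below `threshold` (a toy stand-in for librosa.effects.trim)."""
    voiced = np.flatnonzero(np.abs(audio) >= threshold)
    if voiced.size == 0:  # the whole clip is silence
        return audio[:0]
    return audio[voiced[0]:voiced[-1] + 1]

# Silence, a short voiced burst, silence again
signal = np.array([0.0, 0.001, 0.5, -0.3, 0.002, 0.0])
print(trim_silence(signal))  # [ 0.5 -0.3]
```

As the thread suggests, the threshold (`top_db` in librosa) matters: set too aggressively, it clips voiced audio; set too loosely, it leaves the silences that hurt attention learning.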

[attached: five training plots]

pneumoman avatar Oct 05 '20 23:10 pneumoman

You can stop training once you confirm that the validation loss is clearly going up. In this model above, is the attention good around 400k right before you decrease the learning rate? Are you training from scratch or are you training from our pre-trained model?

rafaelvalle avatar Oct 05 '20 23:10 rafaelvalle

I started with the LJS pretrained model. At 400K the attention was not complete, flow zero was almost a blank screen. My other model has similar curves, but no attention.

pneumoman avatar Oct 06 '20 03:10 pneumoman

Please try starting with the LJSpeech model and training on LJS and your data. Copy the value of the LJ speaker embeddings to the 0-th embedding of the new speaker embedding. You can probably go directly to the model with 2 steps of flow.
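The embedding copy suggested above can be sketched as follows (NumPy arrays stand in for the model's speaker-embedding weight matrix; the names, sizes, and random initialization are illustrative, not taken from the Flowtron checkpoint code):

```python
import numpy as np

rng = np.random.default_rng(0)
embedding_dim = 128

# Pre-trained LJS model: a single speaker embedding (row 0)
ljs_embedding = rng.standard_normal((1, embedding_dim))

# New model: fresh table for N speakers, randomly initialized
n_speakers = 3
new_embedding = rng.standard_normal((n_speakers, embedding_dim))

# Warm start: copy the LJ speaker's vector into slot 0 so that
# speaker id 0 in the new filelists behaves like the LJ speaker
new_embedding[0] = ljs_embedding[0]

print(np.allclose(new_embedding[0], ljs_embedding[0]))  # True
```

In the actual model the same copy would be applied to the `weight` tensor of the speaker `torch.nn.Embedding` before resuming training.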

rafaelvalle avatar Oct 06 '20 03:10 rafaelvalle

I will try that. In my frustration, I noticed that with one GPU the validation code always works on one specific example, so you may not be getting a good sense of how well your model is doing overall. I modified the validation to take a random sample of ten items. When I did this, the validation loss of my problematic model no longer began to rise early. I still need to see whether the model attends, but at least I don't feel like the validation is misleading.
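The change described above, averaging the loss over a random subset instead of always validating on the same fixed example, can be sketched like this (a generic illustration; it does not reproduce Flowtron's actual validation loop):

```python
import random

def validate(loss_fn, val_set, n_samples=10, seed=None):
    """Average a per-item loss over a random subset of the validation
    set, rather than over one fixed example."""
    rng = random.Random(seed)
    batch = rng.sample(val_set, min(n_samples, len(val_set)))
    return sum(loss_fn(x) for x in batch) / len(batch)

# Toy example: the 'loss' of an item is just its value
val_set = list(range(100))
loss = validate(lambda x: float(x), val_set, n_samples=10, seed=42)
print(0.0 <= loss <= 99.0)  # True
```

Fixing the seed makes successive validation passes comparable; without it, run-to-run noise from the sampling can mask a genuine trend in the curve.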

pneumoman avatar Oct 07 '20 13:10 pneumoman

Setting the attention prior to True will help the model learn attention much faster and allows training at least 2 steps of flow at the same time. https://github.com/NVIDIA/flowtron/blob/master/config.json#L34
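A minimal sketch of the relevant fragment of config.json (the key name `use_attn_prior` and its location under `data_config` are my reading of the linked line and should be verified against the current repository):

```json
{
  "data_config": {
    "use_attn_prior": true
  }
}
```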

rafaelvalle avatar Mar 16 '21 23:03 rafaelvalle