flowtron
Training on single-speaker (male) Hindi dataset - unable to attend (flow=1)
Continuing the conversation from this somewhat related issue - https://github.com/NVIDIA/flowtron/issues/39 - but opening a new issue since my model is unable to attend even with 1 flow.
My issue is also somewhat similar to https://github.com/NVIDIA/flowtron/issues/41 - I have now trained my model for 790,000 steps. Validation loss seems to have hit a minimum at around 360k steps, at which point attention was biased; further training made attention vanish and validation loss slowly increase.
Attached below are attention plots at 215k, 360k, and 790k steps, plus the validation loss curve.
I am wondering how to proceed. The options I'm considering are:
- increase flow=2 and warmstart with the 790k checkpoint (see the config sketch after this list).
- increase flow=2 and warmstart with the 360k checkpoint. I deleted that checkpoint to save some disk space and am now regretting it - I'll have to restart training and get back to 360k steps.
- Train a new Tacotron 2 model on my Hindi dataset and warmstart Flowtron from that.
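For the first two options the change is small: bump n_flows to 2 and point the warmstart path at the old checkpoint. A minimal sketch, assuming the key names in the stock flowtron config.json (model_config / train_config) and a placeholder checkpoint path:

```python
import json

# Sketch only: key names follow the stock flowtron config.json layout;
# double-check them against your own config before using.
with open("config.json") as f:
    config = json.load(f)

config["model_config"]["n_flows"] = 2
# Placeholder path - point this at whichever checkpoint you warmstart from.
config["train_config"]["warmstart_checkpoint_path"] = "outdir/model_790000"

with open("config_2flow.json", "w") as f:
    json.dump(config, f, indent=4)
```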
@rafaelvalle Would love some advice at this point.
Also attaching some inference files below. The speech is unintelligible, though.
sid0_sigma0.5.wav.zip
Update: I trained Tacotron 2 with the same data and got good alignment and speech generation. I used that model to warmstart Flowtron, but got very little attention even at 500k steps. Decided to warmstart from the LJS model instead, hoping for better results.
Warmstarting from LJS didn't work either. I cleaned my dataset again, removing silences from the start and end, and am now trying to warmstart from my Tacotron 2 model again.
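For context, warmstarting from a Tacotron 2 checkpoint essentially means copying over whatever pretrained tensors match the Flowtron model by name and shape (mostly the text embedding and encoder). A generic PyTorch sketch, not the repo's actual warmstart function, with a hypothetical helper name:

```python
import torch

def warmstart_from_tacotron2(flowtron_model, tacotron2_path):
    """Copy over any Tacotron 2 weights whose names and shapes match.

    Generic sketch: the real key names depend on how the checkpoint was
    saved; anything that doesn't match keeps its fresh initialization.
    """
    ckpt = torch.load(tacotron2_path, map_location="cpu")
    pretrained = ckpt.get("state_dict", ckpt)

    target = flowtron_model.state_dict()
    matched = {k: v for k, v in pretrained.items()
               if k in target and v.shape == target[k].shape}
    target.update(matched)
    flowtron_model.load_state_dict(target)
    print(f"warmstarted {len(matched)} matching tensors")
    return flowtron_model
```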
An update, since this issue has been open for a long time. My model learned to attend, kinda. It still has issues during inference and I've been playing around with the inference config, but at least I'm seeing some attention.
Solution:
- I created a new 3-speaker dataset
- I cleaned it diligently, removing all silence from the beginning and end.
- I also removed long silences from the middle of the audio clips (rough trimming sketch after this list).
- CAVEAT: The transcriptions of the audio are not 100% accurate. There are a few wrong but similar-sounding transcriptions.
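One way to do this kind of cleanup is sketched below, assuming librosa and soundfile; top_db=30 and the pause cap are starting values I'd tune per dataset rather than fixed numbers.

```python
import numpy as np
import librosa
import soundfile as sf

def clean_clip(in_path, out_path, top_db=30, max_pause_sec=0.3):
    """Trim leading/trailing silence and cap long pauses inside a clip."""
    audio, sr = librosa.load(in_path, sr=None)

    # Drop silence at the beginning and end.
    audio, _ = librosa.effects.trim(audio, top_db=top_db)

    # Find the non-silent intervals and rebuild the clip, keeping at most
    # max_pause_sec of silence between consecutive intervals.
    intervals = librosa.effects.split(audio, top_db=top_db)
    if len(intervals) == 0:
        return  # clip is entirely silence; skip it
    max_gap = int(max_pause_sec * sr)
    pieces, prev_end = [], None
    for start, end in intervals:
        if prev_end is not None:
            pieces.append(audio[prev_end:min(start, prev_end + max_gap)])
        pieces.append(audio[start:end])
        prev_end = end
    sf.write(out_path, np.concatenate(pieces), sr)
```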
All three speakers have around 14 hours of data, yet some speakers attend better than others.
Here are the graphs, and some samples.
I'm now using my model from 920k iterations to warmstart a 2-flow model, hoping it improves attention and inference quality.
@astricks thank you for letting us know that cleaning the data helped the model learn attention. Can you please share how the attention looks at 200k iterations?
@rafaelvalle attaching attention plot for earlier iterations
Great. I suggest resuming from the model with 200k iters given that it has better generalization loss and less bias in the attention map.
Gotcha, thanks! Just warmstarted with the 200k-iteration checkpoint and n_flows=2.
How did you trim the silence from the start and end - librosa.effects.trim or another method? Did you get attention with only the LJS corpus? In the paper, three datasets were also used to get attention with n_flows=1, then n_flows=2. However, when I use only the LJS corpus I can't get attention with n_flows=1, even when warmstarting from a Tacotron 2 model that has good attention, and I trim silence with librosa at top_db=30.