
Training on single speaker (male) hindi dataset - unable to attend (flow=1)

astricks opened this issue 4 years ago · 8 comments

Continuing the conversation from this somewhat related issue - https://github.com/NVIDIA/flowtron/issues/39 - but opening a new issue since my model is unable to attend even with 1 flow.

My issue is also somewhat similar to https://github.com/NVIDIA/flowtron/issues/41 - I have now trained my model for 790,000 steps. Validation loss seems to have hit a minimum at around 360k steps, at which point attention was biased; further training made attention vanish and validation loss slowly increase.

Attached below are attention plots for steps 215k, 360k, and 790k, plus the validation loss curve. (Images: attn215k, attn360k, attn790k, valLoss)

I am wondering how to proceed. The options I'm considering are:

  1. Increase n_flows to 2 and warmstart with the 790k checkpoint.
  2. Increase n_flows to 2 and warmstart with the 360k checkpoint. I deleted that checkpoint to save some disk space and am now regretting it - I'll have to restart training and get back to 360k steps.
  3. Train a new Tacotron 2 model on my Hindi dataset and warmstart Flowtron from that.
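For reference, warmstarting from a checkpoint can be sketched in PyTorch as below. This is a minimal sketch, not Flowtron's actual warmstart code: the `"state_dict"` checkpoint key and the shape-mismatch filtering (useful when, e.g., the speaker embedding size differs between models) are assumptions.

```python
import torch

def warmstart(checkpoint_path, model):
    """Copy matching weights from a checkpoint into `model`, skipping
    any parameters whose names or shapes differ (e.g. a resized
    speaker embedding when the number of speakers changes)."""
    ckpt = torch.load(checkpoint_path, map_location="cpu")
    pretrained = ckpt.get("state_dict", ckpt)
    own = model.state_dict()
    # Keep only parameters present in the new model with identical shapes
    filtered = {k: v for k, v in pretrained.items()
                if k in own and v.shape == own[k].shape}
    own.update(filtered)
    model.load_state_dict(own)
    return model
```

Loading with `strict=False` semantics like this lets training resume even when the architectures don't match exactly, which is the usual situation when moving between 1-flow and 2-flow configurations.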

@rafaelvalle Would love some advice at this point.

Also attaching some inference samples below. The speech is senseless, though. (Attachments: sid0_sigma0.5_attnlayer0, sid0_sigma0.5.wav.zip)

astricks avatar Aug 03 '20 16:08 astricks

Update: I trained Tacotron 2 with the same data and got good alignment and speech generation. Used that model to warmstart Flowtron, but got very little attention even at 500k steps. Decided to warmstart from LJS instead, hoping I get better results with that.

astricks avatar Aug 13 '20 19:08 astricks

Warmstarting on LJS didn't work either. I cleaned my dataset again, removing silences from the start and end of each clip. Now trying to warmstart off the Tacotron 2 model again.

astricks avatar Aug 18 '20 03:08 astricks

An update, since this issue has been open for a long time: my model has learned to attend, more or less. It still has issues during inference and I've been playing around with the inference config, but at least I'm seeing some attention.

Solution:

  1. I created a new 3-speaker dataset.
  2. I cleaned it diligently, removing all silence from the beginning and end.
  3. I also removed long silences from the middle of the audio clips.
  4. CAVEAT: The transcriptions of the audio are not 100% accurate. There are a few wrong but similar-sounding transcriptions.

All three speakers have around 14 hours of data, yet some speakers attend better than others.

Here are the graphs, and some samples.

(Screenshots: attention plots and validation loss curves, Oct 5 2020)

I'm now using my model from 920k iterations to warmstart a 2-flow model, hoping it improves attention and the quality of inference.

astricks avatar Oct 05 '20 14:10 astricks

@astricks thank you for letting us know cleaning the data has helped the model learn attention. Can you please share how the attention looks at 200k iterations?

rafaelvalle avatar Oct 05 '20 18:10 rafaelvalle

@rafaelvalle attaching attention plot for earlier iterations

(Screenshots: attention plots at earlier iterations, Oct 5 2020)

astricks avatar Oct 05 '20 21:10 astricks

Great. I suggest resuming from the 200k-iteration model, given that it has better generalization loss and less bias in the attention map.

rafaelvalle avatar Oct 05 '20 22:10 rafaelvalle

Gotcha, thanks! Just warmstarted with 200k iteration checkpoint and n_flows=2.

astricks avatar Oct 05 '20 22:10 astricks


How did you trim silence from the start and end - librosa.effects.trim or another method? And did you get attention with only the LJS corpus? In the paper, three datasets were used to learn attention with n_flows=1, then n_flows=2. However, when I use only the LJS corpus, I can't get attention with n_flows=1, even warmstarting from a Tacotron 2 model that has good attention, and I trim silence with librosa at top_db=30.

tricky61 avatar Nov 17 '20 03:11 tricky61