
Training on single speaker (male) hindi dataset - unable to attend (flow=1)

astricks opened this issue 4 years ago · 8 comments

Continuing the conversation from this somewhat related issue - https://github.com/NVIDIA/flowtron/issues/39 - but opening a new issue since my model is unable to attend even with 1 flow.

My issue is also somewhat similar to https://github.com/NVIDIA/flowtron/issues/41 - I have now trained my model for 790,000 steps. Validation loss seems to have hit a minimum at around 360k steps, at which point attention was biased; further training made attention vanish and validation loss slowly increase.

Attached below are attention plots for steps 215k, 360k, and 790k, plus the validation loss curve. (Images: attn215k, attn360k, attn790k, valLoss)

I am wondering how to proceed. The options I'm considering are:

  1. Increase n_flows to 2 and warmstart with the 790k checkpoint.
  2. Increase n_flows to 2 and warmstart with the 360k checkpoint. I deleted that checkpoint to save some disk space and am now regretting it - I'll have to restart training and get back to 360k steps.
  3. Train a new Tacotron 2 model on my Hindi dataset and warmstart Flowtron from that.
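For reference, warmstarting from a checkpoint can be sketched in PyTorch as below. This is a minimal sketch, not Flowtron's actual warmstart code: the `"state_dict"` checkpoint key and the shape-mismatch filtering (useful when, e.g., the speaker embedding size differs between models) are assumptions.

```python
import torch

def warmstart(checkpoint_path, model):
    """Copy matching weights from a checkpoint into `model`, skipping
    any parameters whose names or shapes differ (e.g. a resized
    speaker embedding when the number of speakers changes)."""
    ckpt = torch.load(checkpoint_path, map_location="cpu")
    pretrained = ckpt.get("state_dict", ckpt)
    own = model.state_dict()
    # Keep only parameters present in the new model with identical shapes
    filtered = {k: v for k, v in pretrained.items()
                if k in own and v.shape == own[k].shape}
    own.update(filtered)
    model.load_state_dict(own)
    return model
```

Loading with `strict=False` semantics like this lets training resume even when the architectures don't match exactly, which is the usual situation when moving between 1-flow and 2-flow configurations.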

@rafaelvalle Would love some advice at this point.

Also attaching some inference samples below. The speech is senseless, though. (Attachments: sid0_sigma0.5_attnlayer0, sid0_sigma0.5.wav.zip)

astricks avatar Aug 03 '20 16:08 astricks

Update: I trained Tacotron 2 with the same data and got good alignment and speech generation. Used that model to warmstart Flowtron, but got very little attention even at 500k steps. Decided to warmstart from LJS instead, hoping I get better results with that.

astricks avatar Aug 13 '20 19:08 astricks

Warmstarting on LJS didn't work either. I cleaned my dataset again, removing silences from the start and end of each clip. Now trying to warmstart off the Tacotron 2 model again.

astricks avatar Aug 18 '20 03:08 astricks

An update, since this issue has been open for a long time: my model has learned to attend, more or less. It still has issues during inference and I've been playing around with the inference config, but at least I'm seeing some attention.

Solution:

  1. I created a new 3-speaker dataset.
  2. I cleaned it diligently, removing all silence from the beginning and end.
  3. I also removed long silences from the middle of the audio clips.
  4. CAVEAT: The transcriptions of the audio are not 100% accurate. There are a few wrong but similar-sounding transcriptions.

All three speakers have around 14 hours of data, yet some speakers attend better than others.

Here are the graphs, and some samples.

(Screenshots: attention plots and validation loss curves, Oct 5 2020)

I'm now using my model from 920k iterations to warmstart a 2-flow model, hoping it improves attention and the quality of inference.

astricks avatar Oct 05 '20 14:10 astricks

@astricks thank you for letting us know cleaning the data has helped the model learn attention. Can you please share how the attention looks at 200k iterations?

rafaelvalle avatar Oct 05 '20 18:10 rafaelvalle

@rafaelvalle attaching attention plot for earlier iterations

(Screenshots: attention plots at earlier iterations, Oct 5 2020)

astricks avatar Oct 05 '20 21:10 astricks

Great. I suggest resuming from the 200k-iteration model, given that it has better generalization loss and less bias in the attention map.

rafaelvalle avatar Oct 05 '20 22:10 rafaelvalle

Gotcha, thanks! Just warmstarted with 200k iteration checkpoint and n_flows=2.

astricks avatar Oct 05 '20 22:10 astricks


How did you trim silence from the start and end - librosa.effects.trim or another method? And did you get attention with only the LJS corpus? In the paper, three datasets were used to learn attention with n_flows=1, then n_flows=2. However, when I use only the LJS corpus, I can't get attention with n_flows=1, even warmstarting from a Tacotron 2 model that has good attention, and I trim silence with librosa at top_db=30.

tricky61 avatar Nov 17 '20 03:11 tricky61