Should there be any noise output?
I'm having trouble getting any decent results out of flowtron and I'm trying to figure out why. With my somewhat small dataset (0.67hrs) and a warmstart from the LJS checkpoint, I can't seem to get anything but noise when doing inference on my checkpoints. I tried the LJS warmstart with both 1 flow and 2 flows and trained for 240k steps. I've also tried adjusting p_arpabet (1.0, 0.5), but no dice, and tried lowering the learning rate to 1e-5.
It seems I should be getting something other than noise at some point?
Results so far:
- pytorch 1.6, python 3.8: noise up to 200k+ steps
- pytorch 1.3, python 3.7.4: noise at step 5k (output shape (102400,)), at step 10k ((9984,)), and at step 20k ((2816,))
I know the dataset can't be too bad because deepvoice3 works on it to a reasonable degree...
A bit of an update: I'm training on the LJ dataset and I don't get noise, so something about my dataset is troublesome for flowtron. My data has a lot of shorter utterances, maybe 2-5 words. I also noticed that the loss decayed much, much faster on my data: -1.0 loss in under 500 steps, while LJ isn't even below 0.9 at 100k steps. I also noticed that my wavs are 32-bit while LJ's are 16-bit. My data was silently converted to 32-bit when I wrote it back out with librosa's wav writer after trimming silence. Oops! Retraining now. Hoping for the best.
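For anyone else who hits this, here's a minimal sketch of the check/fix I mean, using soundfile (the directory path and the in-place rewrite are just for illustration):

```python
import soundfile as sf
from pathlib import Path

# Rewrite any non-16-bit wavs as 16-bit PCM so they match LJS.
# WAV_DIR is hypothetical; point it at your own dataset.
WAV_DIR = Path("data/wavs")

for wav_path in sorted(WAV_DIR.glob("*.wav")):
    info = sf.info(wav_path)
    if info.subtype != "PCM_16":
        print(f"converting {wav_path} ({info.subtype} -> PCM_16)")
        audio, sr = sf.read(wav_path)  # float data in [-1, 1]
        sf.write(wav_path, audio, sr, subtype="PCM_16")
```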
Another update: it looks like the 32-bit wav data was indeed my issue. Now I get gibberish output, with the model never attending to the text. The attention weights still look poor after 1.6m steps, similar to https://github.com/NVIDIA/flowtron/issues/41 and others:
[Training loss plot]
I wonder if my dataset is too small? I have < 1hr of audio data. Would adding another speaker help? Another difference between my dataset and, say, LJS is that mine has many more short utterances (1-3 words).
I've done another cleaning pass on my data, checking the transcripts and removing things like laughter. At the same time, I'm training another model with this dataset plus one more dataset as an additional speaker, which makes almost 2hrs of data.
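Combining two datasets into one multi-speaker run is just a matter of giving each filelist a distinct speaker id. A minimal sketch, assuming flowtron's pipe-delimited `audio_path|text|speaker_id` filelist format (the filenames here are hypothetical):

```python
# Merge two single-speaker filelists into one multi-speaker filelist,
# retagging each line with the given speaker id.
def retag(src_path, speaker_id):
    with open(src_path, encoding="utf-8") as f:
        for line in f:
            audio_path, text = line.rstrip("\n").split("|")[:2]
            yield f"{audio_path}|{text}|{speaker_id}"

with open("train_filelist_2spk.txt", "w", encoding="utf-8") as out:
    for line in retag("speaker_a_train.txt", 0):
        print(line, file=out)
    for line in retag("speaker_b_train.txt", 1):
        print(line, file=out)
```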
Yet another update: still trying to figure out the differences between my dataset and LJS. Two remaining possibilities come to mind: utterance length and total dataset size. I trimmed out of my training dataset any sample shorter than 1s or longer than 10s, so the min/max duration distribution now roughly matches LJS. However, even after 345k steps, no attention was learned.
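For reference, the duration filter is just a pass over the filelist; a sketch using soundfile (filenames are my own placeholders):

```python
import soundfile as sf

# Keep only samples between 1s and 10s, to roughly match LJS durations.
with open("train_filelist.txt", encoding="utf-8") as f, \
     open("train_filelist_1to10s.txt", "w", encoding="utf-8") as out:
    for line in f:
        audio_path = line.split("|")[0]
        if 1.0 <= sf.info(audio_path).duration <= 10.0:
            out.write(line)
```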
I then created an LJS subset with only 500 samples (~0.9hrs). Also no attention after 350k steps. I'll try again at 1k samples (1.71hrs) and keep going up to figure out just how much data is required to learn attention.
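The subsets themselves are just random draws from the full LJS filelist; a minimal sketch (the seed and filenames are my own choices):

```python
import random

# Draw a fixed-size random subset of the LJS training filelist.
N = 500  # then 1000, 2500, ...
random.seed(1234)  # reproducible subsets

with open("ljs_train_filelist.txt", encoding="utf-8") as f:
    lines = f.readlines()

with open(f"ljs_train_{N}.txt", "w", encoding="utf-8") as out:
    out.writelines(random.sample(lines, N))
```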
With LJS at 2500 samples, attention starts appearing at 85k steps. Here's 185k:
Please make sure you set the attention prior to True here: https://github.com/NVIDIA/flowtron/blob/master/config.json#L34
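If you'd rather flip it programmatically, here's a minimal sketch; note that the key name `use_attn_prior` and its location under `data_config` are my assumption from the linked line, so verify against your checkout of config.json:

```python
import json

# Enable the attention prior in flowtron's config.
with open("config.json", encoding="utf-8") as f:
    config = json.load(f)

# ASSUMPTION: flag is named "use_attn_prior" and lives in data_config;
# check config.json#L34 in your checkout.
config["data_config"]["use_attn_prior"] = True

with open("config.json", "w", encoding="utf-8") as f:
    json.dump(config, f, indent=4)
```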
That seems to have done the trick! The directions for training from scratch seem to apply to pre-trained models as well.
I'm seeing a lot of stuttering in the audio output, though. What typically causes this? More training time needed? Data issues? (sigma=0.8)
https://user-images.githubusercontent.com/70453896/111840973-7600f180-88ba-11eb-928f-dba192d5ec90.mp4