
unstable training

Open mohamad-hasan-sohan-ajini opened this issue 5 years ago • 12 comments

Hi

I used the Mozilla Common Voice dataset (the whole validated Persian subset, about 211 hours) to train the SOTA models. I used almost the same config files as the LibriSpeech SOTA recipes (with a few changes: no word pieces, the surround flag used instead, and distributed training disabled) to train the ResNet and TDS models with CTC loss. I also reduced the ResNet channels and the TDS units' hidden sizes, since the dataset is about 1/5 the size of LibriSpeech, and I removed data augmentation.

The training process goes well for a while, and then the loss/TER/WER become unstable and shoot up:

[screenshot: training curves (loss/TER/WER) for both models]

The ResNet model (orange curve) becomes unstable after 5 days of training, and the TDS model (blue curve) after about 1 day. Is there any known cause for this problem? Is the loss or the layer norm computationally unstable? How can I avoid this kind of instability?

The TDS model arch:

V -1 NFEAT 1 0
C2 1 10 21 1 2 1 -1 -1
R
DO 0.0
LN 0 1 2
TDS 10 21 80 0.05 2400
TDS 10 21 80 0.05 2400
TDS 10 21 80 0.05 2400
TDS 10 21 80 0.1 2400
TDS 10 21 80 0.1 2400
C2 10 14 21 1 2 1 -1 -1
R
DO 0.0
LN 0 1 2
TDS 14 21 80 0.15 3000
TDS 14 21 80 0.15 3000
TDS 14 21 80 0.15 3000
TDS 14 21 80 0.15 3000
TDS 14 21 80 0.15 3000
TDS 14 21 80 0.15 3000
C2 14 18 21 1 2 1 -1 -1
R
DO 0.0
LN 0 1 2
TDS 18 21 80 0.15 3600
TDS 18 21 80 0.15 3600
TDS 18 21 80 0.15 3600
TDS 18 21 80 0.15 3600
TDS 18 21 80 0.2 3600
TDS 18 21 80 0.2 3600
TDS 18 21 80 0.25 3600
TDS 18 21 80 0.25 3600
TDS 18 21 80 0.25 3600
TDS 18 21 80 0.25 3600
V 0 1440 1 0
RO 1 0 3 2
L 1440 NLABEL

and config file parameters:

--batchsize=4
--lr=0.3
--momentum=0.5
--maxgradnorm=1
--onorm=target
--sqnorm=true
--mfsc=true
--nthread=10
--criterion=ctc
--memstepsize=8338608
#--wordseparator=_
#--usewordpiece=true
--surround=|
--filterbanks=80
--gamma=0.5
#--enable_distributed=true
--iter=1500
--stepsize=200
--framesizems=30
--framestridems=10
--seed=2
--reportiters=1000

Regards.
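For reference, here is a small Python sketch of the step decay implied by --lr=0.3, --gamma=0.5 and --stepsize=200 above. It assumes, as a simplification, that the learning rate is multiplied by gamma once every stepsize epochs; check your wav2letter version's Train.cpp for whether stepsize counts epochs or updates.

# Hypothetical illustration of the step decay implied by the flags above;
# the epoch-based unit is an assumption, not taken from the wav2letter source.
def lr_at(epoch, base_lr=0.3, gamma=0.5, stepsize=200):
    return base_lr * gamma ** (epoch // stepsize)

for epoch in (0, 199, 200, 400, 1400):
    print(epoch, lr_at(epoch))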

I got the same issue. I tried tuning the learning rate, which seems to improve things, but it is still not resolved.

jkkj1630 avatar Mar 28 '20 03:03 jkkj1630

Hi, we have not seen this issue lately. As a sanity check, could you run an experiment where you filter out audio samples shorter than 1 second and samples with target length < 5?

vineelpratap avatar Mar 28 '20 17:03 vineelpratap
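A minimal sketch of such a filter, assuming the usual wav2letter list format of one sample per line ("id audio_path duration transcription"), with the duration in seconds (as in the distributions reported below; adjust min_dur if your lists store milliseconds) and a whitespace-tokenized transcription:

def filter_list(in_path, out_path, min_dur=1.0, min_target_len=5):
    # Drop samples shorter than min_dur seconds or with fewer than
    # min_target_len target tokens; keep everything else verbatim.
    kept = dropped = 0
    with open(in_path) as fin, open(out_path, "w") as fout:
        for line in fin:
            parts = line.rstrip("\n").split(" ", 3)
            if len(parts) < 4:
                dropped += 1
                continue
            _sid, _path, duration, transcription = parts
            if float(duration) < min_dur or len(transcription.split()) < min_target_len:
                dropped += 1
                continue
            fout.write(line)
            kept += 1
    print(f"kept {kept}, dropped {dropped}")

filter_list("train.lst", "train_filtered.lst")  # hypothetical file names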

Also, could you let us know the input and target size distributions: min, max, avg, stddev?

vineelpratap avatar Mar 28 '20 18:03 vineelpratap
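Those statistics can be computed directly from a list file; here is a quick sketch under the same assumed "id path duration transcription" format as above (the file name is a placeholder):

import statistics

durations, target_lens = [], []
with open("train.lst") as f:                       # hypothetical list file
    for line in f:
        parts = line.rstrip("\n").split(" ", 3)
        if len(parts) < 4:
            continue
        durations.append(float(parts[2]))          # audio duration
        target_lens.append(len(parts[3].split()))  # target token count

def describe(name, values):
    print(f"{name}: min {min(values):.3f}, max {max(values):.3f}, "
          f"avg {statistics.mean(values):.3f}, std {statistics.pstdev(values):.3f}")

describe("input size", durations)
describe("target size", target_lens)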

My sample wavs are 5-30 seconds long.

jkkj1630 avatar Mar 28 '20 23:03 jkkj1630

Also, could you let us know the input and target size distributions: min, max, avg, stddev?

Ah, I forgot to filter the Common Voice data (as I did for our own dataset) to have bounded lengths. So there are 2 samples longer than 15 seconds (19.824 and 24.864 seconds), and they may cause training instability.

The input size distribution (in seconds) is as follows: min 0.744, max 24.864, avg 3.950, std 1.534.

The target size distribution is as follows: min 2, max 197, avg 31.389, std 17.039.

But these samples have been in the dataset since the first epoch. It is weird that they make the network unstable only after about 200 epochs!

I tried training on 18,000 hours of Chinese data from many scenarios, including phone recordings, news subtitles, TTS voice-changing synthesis, standard read speech, wake-up words, multi-person conversations, and meeting recordings. The labeling accuracy of these data is > 95%. I am using https://github.com/facebookresearch/wav2letter/blob/master/recipes/models/streaming_convnets/librispeech/am_500ms_future_context.arch — do I need to make any changes to this original sample arch file?

jkkj1630 avatar Mar 30 '20 13:03 jkkj1630

@jkkj1630 I am currently training a model with the same arch file and don't see any instability yet. As the issue is not reproducible, I'll close it. But there is definitely some instability issue, because a 25-second-long audio clip should not make training unstable.

I get the same issue when training with duration-filtered files:

[screenshots: training curves showing the same instability]

The training process stops with a NaN loss:

F0424 23:20:16.014863 218 Train.cpp:564] Loss has NaN values. Samples - common_voice_fa_19219740.mp3_norm,common_voice_fa_19219740.mp3_lowgain

while both files are 4.632 seconds long. It seems the problem is agnostic to waveform duration and happens unpredictably.
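One quick way to rule out corrupted audio in the flagged samples is to load them and check for NaN/Inf values, all-zero waveforms, or extreme peaks. A hypothetical sketch follows (the paths are placeholders taken from the log line; soundfile reads wav/flac, so convert first or swap in librosa.load if the files on disk are still mp3):

import numpy as np
import soundfile as sf

flagged = [
    "common_voice_fa_19219740.mp3_norm",     # placeholder paths from the log
    "common_voice_fa_19219740.mp3_lowgain",
]

for path in flagged:
    audio, sr = sf.read(path, dtype="float64")
    print(path)
    print("  duration (s):", len(audio) / sr)
    print("  finite      :", bool(np.all(np.isfinite(audio))))
    print("  all zeros   :", not np.any(audio))
    print("  peak |x|    :", float(np.max(np.abs(audio))) if len(audio) else 0.0)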

Hey, sorry for the unrelated question: may I know how you use TensorBoard for monitoring the loss? Thank you for your answer. I also face the same problem with NaN loss values.

junaedifahmi avatar May 04 '20 19:05 junaedifahmi

@juunnn, some time ago we shared a script to convert the w2l logs into TensorBoard format here: https://github.com/facebookresearch/wav2letter/issues/528.

tlikhomanenko avatar May 05 '20 16:05 tlikhomanenko
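For reference, a minimal sketch of that idea (not the exact script from issue #528): it assumes the perf log is a whitespace-separated table whose header line starts with '#' and names the columns, and that columns such as epoch/loss/TER/WER exist under the names used below — adjust the names and path to match your wav2letter version.

from torch.utils.tensorboard import SummaryWriter

def perf_log_to_tensorboard(log_path, run_dir="runs/w2l"):
    writer = SummaryWriter(run_dir)
    header = None
    with open(log_path) as f:
        for line in f:
            parts = line.split()
            if line.startswith("#"):
                header = parts[1:]            # column names, minus the '#'
                continue
            if not header or len(parts) != len(header):
                continue                      # skip malformed lines
            row = dict(zip(header, parts))
            step = int(float(row.get("epoch", 0)))
            for col in ("loss", "train-TER", "train-WER"):  # assumed column names
                if col in row:
                    writer.add_scalar(col, float(row[col]), step)
    writer.close()

perf_log_to_tensorboard("001_perf")           # hypothetical log path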

thank you @tlikhomanenko

junaedifahmi avatar May 07 '20 18:05 junaedifahmi

The optimizer is important for training stability. SGD + gradient clipping is a very stable option.

ali-r avatar Sep 30 '21 09:09 ali-r
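As a generic illustration of that recommendation (PyTorch here, not wav2letter's internals; in wav2letter the corresponding knobs are the --momentum and --maxgradnorm flags already set in the config above):

import torch

model = torch.nn.Linear(80, 40)   # stand-in model for illustration
optimizer = torch.optim.SGD(model.parameters(), lr=0.3, momentum=0.5)

def train_step(features, targets, loss_fn):
    optimizer.zero_grad()
    loss = loss_fn(model(features), targets)
    loss.backward()
    # Clip the global gradient norm, mirroring --maxgradnorm=1
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    return loss.item()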