# Training performance drop after augmentation refactoring
I have the feeling that training performance, both duration and accuracy, got worse after the augmentation refactoring commits.
Before, training on my dataset took about 2:10 h; today it took 3:20 h, with about the same number of epochs. At the beginning one epoch takes about 5 min, but after epoch 18 they suddenly need 8 min on average. I didn't see this behaviour in the trainings two days ago.
The accuracy also got a bit worse:
Dataset | Additional Infos | Losses | Training epochs of best model | Result |
---|---|---|---|---|
Voxforge | | Test: 32.844025, Validation: 36.912005 | 14 | WER: 0.240091, CER: 0.087971 |
Voxforge | without freq_and_time_masking augmentation | Test: 33.698494, Validation: 38.071722 | 10 | WER: 0.244600, CER: 0.094577 |
Voxforge | using new audio augmentation options (AUG_AUDIO code below) | Test: 29.280865, Validation: 33.294815 | 21 | WER: 0.220538, CER: 0.079463 |
Voxforge | after refactoring | Test: 33.317413, Validation: 38.678969 | 20 | WER: 0.243480, CER: 0.088640 |
These were the options I set before:
```bash
AUG_PITCH_TEMPO="--augmentation_pitch_and_tempo_scaling \
                 --augmentation_pitch_and_tempo_scaling_min_pitch 0.98 \
                 --augmentation_pitch_and_tempo_scaling_max_pitch 1.1 \
                 --augmentation_pitch_and_tempo_scaling_max_tempo 1.2"

AUG_ADD_DROP="--data_aug_features_additive 0.2 \
              --augmentation_spec_dropout_keeprate 0.95"

AUG_FREQ_TIME="--augmentation_freq_and_time_masking True"

AUG_AUDIO="--augment reverb[p=0.1,delay=50.0~30.0,decay=10.0:2.0~1.0] \
           --augment gaps[p=0.05,n=1:3~2,size=10:100] \
           --augment resample[p=0.1,rate=12000:8000~4000] \
           --augment codec[p=0.1,bitrate=48000:16000] \
           --augment volume[p=0.1,dbfs=-10:-40]"
```
And these are the ones I used for today's run:
AUG_AUDIO="--augment volume[p=0.1,dbfs=-10:-40] \
--augment pitch[p=0.1,pitch=1.1~0.95] \
--augment tempo[p=0.1,factor=1.25~0.75]"
AUG_ADD_DROP="--augment dropout[p=0.1,rate=0.05] \
--augment add[p=0.1,domain=signal,stddev=0~0.5]"
AUG_FREQ_TIME="--augment frequency_mask[p=0.1,n=1:3,size=1:5] \
--augment time_mask[p=0.1,domain=signal,n=3:10~2,size=50:100~40]"
AUG_EXTRA="--augment reverb[p=0.1,delay=50.0~30.0,decay=10.0:2.0~1.0] \
--augment resample[p=0.1,rate=12000:8000~4000] \
--augment codec[p=0.1,bitrate=48000:16000]"
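For reference, this is how I read the new value syntax (my own summary, so treat the details as assumptions): `start:stop~radius` means the center value moves from `start` at the beginning of training to `stop` at the end, `~radius` adds a uniform random offset, and `p` is the per-sample probability of applying the augmentation at all. A minimal sketch:

```python
import random

def sample_value(start, stop, radius, progress):
    """Sample one augmentation parameter. `progress` in [0, 1] is how
    far training has advanced; the center moves linearly from `start`
    to `stop`, and `radius` adds a uniform random offset."""
    center = start + progress * (stop - start)
    return center + random.uniform(-radius, radius)

# volume dbfs=-10:-40 at the very start of training: always -10.0
print(sample_value(-10, -40, 0, 0.0))
# pitch=1.1~0.95 keeps a constant center of 1.1 but varies by +/-0.95,
# i.e. anywhere in [0.15, 2.05]:
print(sample_value(1.1, 1.1, 0.95, 0.5))
```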
I also had to reduce the batch size from 30 to 24 because I got error #3088. About two months ago I could use 36 without any problems.
I know there is a bit of randomness in the accuracy, and I did change some of the augmentation params slightly, but the change in results is bigger than expected.
@tilmankamp do you have an idea about this?
I ran another test with the code from directly before the refactoring (#188a6f2c1ee53dc79acf8abceaf729b5f9a05e7a).
This time one epoch took 4 min on average and the whole training took 1:45 h.
Dataset | Additional Infos | Losses | Training epochs of best model | Result |
---|---|---|---|---|
Voxforge | | Test: 28.846869, Validation: 32.680268 | 16 | WER: 0.225360, CER: 0.083504 |
I now used a batch size of 24 and updated the params again to better match those above:
AUG_AUDIO="--augmentation_pitch_and_tempo_scaling \
--augmentation_pitch_and_tempo_scaling_min_pitch 0.95 \
--augmentation_pitch_and_tempo_scaling_max_pitch 1.1 \
--augmentation_pitch_and_tempo_scaling_max_tempo 1.25"
AUG_ADD_DROP="--data_aug_features_additive 0.25 \
--augmentation_spec_dropout_keeprate 0.95"
AUG_FREQ_TIME="--augmentation_freq_and_time_masking True"
AUG_EXTRA="--augment reverb[p=0.1,delay=50.0~30.0,decay=10.0:2.0~1.0] \
--augment gaps[p=0.05,n=1:3~2,size=10:100] \
--augment resample[p=0.1,rate=12000:8000~4000] \
--augment codec[p=0.1,bitrate=48000:16000] \
--augment volume[p=0.1,dbfs=-10:-40]"
@DanBmh Augmentations `volume`, `gaps`, `reverb`, `codec`, `resample` and `overlay` are most likely not responsible for this discrepancy, as their implementations have not been changed during refactoring. For the others it'd be helpful to compare them one by one with their former implementations to get a better understanding of the problem. I'll do some performance tests here.
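A sketch of how such a one-by-one comparison could be scripted; the `DeepSpeech.py` entry point and its `--train_files`/`--dev_files`/`--test_files` flags are the usual training interface, but the CSV paths are placeholders for the actual setup:

```python
import subprocess

# One training run per refactored augmentation, everything else constant.
# Specs are taken from the configs above; CSV paths are placeholders.
AUGMENTATIONS = [
    "dropout[p=0.1,rate=0.05]",
    "pitch[p=0.1,pitch=1.1~0.95]",
    "tempo[p=0.1,factor=1.25~0.75]",
    "frequency_mask[p=0.1,n=1:3,size=1:5]",
    "time_mask[p=0.1,domain=signal,n=3:10~2,size=50:100~40]",
]

for aug in AUGMENTATIONS:
    subprocess.run(
        ["python", "DeepSpeech.py",
         "--train_files", "voxforge/train.csv",
         "--dev_files", "voxforge/dev.csv",
         "--test_files", "voxforge/test.csv",
         "--augment", aug],
        check=True,
    )
```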
My observations so far:
- Regarding batch size: I got the biggest difference to the old implementation when switching from the former combined `--augmentation_pitch_and_tempo_scaling` to `--augment pitch` plus `--augment tempo`. The additional memory requirement comes from the doubling of certain allocations, as the involved ops are no longer part of one augmentation sub-graph. With the refactored code I had to decrease the batch size from 38 to 35 to get it working.
- The new internal `clock` tensor has some very small overhead that should in most cases not require a batch-size adjustment.
- The way dropout is implemented now seems to require slightly more memory, as a tensor of random values is allocated that has the same size as the augmentation target (see the sketch after this list).
- At least when comparing dropout (more tests needed) there was no difference in runtime.
- Still to do: a reliable comparison of accuracy and dev-loss development per augmentation...
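To illustrate the dropout memory point: a mask-based implementation materializes a random tensor with the full shape of the spectrogram, which is extra memory the old keep-rate path may not have needed. A minimal sketch of that pattern (my reading, not the exact DeepSpeech code):

```python
import tensorflow as tf

def spec_dropout(spectrogram, rate=0.05):
    # The random tensor below has the same shape as the augmentation
    # target, which is the extra allocation mentioned above.
    keep = tf.random.uniform(tf.shape(spectrogram)) >= rate
    return spectrogram * tf.cast(keep, spectrogram.dtype)

x = tf.ones([24, 500, 160])   # batch x time x features, arbitrary sizes
print(spec_dropout(x).shape)  # (24, 500, 160)
```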
I had some time to run some more tests today (with master from about two days ago).
This time an epoch took about 4:30 min on average. I also tried different dropout values:
- With `--augment dropout[p=1,rate=0.05]`, which I thought should match `--augmentation_spec_dropout_keeprate 0.95` (did this change?), the network only learned for two epochs, so it trained almost nothing.
- `--augment dropout[p=0.5,rate=0.05]` also produced really poor results (test loss: 43.882633).
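For what it's worth, under my reading of the semantics (an assumption, since the keep-rate question above is exactly what's unclear), `p` is the chance the augmentation fires on a sample and `rate` the fraction of values dropped when it does, so the expected overall drop fraction is simply `p * rate`:

```python
def expected_dropped_fraction(p, rate):
    # p: probability the augmentation is applied to a given sample,
    # rate: fraction of spectrogram values dropped when it is applied.
    return p * rate

print(expected_dropped_fraction(1.0, 0.05))  # 0.05, i.e. keeprate 0.95
print(expected_dropped_fraction(0.5, 0.05))  # 0.025 on average, but bimodal:
                                             # half the samples stay untouched
```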
@tilmankamp Any updates on the accuracy problem?
@DanBmh -- did you ever reach a conclusion on this? Have you been running augmentation with newer releases?
Were there important changes to the augmentations in between? I didn't check.
I didn't run further tests, just the ones above. For my own trainings I still use the old version.
I might have found a reason for the accuracy problem. First, I misunderstood the augmentation flag description, so the pitch and tempo flags are not converted correctly. Second, the new `start:stop` logic could be another reason. I normally use a high training epoch number like 1000, because training is stopped by early stopping. But I assume that the `:stop` value is tied to the epochs flag, so I'm effectively using only the start values for the augmentations instead of the full range.
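A quick worked example of why this would matter, assuming the schedule linearly interpolates over the configured `--epochs` value:

```python
def scheduled_center(start, stop, epoch, total_epochs):
    # Assumed behaviour: linear interpolation from start to stop over
    # the full configured number of training epochs.
    progress = min(epoch / total_epochs, 1.0)
    return start + progress * (stop - start)

# With --epochs 1000 but early stopping around epoch 20, a spec like
# volume dbfs=-10:-40 barely moves from its start value:
print(scheduled_center(-10, -40, 20, 1000))  # -10.6, far from -40
```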
I'll try to run a test soon, but I don't believe this will also solve the slower training.
For the second problem, maybe a new flag like `augment_growth_epochs` could be helpful for a better combination with early stopping.
> For the second problem, maybe a new flag like `augment_growth_epochs` could be helpful for a better combination with early stopping.
Yeah, that could be useful. Usually for hyperparameter schedules there's a separate start/ramp-up/ramp-down/stop range, independent of the number of steps/epochs for the whole training run.
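A sketch of how a hypothetical `augment_growth_epochs` flag could decouple the ramp from the total run length, so early stopping no longer truncates the range:

```python
def scheduled_center(start, stop, epoch, growth_epochs):
    # Ramp over `growth_epochs`, then hold at `stop`, regardless of how
    # many epochs the whole run is configured for.
    progress = min(epoch / growth_epochs, 1.0)
    return start + progress * (stop - start)

# With a hypothetical augment_growth_epochs=15, dbfs=-10:-40 covers its
# full range before early stopping kicks in around epoch 20:
print(scheduled_center(-10, -40, 20, 15))  # -40.0
```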