# Training performance drop after augmentation refactoring
I have the feeling that training performance, both duration and accuracy, got worse after the augmentation refactoring commits.
Before, training on my dataset took about 2:10 h; today it took 3:20 h, with about the same number of epochs. At the beginning one epoch takes about 5 min, but after epoch 18 they suddenly need 8 min on average. I didn't see this behaviour in the trainings two days ago.
The accuracy also got a bit worse:
Dataset | Additional Infos | Losses | Training epochs of best model | Result |
---|---|---|---|---|
Voxforge | | Test: 32.844025, Validation: 36.912005 | 14 | WER: 0.240091, CER: 0.087971 |
Voxforge | without freq_and_time_masking augmentation | Test: 33.698494, Validation: 38.071722 | 10 | WER: 0.244600, CER: 0.094577 |
Voxforge | using new audio augmentation options (AUG_AUDIO code below) | Test: 29.280865, Validation: 33.294815 | 21 | WER: 0.220538, CER: 0.079463 |
Voxforge | after refactoring | Test: 33.317413, Validation: 38.678969 | 20 | WER: 0.243480, CER: 0.088640 |
These were the options I set before:
```bash
AUG_PITCH_TEMPO="--augmentation_pitch_and_tempo_scaling \
                 --augmentation_pitch_and_tempo_scaling_min_pitch 0.98 \
                 --augmentation_pitch_and_tempo_scaling_max_pitch 1.1 \
                 --augmentation_pitch_and_tempo_scaling_max_tempo 1.2"

AUG_ADD_DROP="--data_aug_features_additive 0.2 \
              --augmentation_spec_dropout_keeprate 0.95"

AUG_FREQ_TIME="--augmentation_freq_and_time_masking True"

AUG_AUDIO="--augment reverb[p=0.1,delay=50.0~30.0,decay=10.0:2.0~1.0] \
           --augment gaps[p=0.05,n=1:3~2,size=10:100] \
           --augment resample[p=0.1,rate=12000:8000~4000] \
           --augment codec[p=0.1,bitrate=48000:16000] \
           --augment volume[p=0.1,dbfs=-10:-40]"
```
And these are the ones I used for today's run:
AUG_AUDIO="--augment volume[p=0.1,dbfs=-10:-40] \
--augment pitch[p=0.1,pitch=1.1~0.95] \
--augment tempo[p=0.1,factor=1.25~0.75]"
AUG_ADD_DROP="--augment dropout[p=0.1,rate=0.05] \
--augment add[p=0.1,domain=signal,stddev=0~0.5]"
AUG_FREQ_TIME="--augment frequency_mask[p=0.1,n=1:3,size=1:5] \
--augment time_mask[p=0.1,domain=signal,n=3:10~2,size=50:100~40]"
AUG_EXTRA="--augment reverb[p=0.1,delay=50.0~30.0,decay=10.0:2.0~1.0] \
--augment resample[p=0.1,rate=12000:8000~4000] \
--augment codec[p=0.1,bitrate=48000:16000]"
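For reference, this is how I read the new value syntax (my own summary, so treat the details as assumptions): `start:stop~radius` means the center value moves from `start` at the beginning of training to `stop` at the end, `~radius` adds a uniform random offset, and `p` is the per-sample probability of applying the augmentation at all. A minimal sketch:

```python
import random

def sample_value(start, stop, radius, progress):
    """Sample one augmentation parameter. `progress` in [0, 1] is how
    far training has advanced; the center moves linearly from `start`
    to `stop`, and `radius` adds a uniform random offset."""
    center = start + progress * (stop - start)
    return center + random.uniform(-radius, radius)

# volume dbfs=-10:-40 at the very start of training: always -10.0
print(sample_value(-10, -40, 0, 0.0))
# pitch=1.1~0.95 keeps a constant center of 1.1 but varies by +/-0.95,
# i.e. anywhere in [0.15, 2.05]:
print(sample_value(1.1, 1.1, 0.95, 0.5))
```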
I also had to reduce the batch size from 30 to 24 because I got error #3088. About two months ago I could use 36 without any problems.
I know there is a bit of randomness in the accuracy, and I did change some of the augmentation params slightly, but the change in results is bigger than expected.
@tilmankamp do you have an idea about this?
I ran another test with the code from directly before the refactoring (#188a6f2c1ee53dc79acf8abceaf729b5f9a05e7a).
This time one epoch took 4 min on average and the whole training took 1:45 h.
Dataset | Additional Infos | Losses | Training epochs of best model | Result |
---|---|---|---|---|
Voxforge | | Test: 28.846869, Validation: 32.680268 | 16 | WER: 0.225360, CER: 0.083504 |
I now used a batch size of 24 and updated the params again to better match those above:
AUG_AUDIO="--augmentation_pitch_and_tempo_scaling \
--augmentation_pitch_and_tempo_scaling_min_pitch 0.95 \
--augmentation_pitch_and_tempo_scaling_max_pitch 1.1 \
--augmentation_pitch_and_tempo_scaling_max_tempo 1.25"
AUG_ADD_DROP="--data_aug_features_additive 0.25 \
--augmentation_spec_dropout_keeprate 0.95"
AUG_FREQ_TIME="--augmentation_freq_and_time_masking True"
AUG_EXTRA="--augment reverb[p=0.1,delay=50.0~30.0,decay=10.0:2.0~1.0] \
--augment gaps[p=0.05,n=1:3~2,size=10:100] \
--augment resample[p=0.1,rate=12000:8000~4000] \
--augment codec[p=0.1,bitrate=48000:16000] \
--augment volume[p=0.1,dbfs=-10:-40]"
@DanBmh Augmentations `volume`, `gaps`, `reverb`, `codec`, `resample` and `overlay` are most likely not responsible for this discrepancy, as their implementations have not been changed during refactoring. For the others it'd be helpful to compare them one by one with their former implementations to get a better understanding of the problem. I'll do some performance tests here.
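A sketch of how such a one-by-one comparison could be scripted; the `DeepSpeech.py` entry point and its `--train_files`/`--dev_files`/`--test_files` flags are the usual training interface, but the CSV paths are placeholders for the actual setup:

```python
import subprocess

# One training run per refactored augmentation, everything else constant.
# Specs are taken from the configs above; CSV paths are placeholders.
AUGMENTATIONS = [
    "dropout[p=0.1,rate=0.05]",
    "pitch[p=0.1,pitch=1.1~0.95]",
    "tempo[p=0.1,factor=1.25~0.75]",
    "frequency_mask[p=0.1,n=1:3,size=1:5]",
    "time_mask[p=0.1,domain=signal,n=3:10~2,size=50:100~40]",
]

for aug in AUGMENTATIONS:
    subprocess.run(
        ["python", "DeepSpeech.py",
         "--train_files", "voxforge/train.csv",
         "--dev_files", "voxforge/dev.csv",
         "--test_files", "voxforge/test.csv",
         "--augment", aug],
        check=True,
    )
```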
My observations so far:
- Regarding batch size: I got the biggest difference to the old implementation when switching from the former combined `--augmentation_pitch_and_tempo_scaling` to `--augment pitch` plus `--augment tempo`. The additional memory requirement comes from the doubling of certain allocations, as the involved ops are no longer part of one augmentation sub-graph. With the refactored code I had to decrease the batch size from 38 to 35 to get it working.
- The new internal `clock` tensor has some very small overhead that should in most cases not require a batch-size adjustment.
- The way dropout is implemented now seems to require slightly more memory, as a tensor of random values is allocated that has the same size as the augmentation target (see the sketch after this list).
- At least when comparing dropout (more tests needed) there was no difference in runtime.
- Still to do: a reliable comparison of accuracy and dev-loss development per augmentation...
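To illustrate the dropout memory point: a mask-based implementation materializes a random tensor with the full shape of the spectrogram, which is extra memory the old keep-rate path may not have needed. A minimal sketch of that pattern (my reading, not the exact DeepSpeech code):

```python
import tensorflow as tf

def spec_dropout(spectrogram, rate=0.05):
    # The random tensor below has the same shape as the augmentation
    # target, which is the extra allocation mentioned above.
    keep = tf.random.uniform(tf.shape(spectrogram)) >= rate
    return spectrogram * tf.cast(keep, spectrogram.dtype)

x = tf.ones([24, 500, 160])   # batch x time x features, arbitrary sizes
print(spec_dropout(x).shape)  # (24, 500, 160)
```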
I had some time to run some more tests today (with master from about two days ago).
This time an epoch took about 4:30 min on average. I also tried different dropout values:
- With `--augment dropout[p=1,rate=0.05]`, which I thought should match `--augmentation_spec_dropout_keeprate 0.95` (did this change?), the network only learned for two epochs, so it trained almost nothing.
- `--augment dropout[p=0.5,rate=0.05]` also produced really poor results (test loss: 43.882633).
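For what it's worth, under my reading of the semantics (an assumption, since the keep-rate question above is exactly what's unclear), `p` is the chance the augmentation fires on a sample and `rate` the fraction of values dropped when it does, so the expected overall drop fraction is simply `p * rate`:

```python
def expected_dropped_fraction(p, rate):
    # p: probability the augmentation is applied to a given sample,
    # rate: fraction of spectrogram values dropped when it is applied.
    return p * rate

print(expected_dropped_fraction(1.0, 0.05))  # 0.05, i.e. keeprate 0.95
print(expected_dropped_fraction(0.5, 0.05))  # 0.025 on average, but bimodal:
                                             # half the samples stay untouched
```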
@tilmankamp Any updates on the accuracy problem?
@DanBmh -- did you ever reach a conclusion on this? Have you been running augmentation with newer releases?
Were there important changes to the augmentations in between? I didn't check.
I didn't run further tests, just the ones above. For my own trainings I still use the old version.
I might have found a reason for the accuracy problem. First, I misunderstood the augmentation flag description, so the pitch and tempo flags are not converted correctly. Second, the new `start:stop` logic could be another reason. I normally use a high training epoch number like 1000, because training is stopped by early stopping. But I assume that the `:stop` value is tied to the epochs flag, so I'm effectively using only the start values for the augmentations instead of the full range.
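A quick worked example of why this would matter, assuming the schedule linearly interpolates over the configured `--epochs` value:

```python
def scheduled_center(start, stop, epoch, total_epochs):
    # Assumed behaviour: linear interpolation from start to stop over
    # the full configured number of training epochs.
    progress = min(epoch / total_epochs, 1.0)
    return start + progress * (stop - start)

# With --epochs 1000 but early stopping around epoch 20, a spec like
# volume dbfs=-10:-40 barely moves from its start value:
print(scheduled_center(-10, -40, 20, 1000))  # -10.6, far from -40
```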
I'll try to run a test soon, but I don't believe this will also solve the slower training.
For the second problem, maybe a new flag like `augment_growth_epochs` could be helpful for a better combination with early stopping.
> For the second problem, maybe a new flag like `augment_growth_epochs` could be helpful for a better combination with early stopping.
Yeah, that could be useful. Usually for hyperparameter schedules there's a separate start/ramp-up/ramp-down/stop range, independent of the number of steps/epochs for the whole training run.
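A sketch of how a hypothetical `augment_growth_epochs` flag could decouple the ramp from the total run length, so early stopping no longer truncates the range:

```python
def scheduled_center(start, stop, epoch, growth_epochs):
    # Ramp over `growth_epochs`, then hold at `stop`, regardless of how
    # many epochs the whole run is configured for.
    progress = min(epoch / growth_epochs, 1.0)
    return start + progress * (stop - start)

# With a hypothetical augment_growth_epochs=15, dbfs=-10:-40 covers its
# full range before early stopping kicks in around epoch 20:
print(scheduled_center(-10, -40, 20, 15))  # -40.0
```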