
Speech recognition reproducibility

Open Bobrosoft98 opened this issue 6 years ago • 15 comments

Hi,

I am having trouble reproducing the speech recognition results. With the default settings, the model stagnates at 25% train accuracy. By switching to a different optimizer, increasing the batch size, and tuning the learning rate, I was able to reach 8% WER, but that is still far from the ~5% WER that is claimed to be achievable without any tuning.

Could you please provide additional info about your configuration (the model and number of GPUs, the total batch size), or even better: logs and/or model checkpoints?

Thank you.

Bobrosoft98 avatar Sep 17 '19 11:09 Bobrosoft98

@okhonko

huihuifan avatar Sep 17 '19 11:09 huihuifan

Hi,

I'm having similar results on 1 GPU for a different dataset. Could you share with us the parameters you used to improve the results?

Thank you

carlosep93 avatar Sep 19 '19 07:09 carlosep93

Hi, I was having similar issues but was able to do better with the default settings on one GPU by simulating the larger batch size with --update-freq 16.
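
In other words, the repo's example training command unchanged, plus one flag; schematically (copy the actual base flags from examples/speech_recognition/README.md rather than from here):

```
# $REPO_FLAGS = whatever the README specifies (task, arch, criterion, optimizer, ...)
python train.py $DATA_DIR $REPO_FLAGS \
  --max-tokens 5000 \
  --update-freq 16   # accumulate gradients over 16 batches before each optimizer step
```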

alexbie98 avatar Sep 23 '19 20:09 alexbie98

@alexbie98 I actually used this parameter when training on 1 GPU, and it didn't help. Can you elaborate on "do better"? Did you replicate the paper's WER?

@carlosep93 My parameters were: --optimizer adam --lr 5e-4 --fp16 --memory-efficient-fp16 --warmup-updates 2500 --update-freq 4

I also changed the batching logic to pack as much data onto each GPU as possible, resulting in an average batch size of 670 across all 8 GPUs. Only after that did it start training properly.
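
Put together, the run looked roughly like this ($REPO_FLAGS stands for the example's task/arch/criterion flags, minus the optimizer settings overridden here; depending on the fairseq version, --warmup-updates may also need an --lr-scheduler that implements warmup, e.g. inverse_sqrt):

```
# 8-GPU run; note that the batching code itself was also modified,
# so these flags alone won't reproduce the ~670-utterance batches
python train.py $DATA_DIR $REPO_FLAGS \
  --optimizer adam --lr 5e-4 --warmup-updates 2500 \
  --fp16 --memory-efficient-fp16 \
  --update-freq 4
```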

Bobrosoft98 avatar Sep 24 '19 18:09 Bobrosoft98

Right now it's at 96% train acc / 91.7% valid acc after training for 5 days (epoch 31). I have not yet matched the reported WER; I'm getting 9.9 with the current checkpoint. The loss/accuracy plateaus for a while before the loss drops quite low.

https://i.imgur.com/XBL1TZo.png

alexbie98 avatar Sep 24 '19 18:09 alexbie98

Wow, that looks nice! What batch size do you have? Also, could you share the accuracy plot?

Bobrosoft98 avatar Sep 24 '19 19:09 Bobrosoft98

https://i.imgur.com/dKadcXq.png

The effective batch size is 80k. My training command is the same as the one in the repo with --update-freq 16

alexbie98 avatar Sep 24 '19 20:09 alexbie98

Thanks for providing the plot! Are you sure about 80k? I think the whole LibriSpeech train set has around 200k utterances, which would mean only about 3 batches per epoch in your case.

Bobrosoft98 avatar Sep 24 '19 23:09 Bobrosoft98

Sorry, I meant 80k tokens*. Using the default command's --max-tokens 5000 with --update-freq 16, the average number of sentences is around 60.
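
To spell out the arithmetic:

```
# effective tokens per optimizer step = --max-tokens x --update-freq
echo $(( 5000 * 16 ))   # 80000
```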

alexbie98 avatar Sep 25 '19 14:09 alexbie98

> https://i.imgur.com/dKadcXq.png
> The effective batch size is 80k. My training command is the same as the one in the repo with --update-freq 16

Sorry for the off-topic reply, but could you share how you plot the training accuracy?

edosyhptra avatar Apr 18 '21 14:04 edosyhptra

> Sorry for the off-topic reply, but could you share how you plot the training accuracy?

If I recall correctly, specifying a directory via --tensorboard-logdir will generate these plots, viewable from tensorboard. I haven't used this in a while though.
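
Roughly like this (depending on the fairseq version, you may also need the tensorboardX package installed for the event files to actually get written):

```
# during training: write tensorboard event files next to the checkpoints
python train.py $DATA_DIR $REPO_FLAGS --tensorboard-logdir $SAVE_DIR/tb

# then view the training/validation curves in a browser
tensorboard --logdir $SAVE_DIR/tb
```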

alexbie98 avatar Apr 19 '21 03:04 alexbie98

This issue has been automatically marked as stale. If this issue is still affecting you, please leave any comment (for example, "bump"), and we'll keep it open. We are sorry that we haven't been able to prioritize it yet. If you have any new additional information, please include it with your comment!

stale[bot] avatar Jul 21 '21 00:07 stale[bot]

@alexbie98 do you still have the code/command you used for that run?

itsmekhoathekid avatar Mar 01 '25 05:03 itsmekhoathekid

I don't have the code, but I have notes on the hyperparameters: adadelta with lr=1.0; --update-freq=16 with 5k max tokens (i.e. 80k effective tokens per update); dropout=0.15; gradient clipping=10.0.
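
Mapped onto flags, that would be roughly the following ($REPO_FLAGS = the example's task/arch/criterion flags; the dropout flag in particular depends on the model definition, so treat it as approximate):

```
# reconstructed from notes, not the original script
python train.py $DATA_DIR $REPO_FLAGS \
  --optimizer adadelta --lr 1.0 \
  --clip-norm 10.0 \
  --max-tokens 5000 --update-freq 16 \
  --dropout 0.15   # flag name may differ for this architecture
```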

alexbie98 avatar Mar 03 '25 05:03 alexbie98

Damn, I've reduced the number of params, tried different optimizers and a bigger batch size, and ran into overfitting lol

itsmekhoathekid avatar Mar 03 '25 16:03 itsmekhoathekid