                        Can't reproduce pretraining results for Wav2vec2 using LibriSpeech recipe
Describe the bug
Hello, I am pretraining Wav2vec2 following the instructions on this page. The pretraining went very smoothly (thank you for that!!). However, when I compared my training logs with the ones published here, I found that my model finished the 400K steps in only 25 epochs (using 8 A100 GPUs) with a lower accuracy (~60%), as opposed to 700 epochs and an accuracy of around 68% for the example checkpoint. Also, my training finished within only 2 days, which is confusing.
Expected behaviour
I expect training performance similar to what is shown in the published training logs.
To Reproduce
Below is the training log for my model, which differs from the example one here:
```
epoch: 1, steps: 18223, lr: 3.04e-04 - train loss: 2.13e+04 - valid loss: 2.37e+03, valid accuracy: 0.35966309905052185
epoch: 2, steps: 36446, lr: 4.51e-04 - train loss: 1.66e+04 - valid loss: 2.07e+03, valid accuracy: 0.4265761077404022
epoch: 3, steps: 54669, lr: 4.28e-04 - train loss: 1.53e+04 - valid loss: 1.91e+03, valid accuracy: 0.46357864141464233
epoch: 4, steps: 72892, lr: 4.05e-04 - train loss: 1.47e+04 - valid loss: 1.83e+03, valid accuracy: 0.485844224691391
epoch: 5, steps: 91115, lr: 3.83e-04 - train loss: 1.43e+04 - valid loss: 1.77e+03, valid accuracy: 0.4983205795288086
epoch: 6, steps: 109338, lr: 3.60e-04 - train loss: 1.39e+04 - valid loss: 1.73e+03, valid accuracy: 0.5098890066146851
epoch: 7, steps: 127561, lr: 3.38e-04 - train loss: 1.36e+04 - valid loss: 1.68e+03, valid accuracy: 0.5208209753036499
epoch: 8, steps: 145784, lr: 3.15e-04 - train loss: 1.33e+04 - valid loss: 1.63e+03, valid accuracy: 0.5316717028617859
epoch: 9, steps: 164007, lr: 2.93e-04 - train loss: 1.30e+04 - valid loss: 1.59e+03, valid accuracy: 0.5391503572463989
epoch: 10, steps: 182230, lr: 2.70e-04 - train loss: 1.26e+04 - valid loss: 1.56e+03, valid accuracy: 0.5474251508712769
epoch: 11, steps: 200453, lr: 2.47e-04 - train loss: 1.23e+04 - valid loss: 1.52e+03, valid accuracy: 0.5530011653900146
epoch: 12, steps: 218676, lr: 2.25e-04 - train loss: 1.21e+04 - valid loss: 1.49e+03, valid accuracy: 0.5636028051376343
epoch: 13, steps: 236899, lr: 2.02e-04 - train loss: 1.18e+04 - valid loss: 1.46e+03, valid accuracy: 0.5687620639801025
epoch: 14, steps: 255122, lr: 1.80e-04 - train loss: 1.17e+04 - valid loss: 1.45e+03, valid accuracy: 0.5734390020370483
epoch: 15, steps: 273345, lr: 1.57e-04 - train loss: 1.15e+04 - valid loss: 1.43e+03, valid accuracy: 0.5788331031799316
epoch: 16, steps: 291568, lr: 1.34e-04 - train loss: 1.13e+04 - valid loss: 1.42e+03, valid accuracy: 0.5822480916976929
epoch: 17, steps: 309791, lr: 1.12e-04 - train loss: 1.12e+04 - valid loss: 1.40e+03, valid accuracy: 0.586252748966217
epoch: 18, steps: 328014, lr: 8.92e-05 - train loss: 1.11e+04 - valid loss: 1.39e+03, valid accuracy: 0.5907050967216492
epoch: 19, steps: 346237, lr: 6.66e-05 - train loss: 1.10e+04 - valid loss: 1.37e+03, valid accuracy: 0.596407413482666
epoch: 20, steps: 364460, lr: 4.41e-05 - train loss: 1.08e+04 - valid loss: 1.36e+03, valid accuracy: 0.5983026623725891
epoch: 21, steps: 382683, lr: 2.15e-05 - train loss: 1.08e+04 - valid loss: 1.34e+03, valid accuracy: 0.6026105880737305
epoch: 22, steps: 400000, lr: 0.00e+00 - train loss: 1.07e+04 - valid loss: 1.34e+03, valid accuracy: 0.6060941815376282
epoch: 23, steps: 400000, lr: 0.00e+00 - train loss: 0.00e+00 - valid loss: 1.33e+03, valid accuracy: 0.6060227155685425
epoch: 24, steps: 400000, lr: 0.00e+00 - train loss: 0.00e+00 - valid loss: 1.33e+03, valid accuracy: 0.6051703691482544
epoch: 25, steps: 400000, lr: 0.00e+00 - train loss: 0.00e+00 - valid loss: 1.33e+03, valid accuracy: 0.6063333749771118
```
Environment Details
I am using Python 3.11 and SpeechBrain 1.0.
Relevant Log Output
No response
Additional Context
No response
Hello @GasserElbanna, thanks a lot for opening this issue!
Could you please @TParcollet and/or @salah-zaiem have a look? Thanks a lot :)
Hi, it's important that the total batch size corresponds to roughly 1.6h of speech. You can adjust this by changing the gradient accumulation factor.
Hello, thank you for the quick response. I used the default config file for pre-training, so I am assuming these are the parameters below that I need to adjust?
Dynamic Batching parameters:
```yaml
max_batch_length: 200 # Fits in a 32GB GPUs (V100)
num_buckets: 70
shuffle: True # if true re-creates batches at each epoch shuffling examples.
batch_ordering: random
```
@Adel-Moumen I see that the gradient accumulation factor is missing from this recipe. Could you add it? (No need for a PR imho, push directly to develop.)
@GasserElbanna have a look at any other ASR yaml in the libri folder, you will find the gradient accumulation factor param. Just copy and paste it into this yaml, anywhere. Then play with grad accum / max batch len to make sure that you have 1.2-1.6h of speech per batch: grad_accum * max_batch_len * nb_gpus ≈ 1.6h.
Also, your A100s can certainly accommodate more than 200s per batch.
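For illustration, here is a quick back-of-the-envelope check of that formula (my own sketch, not part of the recipe; the helper name is made up):
```python
# Sanity check of grad_accum * max_batch_len * nb_gpus (illustrative only,
# not part of the SpeechBrain recipe). max_batch_length is in seconds of speech.
def effective_batch_hours(grad_accum: int, max_batch_length_s: float, n_gpus: int) -> float:
    """Hours of speech seen per optimizer step."""
    return grad_accum * max_batch_length_s * n_gpus / 3600.0

print(effective_batch_hours(1, 200, 8))  # default yaml on 8 GPUs -> ~0.44 h (too small)
print(effective_batch_hours(2, 400, 8))  # adjusted values        -> ~1.78 h (in range)
```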
> @Adel-Moumen I see that the gradient accumulation factor is missing from this recipe. Could you add it? (No need for a PR imho, push directly to develop.)
Why would it be missing? By default, grad_accumulation_factor is set to 1 (see: https://github.com/speechbrain/speechbrain/blob/develop/speechbrain/core.py#L84). The var is used in each fit_batch call (see: https://github.com/speechbrain/speechbrain/blob/develop/speechbrain/core.py#L1199). As grad_accumulation_factor can also be set through a flag (see: https://github.com/speechbrain/speechbrain/blob/develop/speechbrain/core.py#L422-L426), the recipe is technically not missing this feature. You just need to play with --grad_accumulation_factor=N, where N is the number of grad accumulation steps.
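For readers unsure what this factor does, here is a minimal conceptual sketch (this is not SpeechBrain's actual fit_batch, just an illustration of gradient accumulation):
```python
# Conceptual sketch of gradient accumulation (not SpeechBrain's fit_batch):
# gradients from N consecutive batches are summed before a single optimizer
# step, so the effective batch is N times larger than what fits on one GPU.
def train_with_accumulation(model, optimizer, loss_fn, batches, grad_accumulation_factor=2):
    optimizer.zero_grad()
    for i, (inputs, targets) in enumerate(batches, start=1):
        loss = loss_fn(model(inputs), targets) / grad_accumulation_factor
        loss.backward()                       # gradients accumulate in .grad
        if i % grad_accumulation_factor == 0:
            optimizer.step()                  # one update every N batches
            optimizer.zero_grad()
```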
Hi, thanks @TParcollet for the explanation, it's clearer now. Thanks @Adel-Moumen for pointing out the flag.
I am currently pretraining with --grad_accumulation_factor=2 and max_batch_length=400 on 8 GPUs, yielding 2 * 400 * 8 = 6400 seconds (~1.8h) per batch.
Here's the logs for the first epoch: epoch: 1, steps: 4611, lr: 7.68e-05 - train loss: 4.84e+04 - valid loss: 2.86e+03, valid accuracy: 0.26230588555336
Seems to be similar to our model checkpoint. Note that you have now done "only" 4611 steps during your first epoch, meaning that the training will run for much longer. I do expect that you'll get better results.
BTW, are you using --precision=fp16 for the pre-training?
I am using fp32 now.
fp16 or bf16 would make the training much faster if you have a compatible GPU.
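To illustrate what mixed precision buys, here is a generic PyTorch autocast sketch (not SpeechBrain's internals; in SpeechBrain you would just pass --precision=fp16 or --precision=bf16):
```python
import torch

# Generic PyTorch autocast example (not SpeechBrain's internals): on an A100,
# running the forward pass in bf16 is typically much faster than fp32.
model = torch.nn.Linear(1024, 1024).cuda()
x = torch.randn(8, 1024, device="cuda")
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    y = model(x)      # matmuls run in bf16
print(y.dtype)        # torch.bfloat16
```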
Hi @TParcollet @Adel-Moumen, I am just following up on an issue posted here related to training Wav2Vec 2.0 with multiple GPUs using torchrun. I was wondering if you have any input.