fairseq
Gradient overflow when running the example of Adaptive Input Representations
❓ Questions and Help
What is your question?
Got inf loss and gradient overflow when running the code example of adaptive input representations with --fp16. I am trying to reproduce the results of Baevski and Auli (2018), and the code example provided by fairseq works fine with fp32. However, the model does not train well when I use fp16 to reduce the training time, as in Baevski and Auli (2018). Are there any tips for preventing the inf loss?
Code
Almost the same as the command in this link, except for the added fp16 argument:
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python train.py \
--task language_modeling \
$DEST/data-bin/wikitext-103 \
--save-dir $DEST/results/debug \
--arch transformer_lm_wiki103 \
--max-update 286000 --lr 1.0 --t-mult 2 --lr-period-updates 270000 --lr-scheduler cosine --lr-shrink 0.75 \
--warmup-updates 16000 --warmup-init-lr 1e-07 --stop-min-lr 1e-09 --optimizer nag --min-lr 0.0001 --clip-norm 0.1 \
--criterion adaptive_loss --max-tokens 3072 --update-freq 3 --tokens-per-sample 3072 --seed 1 \
--sample-break-mode none --skip-invalid-size-inputs-valid-test --ddp-backend=legacy_ddp \
--fp16
Results:
...
2022-03-21 14:38:24 | INFO | fairseq.tasks.text_to_speech | Please install tensorboardX: pip install tensorboardX
2022-03-21 14:38:36 | INFO | fairseq.trainer | NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 64.0
epoch 001: 0%| | 1/1401 [00:14<5:43:19, 14.71s/it]
2022-03-21 14:38:42 | INFO | fairseq.trainer | NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 32.0
epoch 001: 0%|▏ | 2/1401 [00:21<3:48:24, 9.80s/it]
2022-03-21 14:38:49 | INFO | fairseq.trainer | NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 16.0
epoch 001: 4%|███▎ | 53/1401 [01:06<14:50, 1.51it/s]
2022-03-21 14:39:29 | INFO | fairseq.trainer | NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 8.0
epoch 001: 13%|▏| 185/1401 [02:34<13:29, 1.50it/s, loss=inf, ppl=inf, wps=108973, ups=1.48, wpb=73714.1, bsz=24, num_updates=100, lr=0.00
What's your environment?
- fairseq Version (e.g., 1.0 or main): main (1.0.0a0+b554f5e)
- PyTorch Version (e.g., 1.0): 1.8.0+cu101
- OS (e.g., Linux): Linux
- How you installed fairseq (pip, source): source
- Build command you used (if compiling from source): pip install --editable ./
- Python version: 3.7
- CUDA/cuDNN version: cu101
- GPU models and configuration:
- Any other relevant information:
Hi @ghrua, have you found a solution? I am facing the same problem...
Hi @jamfly, yes, I think there are two solutions:
- Trace which operation overflows, step by step, and address it directly.
- Pretrain the ADP model for 3 epochs in FP32, then load those parameters when training with FP16. Be sure to reset the optimizer when loading the FP32 checkpoint under the FP16 setting.
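For the first option, one way to localize the overflow is to register backward hooks on each parameter and flag any gradient containing inf/nan. This is a generic PyTorch sketch, not fairseq-specific; the helper name and the toy demo are my own:

```python
import torch

def attach_overflow_hooks(model):
    """Flag parameters whose gradients contain inf/nan during backward."""
    flagged = []
    for name, param in model.named_parameters():
        def hook(grad, name=name):
            # torch.isfinite is False for both inf and nan entries
            if not torch.isfinite(grad).all():
                flagged.append(name)
            return grad
        param.register_hook(hook)
    return flagged

# Tiny demo: huge activations make the weight gradient overflow to inf,
# while the bias gradient stays finite.
model = torch.nn.Linear(4, 2)
flagged = attach_overflow_hooks(model)
out = model(torch.full((1, 4), 1e38))
(out.sum() * 1e38).backward()
print(flagged)  # ['weight']
```

Inspecting which parameters get flagged first (e.g., the adaptive softmax head vs. attention layers) narrows down where the FP16 range is being exceeded.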
The second one is somewhat hacky, but it works for me.
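A rough sketch of that two-stage recipe follows. The paths, save directories, and the 3-epoch cutoff are illustrative; `--restore-file`, `--reset-optimizer`, `--reset-lr-scheduler`, and `--reset-meters` are standard fairseq flags, though exact behavior may vary across versions:

```shell
# Stage 1: warm up in FP32 for a few epochs (same flags as the command
# above, minus --fp16), saving checkpoints to a separate directory.
python train.py $DEST/data-bin/wikitext-103 \
    --task language_modeling --arch transformer_lm_wiki103 \
    --save-dir $DEST/results/fp32_warmup --max-epoch 3
    # ... remaining hyper-parameters as in the command above ...

# Stage 2: restart in FP16 from the FP32 weights, resetting optimizer
# state so stale FP32 momentum does not carry over into the FP16 run.
python train.py $DEST/data-bin/wikitext-103 \
    --task language_modeling --arch transformer_lm_wiki103 \
    --save-dir $DEST/results/fp16_main \
    --restore-file $DEST/results/fp32_warmup/checkpoint3.pt \
    --reset-optimizer --reset-lr-scheduler --reset-meters \
    --fp16
    # ... remaining hyper-parameters as in the command above ...
```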
Hi @ghrua, thank you for your suggestions; I will at least try the second one. I have two questions about the training:
- Have you tried fp16 from scratch? Did the inf loss return to a normal scale?
- I noticed that you set update-freq to 3, but according to the paper they use tokens-per-sample 4096 with 8 GPUs. I know they say they changed to 3072 for better performance, but is update-freq always set to 3?
Thank you in advance.
- Have you tried fp16 from scratch? Did the inf loss return to a normal scale?
Yes, I have tried FP16 from scratch with many hyper-parameter settings, e.g., different values of warmup updates and clip norm, but none of them worked for me.
- I noticed that you set update-freq to 3, but according to the paper they use tokens-per-sample 4096 with 8 GPUs. I know they say they changed to 3072 for better performance, but is update-freq always set to 3?
In the batching section, the authors state, "This gives an effective batch size of 65K tokens for WIKITEXT-103." Since 65,000 / 8 / 3072 is around 2.6, I think that is why they set update-freq to 3.
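The arithmetic can be checked directly. Reading 65K as 2^16 = 65,536 is my assumption; the paper only says "65K":

```python
import math

# Target effective batch from the paper: ~65K tokens per update.
effective_tokens = 65_536  # assuming 65K means 2**16
gpus = 8
tokens_per_gpu = 3072      # --max-tokens in the command above

# Rounding up gives the smallest update-freq reaching the target.
update_freq = math.ceil(effective_tokens / (gpus * tokens_per_gpu))
print(update_freq)  # 3 (65536 / 24576 is about 2.67)
```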
Hi @ghrua, I got it: they use 4096 (tokens per sample) * 8 (GPUs) * 2 (update-freq). Anyway, thank you for your kind help and suggestions, I really appreciate it.
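Those figures do multiply out to the paper's 65K effective batch:

```python
tokens_per_sample = 4096
gpus = 8
update_freq = 2

# Effective tokens per optimizer update across all GPUs.
print(tokens_per_sample * gpus * update_freq)  # 65536, i.e. the 65K in the paper
```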
Can you replicate the results in the paper? I ran the same recipe as yours and got a test ppl of 29.14, but the paper reports 18.7.
Hi @Psycoy, sorry for the late reply. Mine was 19.7; close to 18.7, but there is still a gap. Can you reproduce their results?
Yes, I can, as long as the update frequency is set correctly according to the number of GPUs and the batch size.
Which script did you use to evaluate your model?