fairseq
Gradient overflow when running the example of Adaptive Input Representations
❓ Questions and Help
What is your question?
Got inf loss and gradient overflow when running the code example of adaptive input representations with --fp16. I am trying to reproduce the results of Baevski and Auli (2018), and the code example provided by fairseq works fine with fp32. However, the model does not train well when I use fp16 to reduce the training time, as in Baevski and Auli (2018). Are there any tips for preventing the inf loss?
Code
Almost the same as the command in this link, except for the added fp16 argument:
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python train.py \
--task language_modeling \
$DEST/data-bin/wikitext-103 \
--save-dir $DEST/results/debug \
--arch transformer_lm_wiki103 \
--max-update 286000 --lr 1.0 --t-mult 2 --lr-period-updates 270000 --lr-scheduler cosine --lr-shrink 0.75 \
--warmup-updates 16000 --warmup-init-lr 1e-07 --stop-min-lr 1e-09 --optimizer nag --min-lr 0.0001 --clip-norm 0.1 \
--criterion adaptive_loss --max-tokens 3072 --update-freq 3 --tokens-per-sample 3072 --seed 1 \
--sample-break-mode none --skip-invalid-size-inputs-valid-test --ddp-backend=legacy_ddp \
--fp16
Results:
...
2022-03-21 14:38:24 | INFO | fairseq.tasks.text_to_speech | Please install tensorboardX: pip install tensorboardX
2022-03-21 14:38:36 | INFO | fairseq.trainer | NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 64.0
epoch 001: 0%| | 1/1401 [00:14<5:43:19, 14.71s/it]
2022-03-21 14:38:42 | INFO | fairseq.trainer | NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 32.0
epoch 001: 0%|▏ | 2/1401 [00:21<3:48:24, 9.80s/it]
2022-03-21 14:38:49 | INFO | fairseq.trainer | NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 16.0
epoch 001: 4%|███▎ | 53/1401 [01:06<14:50, 1.51it/s]
2022-03-21 14:39:29 | INFO | fairseq.trainer | NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 8.0
epoch 001: 13%|▏| 185/1401 [02:34<13:29, 1.50it/s, loss=inf, ppl=inf, wps=108973, ups=1.48, wpb=73714.1, bsz=24, num_updates=100, lr=0.00
What's your environment?
- fairseq Version (e.g., 1.0 or main): main (1.0.0a0+b554f5e)
- PyTorch Version (e.g., 1.0): 1.8.0+cu101
- OS (e.g., Linux): Linux
- How you installed fairseq (pip, source): source
- Build command you used (if compiling from source): pip install --editable ./
- Python version: 3.7
- CUDA/cuDNN version: cu101
- GPU models and configuration:
- Any other relevant information:
Hi @ghrua, have you found a solution? I am facing the same problem...
Hi @jamfly, yes, I think there are two solutions:
- Trace which operation overflows, step by step, and address it directly.
- Pretrain the ADP model for 3 epochs in FP32, then load those parameters when training with FP16. Be sure to reset the optimizer when loading the FP32 checkpoint under the FP16 setting.
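For the first option, one way to localize the overflow is to register backward hooks on each parameter and flag any gradient containing inf/nan. This is a generic PyTorch sketch, not fairseq-specific; the helper name and the toy demo are my own:

```python
import torch

def attach_overflow_hooks(model):
    """Flag parameters whose gradients contain inf/nan during backward."""
    flagged = []
    for name, param in model.named_parameters():
        def hook(grad, name=name):
            # torch.isfinite is False for both inf and nan entries
            if not torch.isfinite(grad).all():
                flagged.append(name)
            return grad
        param.register_hook(hook)
    return flagged

# Tiny demo: huge activations make the weight gradient overflow to inf,
# while the bias gradient stays finite.
model = torch.nn.Linear(4, 2)
flagged = attach_overflow_hooks(model)
out = model(torch.full((1, 4), 1e38))
(out.sum() * 1e38).backward()
print(flagged)  # ['weight']
```

Inspecting which parameters get flagged first (e.g., the adaptive softmax head vs. attention layers) narrows down where the FP16 range is being exceeded.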
The second one is somewhat hacky, but it works for me.
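A rough sketch of that two-stage recipe follows. The paths, save directories, and the 3-epoch cutoff are illustrative; `--restore-file`, `--reset-optimizer`, `--reset-lr-scheduler`, and `--reset-meters` are standard fairseq flags, though exact behavior may vary across versions:

```shell
# Stage 1: warm up in FP32 for a few epochs (same flags as the command
# above, minus --fp16), saving checkpoints to a separate directory.
python train.py $DEST/data-bin/wikitext-103 \
    --task language_modeling --arch transformer_lm_wiki103 \
    --save-dir $DEST/results/fp32_warmup --max-epoch 3
    # ... remaining hyper-parameters as in the command above ...

# Stage 2: restart in FP16 from the FP32 weights, resetting optimizer
# state so stale FP32 momentum does not carry over into the FP16 run.
python train.py $DEST/data-bin/wikitext-103 \
    --task language_modeling --arch transformer_lm_wiki103 \
    --save-dir $DEST/results/fp16_main \
    --restore-file $DEST/results/fp32_warmup/checkpoint3.pt \
    --reset-optimizer --reset-lr-scheduler --reset-meters \
    --fp16
    # ... remaining hyper-parameters as in the command above ...
```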
Hi @ghrua, thank you for your suggestions; I will at least try the second one. I have two questions about the training:
- Have you tried fp16 from scratch? Did the inf loss return to a normal scale?
- I noticed that you set update-freq to 3, but according to the paper they use tokens-per-sample 4096 with 8 GPUs. I know they say they changed to 3072 for better performance, but is update-freq always set to 3?
Thank you in advance.
- Have you tried fp16 from scratch? Did the inf loss return to a normal scale?
Yes, I have tried FP16 from scratch with many hyper-parameter settings, e.g., different values of warmup updates and clip norm, but none of them worked for me.
- I noticed that you set update-freq to 3, but according to the paper they use tokens-per-sample 4096 with 8 GPUs. I know they say they changed to 3072 for better performance, but is update-freq always set to 3?
In the batching section, the authors state, "This gives an effective batch size of 65K tokens for WIKITEXT-103." Since 65,000 / 8 / 3072 is around 2.6, I think that is why they set update-freq to 3.
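The arithmetic can be checked directly. Reading 65K as 2^16 = 65,536 is my assumption; the paper only says "65K":

```python
import math

# Target effective batch from the paper: ~65K tokens per update.
effective_tokens = 65_536  # assuming 65K means 2**16
gpus = 8
tokens_per_gpu = 3072      # --max-tokens in the command above

# Rounding up gives the smallest update-freq reaching the target.
update_freq = math.ceil(effective_tokens / (gpus * tokens_per_gpu))
print(update_freq)  # 3 (65536 / 24576 is about 2.67)
```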
Hi @ghrua, I got it: they use 4096 (tokens per sample) * 8 (GPUs) * 2 (update-freq). Anyway, thank you for your kind help and suggestions, I really appreciate it.
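Those figures do multiply out to the paper's 65K effective batch:

```python
tokens_per_sample = 4096
gpus = 8
update_freq = 2

# Effective tokens per optimizer update across all GPUs.
print(tokens_per_sample * gpus * update_freq)  # 65536, i.e. the 65K in the paper
```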
Can you replicate the results in the paper? I ran the same recipe as yours and got a test ppl of 29.14, but the paper reports 18.7.
Hi @Psycoy, sorry for the late reply. Mine was 19.7; close to 18.7, but there is still a gap. Can you reproduce their results?
Yes, I can, as long as the update frequency is set correctly according to the number of GPUs and the batch size.
Which script did you use to evaluate your model?