FloatingPointError: Fatal error: gradients are inconsistent between workers
🐛 Bug
I can train Transformer models, but not fully convolutional or LSTM models (e.g. fconv, fconv_iwslt_de_en, fconv_wmt_en_de, lstm, lstm_luong_wmt_en_de, ...), because the gradients are inconsistent between workers.
Following this thread, I have tried a range of different arguments such as --ddp-backend=no_c10d, --ddp-backend=legacy_ddp, and --use-bmuf. I have also limited --max-tokens (256, 512, 1024, ...) and the batch size (16, 32, 64) to rule out memory problems.
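Concretely, the variations I tried look roughly like this; only the changed flags are listed (the full baseline command is under "To Reproduce" below), and --batch-size is simply the flag I used to cap the batch size:
# same command as under "To Reproduce", with one of these added:
--ddp-backend=no_c10d
--ddp-backend=legacy_ddp
--use-bmuf
# and, separately, lowering the memory footprint:
--max-tokens 512     # instead of --max-tokens 2048 (also tried 256 and 1024)
--batch-size 32      # also tried 16 and 64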
To Reproduce
I'm using the Europarl v7 es-en dataset, tokenized with Moses and fastBPE, but this error appears regardless of the dataset.
fairseq-train \
$BASE_PATH/data-bin \
--arch transformer \
--optimizer adam \
--criterion cross_entropy \
--max-tokens 2048 \
--max-epoch 50 \
--seed 1234 \
--clip-norm 1.0 \
--patience 5 \
--save-dir $BASE_PATH/checkpoints \
--log-format simple \
--no-epoch-checkpoints
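The command above is my baseline with --arch transformer, which trains fine. The error below appears when the architecture is swapped for one of the failing ones, keeping every other flag identical, for example with fconv_wmt_en_de (lstm and lstm_luong_wmt_en_de fail in the same way):
fairseq-train \
$BASE_PATH/data-bin \
--arch fconv_wmt_en_de \
--optimizer adam \
--criterion cross_entropy \
--max-tokens 2048 \
--max-epoch 50 \
--seed 1234 \
--clip-norm 1.0 \
--patience 5 \
--save-dir $BASE_PATH/checkpoints \
--log-format simple \
--no-epoch-checkpoints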
Error:
2021-09-29 14:33:21 | INFO | fairseq.trainer | begin training epoch 1
2021-09-29 14:33:21 | INFO | fairseq_cli.train | Start iterating over samples
/home/scarrion/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py:949: UserWarning: Using non-full backward hooks on a Module that does not take as input a single Tensor or a tuple of Tensors is deprecated and will be removed in future versions. This hook will be missing some of the grad_input. Please use register_full_backward_hook to get the documented behavior.
warnings.warn("Using non-full backward hooks on a Module that does not take as input a "
/home/scarrion/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py:949: UserWarning: Using non-full backward hooks on a Module that does not take as input a single Tensor or a tuple of Tensors is deprecated and will be removed in future versions. This hook will be missing some of the grad_input. Please use register_full_backward_hook to get the documented behavior.
warnings.warn("Using non-full backward hooks on a Module that does not take as input a "
/home/scarrion/packages/fairseq/fairseq/utils.py:373: UserWarning: amp_C fused kernels unavailable, disabling multi_tensor_l2norm; you may get better performance by installing NVIDIA's apex library
"amp_C fused kernels unavailable, disabling multi_tensor_l2norm; "
/home/scarrion/packages/fairseq/fairseq/utils.py:373: UserWarning: amp_C fused kernels unavailable, disabling multi_tensor_l2norm; you may get better performance by installing NVIDIA's apex library
"amp_C fused kernels unavailable, disabling multi_tensor_l2norm; "
2021-09-29 14:33:40 | INFO | root | Reducer buckets have been rebuilt in this iteration.
/home/scarrion/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py:974: UserWarning: Using a non-full backward hook when the forward contains multiple autograd Nodes is deprecated and will be removed in future versions. This hook will be missing some grad_input. Please use register_full_backward_hook to get the documented behavior.
warnings.warn("Using a non-full backward hook when the forward contains multiple autograd Nodes "
/home/scarrion/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py:974: UserWarning: Using a non-full backward hook when the forward contains multiple autograd Nodes is deprecated and will be removed in future versions. This hook will be missing some grad_input. Please use register_full_backward_hook to get the documented behavior.
warnings.warn("Using a non-full backward hook when the forward contains multiple autograd Nodes "
Traceback (most recent call last):
File "/home/scarrion/anaconda3/bin/fairseq-train", line 33, in <module>
sys.exit(load_entry_point('fairseq', 'console_scripts', 'fairseq-train')())
File "/home/scarrion/packages/fairseq/fairseq_cli/train.py", line 507, in cli_main
distributed_utils.call_main(cfg, main)
File "/home/scarrion/packages/fairseq/fairseq/distributed/utils.py", line 351, in call_main
join=True,
File "/home/scarrion/anaconda3/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/home/scarrion/anaconda3/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
while not context.join():
File "/home/scarrion/anaconda3/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 150, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:
-- Process 1 terminated with the following error:
Traceback (most recent call last):
File "/home/scarrion/packages/fairseq/fairseq/trainer.py", line 860, in train_step
self._check_grad_norms(grad_norm)
File "/home/scarrion/packages/fairseq/fairseq/trainer.py", line 1405, in _check_grad_norms
+ "-" * 80
FloatingPointError: Fatal error: gradients are inconsistent between workers. Try --ddp-backend=legacy_ddp. Or are you mixing up different generation of GPUs in training?
--------------------------------------------------------------------------------
grad_norm across the workers:
rank 0 = inf
rank 1 = inf
--------------------------------------------------------------------------------
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/scarrion/anaconda3/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
fn(i, *args)
File "/home/scarrion/packages/fairseq/fairseq/distributed/utils.py", line 328, in distributed_main
main(cfg, **kwargs)
File "/home/scarrion/packages/fairseq/fairseq_cli/train.py", line 180, in main
valid_losses, should_stop = train(cfg, trainer, task, epoch_itr)
File "/home/scarrion/anaconda3/lib/python3.7/contextlib.py", line 74, in inner
return func(*args, **kwds)
File "/home/scarrion/packages/fairseq/fairseq_cli/train.py", line 291, in train
log_output = trainer.train_step(samples)
File "/home/scarrion/anaconda3/lib/python3.7/contextlib.py", line 74, in inner
return func(*args, **kwds)
File "/home/scarrion/packages/fairseq/fairseq/trainer.py", line 897, in train_step
**extra_kwargs,
File "/home/scarrion/packages/fairseq/fairseq/tasks/fairseq_task.py", line 492, in train_step
loss, sample_size, logging_output = criterion(model, sample)
File "/home/scarrion/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/home/scarrion/packages/fairseq/fairseq/criterions/cross_entropy.py", line 35, in forward
net_output = model(**sample["net_input"])
File "/home/scarrion/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/home/scarrion/packages/fairseq/fairseq/distributed/module_proxy_wrapper.py", line 55, in forward
return self.module(*args, **kwargs)
File "/home/scarrion/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/home/scarrion/anaconda3/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 799, in forward
output = self.module(*inputs[0], **kwargs[0])
File "/home/scarrion/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1071, in _call_impl
result = forward_call(*input, **kwargs)
File "/home/scarrion/packages/fairseq/fairseq/models/fairseq_model.py", line 319, in forward
encoder_out = self.encoder(src_tokens, src_lengths=src_lengths, **kwargs)
File "/home/scarrion/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1088, in _call_impl
var = next((v for v in var.values() if isinstance(v, torch.Tensor)))
StopIteration
Environment
- fairseq Version: fairseq-1.0.0a0+f34abcf (master)
- PyTorch Version: 1.8.1
- OS: Linux
- How you installed fairseq (pip, source): pip install --editable ./
- Python version: 3.7.4
- CUDA/cuDNN version: 10.2
- GPU models and configuration: 2x TITAN Xp (12 GB)
I solved it by using a different optimizer (nag instead of adam). At this point, I don't know if this is a bug or just a weird way of saying that the hyperparameters I chose for my model are not correct.
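In case it is useful to anyone, the only change was the optimizer flag; everything else in my command stayed the same:
# before (crashes with the inconsistent grad_norm error):
--optimizer adam
# after (trains normally for me):
--optimizer nag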
I have the same problem, and the grad_norm on one of the workers is 0. I don't know why yet. Thanks for your solution.
Same problem here, but my grad_norm values are moderate (not NaN, inf, nor 0) and quite close:
grad_norm across the workers:
rank 0 = 11.16489500
rank 1 = 10.57914402
I have confirmed that my 2 GPUs are the same.
I can work around it by switching to --ddp-backend=no_c10d, but I would still like to figure out why I cannot use the faster c10d backend.
The other suggested solutions did not work for me.
Could you take a look at this? Any suggestions would be highly appreciated. Thanks for taking the time, @dianaml0!