FloatingPointError: Fatal error: gradients are inconsistent between workers
🐛 Bug
I can train Transformer models, but not fully convolutional or LSTM models (e.g. fconv, fconv_iwslt_de_en, fconv_wmt_en_de, lstm, lstm_luong_wmt_en_de, ...), because the gradients are inconsistent between workers.
Following this thread, I have tried a range of different arguments such as --ddp-backend=no_c10d, --ddp-backend=legacy_ddp, and --use-bmuf. I have also limited --max-tokens (256, 512, 1024, ...) and the batch size (16, 32, 64) to rule out memory problems.
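Concretely, the variations I tried look roughly like this; only the changed flags are listed (the full baseline command is under "To Reproduce" below), and --batch-size is simply the flag I used to cap the batch size:
# same command as under "To Reproduce", with one of these added:
--ddp-backend=no_c10d
--ddp-backend=legacy_ddp
--use-bmuf
# and, separately, lowering the memory footprint:
--max-tokens 512     # instead of --max-tokens 2048 (also tried 256 and 1024)
--batch-size 32      # also tried 16 and 64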
To Reproduce
I'm using the Europarl v7 es-en dataset, tokenized with Moses and fastBPE, but this error appears regardless of the dataset.
fairseq-train \
$BASE_PATH/data-bin \
--arch transformer \
--optimizer adam \
--criterion cross_entropy \
--max-tokens 2048 \
--max-epoch 50 \
--seed 1234 \
--clip-norm 1.0 \
--patience 5 \
--save-dir $BASE_PATH/checkpoints \
--log-format simple \
--no-epoch-checkpoints
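The command above is my baseline with --arch transformer, which trains fine. The error below appears when the architecture is swapped for one of the failing ones, keeping every other flag identical, for example with fconv_wmt_en_de (lstm and lstm_luong_wmt_en_de fail in the same way):
fairseq-train \
$BASE_PATH/data-bin \
--arch fconv_wmt_en_de \
--optimizer adam \
--criterion cross_entropy \
--max-tokens 2048 \
--max-epoch 50 \
--seed 1234 \
--clip-norm 1.0 \
--patience 5 \
--save-dir $BASE_PATH/checkpoints \
--log-format simple \
--no-epoch-checkpoints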
Error:
2021-09-29 14:33:21 | INFO | fairseq.trainer | begin training epoch 1
2021-09-29 14:33:21 | INFO | fairseq_cli.train | Start iterating over samples
/home/scarrion/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py:949: UserWarning: Using non-full backward hooks on a Module that does not take as input a single Tensor or a tuple of Tensors is deprecated and will be removed in future versions. This hook will be missing some of the grad_input. Please use register_full_backward_hook to get the documented behavior.
warnings.warn("Using non-full backward hooks on a Module that does not take as input a "
/home/scarrion/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py:949: UserWarning: Using non-full backward hooks on a Module that does not take as input a single Tensor or a tuple of Tensors is deprecated and will be removed in future versions. This hook will be missing some of the grad_input. Please use register_full_backward_hook to get the documented behavior.
warnings.warn("Using non-full backward hooks on a Module that does not take as input a "
/home/scarrion/packages/fairseq/fairseq/utils.py:373: UserWarning: amp_C fused kernels unavailable, disabling multi_tensor_l2norm; you may get better performance by installing NVIDIA's apex library
"amp_C fused kernels unavailable, disabling multi_tensor_l2norm; "
/home/scarrion/packages/fairseq/fairseq/utils.py:373: UserWarning: amp_C fused kernels unavailable, disabling multi_tensor_l2norm; you may get better performance by installing NVIDIA's apex library
"amp_C fused kernels unavailable, disabling multi_tensor_l2norm; "
2021-09-29 14:33:40 | INFO | root | Reducer buckets have been rebuilt in this iteration.
/home/scarrion/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py:974: UserWarning: Using a non-full backward hook when the forward contains multiple autograd Nodes is deprecated and will be removed in future versions. This hook will be missing some grad_input. Please use register_full_backward_hook to get the documented behavior.
warnings.warn("Using a non-full backward hook when the forward contains multiple autograd Nodes "
/home/scarrion/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py:974: UserWarning: Using a non-full backward hook when the forward contains multiple autograd Nodes is deprecated and will be removed in future versions. This hook will be missing some grad_input. Please use register_full_backward_hook to get the documented behavior.
warnings.warn("Using a non-full backward hook when the forward contains multiple autograd Nodes "
Traceback (most recent call last):
File "/home/scarrion/anaconda3/bin/fairseq-train", line 33, in <module>
sys.exit(load_entry_point('fairseq', 'console_scripts', 'fairseq-train')())
File "/home/scarrion/packages/fairseq/fairseq_cli/train.py", line 507, in cli_main
distributed_utils.call_main(cfg, main)
File "/home/scarrion/packages/fairseq/fairseq/distributed/utils.py", line 351, in call_main
join=True,
File "/home/scarrion/anaconda3/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/home/scarrion/anaconda3/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
while not context.join():
File "/home/scarrion/anaconda3/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 150, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:
-- Process 1 terminated with the following error:
Traceback (most recent call last):
File "/home/scarrion/packages/fairseq/fairseq/trainer.py", line 860, in train_step
self._check_grad_norms(grad_norm)
File "/home/scarrion/packages/fairseq/fairseq/trainer.py", line 1405, in _check_grad_norms
+ "-" * 80
FloatingPointError: Fatal error: gradients are inconsistent between workers. Try --ddp-backend=legacy_ddp. Or are you mixing up different generation of GPUs in training?
--------------------------------------------------------------------------------
grad_norm across the workers:
rank 0 = inf
rank 1 = inf
--------------------------------------------------------------------------------
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/scarrion/anaconda3/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
fn(i, *args)
File "/home/scarrion/packages/fairseq/fairseq/distributed/utils.py", line 328, in distributed_main
main(cfg, **kwargs)
File "/home/scarrion/packages/fairseq/fairseq_cli/train.py", line 180, in main
valid_losses, should_stop = train(cfg, trainer, task, epoch_itr)
File "/home/scarrion/anaconda3/lib/python3.7/contextlib.py", line 74, in inner
return func(*args, **kwds)
File "/home/scarrion/packages/fairseq/fairseq_cli/train.py", line 291, in train
log_output = trainer.train_step(samples)
File "/home/scarrion/anaconda3/lib/python3.7/contextlib.py", line 74, in inner
return func(*args, **kwds)
File "/home/scarrion/packages/fairseq/fairseq/trainer.py", line 897, in train_step
**extra_kwargs,
File "/home/scarrion/packages/fairseq/fairseq/tasks/fairseq_task.py", line 492, in train_step
loss, sample_size, logging_output = criterion(model, sample)
File "/home/scarrion/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/home/scarrion/packages/fairseq/fairseq/criterions/cross_entropy.py", line 35, in forward
net_output = model(**sample["net_input"])
File "/home/scarrion/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/home/scarrion/packages/fairseq/fairseq/distributed/module_proxy_wrapper.py", line 55, in forward
return self.module(*args, **kwargs)
File "/home/scarrion/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/home/scarrion/anaconda3/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 799, in forward
output = self.module(*inputs[0], **kwargs[0])
File "/home/scarrion/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1071, in _call_impl
result = forward_call(*input, **kwargs)
File "/home/scarrion/packages/fairseq/fairseq/models/fairseq_model.py", line 319, in forward
encoder_out = self.encoder(src_tokens, src_lengths=src_lengths, **kwargs)
File "/home/scarrion/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1088, in _call_impl
var = next((v for v in var.values() if isinstance(v, torch.Tensor)))
StopIteration
Environment
- fairseq Version: fairseq-1.0.0a0+f34abcf (master)
- PyTorch Version: 1.8.1
- OS: Linux
- How you installed fairseq (pip, source): pip install --editable ./
- Python version: 3.7.4
- CUDA/cuDNN version: 10.2
- GPU models and configuration: 2x TITAN Xp (12 GB)
I solved it by using a different optimizer (nag instead of adam). At this point, I don't know if this is a bug or just a weird way of saying that the hyperparameters I chose for my model are not correct.
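In case it is useful to anyone, the only change was the optimizer flag; everything else in my command stayed the same:
# before (crashes with the inconsistent grad_norm error):
--optimizer adam
# after (trains normally for me):
--optimizer nag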
I have the same problem, and the grad_norm on one of the workers is 0. I don't know why yet. Thanks for your solution.
Same problem here, but my grad_norm values are moderate (not NaN, inf, nor 0) and quite close:
grad_norm across the workers:
rank 0 = 11.16489500
rank 1 = 10.57914402
I have confirmed that my 2 GPUs are the same.
I can work around it by switching to --ddp-backend=no_c10d, but I would still like to figure out why I cannot use the faster c10d backend.
The other suggested solutions did not work for me.
Could you take a look at this? Any suggestions would be highly appreciated. Thanks for taking the time, @dianaml0!