multi-node distributed training rank0 hang at dataloader after a few epochs
🐛 Bug
I am using fairseq for multi-GPU distributed training. With 1 node and 8 GPUs the training works well. However, with 4 nodes and 32 GPUs, the training freezes after N epochs (N ranges from 1 to 3). The debugger shows that rank0 is stuck in the dataloader with the stack below, while all other 31 ranks are waiting for rank0.
This is 100% reproducible; I ran the same test about 20 times.
- Each time it gets stuck at the same place in the code: rank0 hangs in the dataloader, and all other ranks wait because of all_reduce() (see the sketch below). When the hang happens, rank0 has already finished several steps (e.g. X is 5).
- It is always rank0 that gets stuck.
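As I understand it, the other ranks block there because all_reduce() is a collective: it only returns once every rank has called it, so if rank0 never reaches its all_reduce() while it is blocked in the dataloader, the remaining 31 ranks wait forever. A minimal sketch of that behavior (plain torch.distributed, not fairseq code; the gloo backend, world size, rendezvous address, and the sleep are placeholders for illustration):

```python
# Minimal illustration (not fairseq code): if one rank never calls all_reduce(),
# every other rank blocks inside the collective indefinitely.
import os
import time

import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def worker(rank, world_size):
    os.environ["MASTER_ADDR"] = "127.0.0.1"  # placeholder rendezvous address
    os.environ["MASTER_PORT"] = "29500"      # placeholder port
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    if rank == 0:
        # Simulate rank0 being stuck in the dataloader: it never joins the collective.
        time.sleep(3600)
    else:
        t = torch.ones(1)
        dist.all_reduce(t)  # blocks forever, like the _all_reduce_dict frames below


if __name__ == "__main__":
    mp.spawn(worker, args=(4,), nprocs=4)
```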
Note: the exact line numbers may differ from the latest GitHub code.
The stack for rank0:
Thread 0x7FA6E7C01740 (idle): "MainThread"
wait (threading.py:302)
get (queue.py:170)
__next__ (fairseq/data/iterators.py:648)
__iter__ (fairseq/data/iterators.py:59)
_chunk_iterator (fairseq/data/iterators.py:528)
__iter__ (fairseq/data/iterators.py:59)
__iter__ (fairseq/logging/progress_bar.py:256)
train (fairseq_cli/train.py:274)
inner (contextlib.py:75)
main (fairseq_cli/train.py:165)
distributed_main (fairseq/distributed/utils.py:326)
call_main (fairseq/distributed/utils.py:352)
cli_main (fairseq_cli/train.py:499)
<module> (train.py:14)
The stack for other ranks:
Thread 0x7F6F32481740 (active): "MainThread"
_all_reduce_dict (fairseq/distributed/utils.py:661)
all_reduce_dict (fairseq/distributed/utils.py:667)
_fast_stat_sync_sum (fairseq/trainer.py:1195)
_aggregate_logging_outputs (fairseq/trainer.py:1135)
train_step (fairseq/trainer.py:708)
inner (contextlib.py:75)
train (fairseq_cli/train.py:278)
inner (contextlib.py:75)
main (fairseq_cli/train.py:165)
distributed_main (fairseq/distributed/utils.py:326)
call_main (fairseq/distributed/utils.py:352)
cli_main (fairseq_cli/train.py:499)
<module> (train.py:14)
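In case it helps with reproducing the diagnosis: a similar per-thread dump can also be obtained without attaching an external debugger by registering Python's standard faulthandler near the top of train.py. This is just a sketch; the choice of SIGUSR1 is arbitrary:

```python
# Sketch: dump every thread's stack when the process receives SIGUSR1,
# so a hung rank can be inspected with `kill -USR1 <pid>`.
import faulthandler
import signal

faulthandler.register(signal.SIGUSR1, all_threads=True)
```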
To Reproduce
I am using my own dataset. The command I ran is:
torchrun --nnodes=4 --nproc_per_node=8 --rdzv_id=fabcd202-79ed-42cd-a0f0-78a5c145e733 --rdzv_backend=c10d --rdzv_endpoint=node-0:6105 train.py dataset_name --num-workers 8 --ddp-backend=c10d ...
Code sample
Expected behavior
No hang at the dataloader.
Environment
- fairseq: 1.0.0a0
- PyTorch: 1.11
- OS: Linux (Ubuntu 20.04)
- fairseq install: source
- Python version: 3.8
- CUDA: 11.5
- GPU: A100 80GB
Ping. Any comments on this issue? Thanks.
The issue is not resolved yet. I found that with --num-workers 0 there is no hang. With --num-workers 2 it is much less likely to hang (it still hangs after a long run).
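For context on why --num-workers may matter: with num_workers=0 everything is loaded in the main process, while with num_workers > 0 batches are produced by background workers and passed through queues (the rank0 stack above is blocked in a queue get()). A minimal sketch of the two modes with a plain PyTorch DataLoader (not fairseq internals; the dataset is a dummy placeholder, not my real data):

```python
# Sketch comparing num_workers=0 (in-process loading) with num_workers>0
# (worker processes handing batches over through an internal queue).
import torch
from torch.utils.data import DataLoader, Dataset


class DummyDataset(Dataset):
    def __len__(self):
        return 1024

    def __getitem__(self, idx):
        return torch.tensor([idx], dtype=torch.float32)


if __name__ == "__main__":
    # num_workers=0: __getitem__ runs in the main process, no worker queue involved.
    in_process_loader = DataLoader(DummyDataset(), batch_size=32, num_workers=0)

    # num_workers=8: batches are produced by 8 worker processes and fetched from
    # a queue, the same kind of hand-off the rank0 stack is blocked on.
    worker_loader = DataLoader(DummyDataset(), batch_size=32, num_workers=8)

    for loader in (in_process_loader, worker_loader):
        for _batch in loader:
            pass  # just consume the batches
```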