multi-node distributed training rank0 hang at dataloader after a few epochs
🐛 Bug
I am using fairseq for multi-GPU distributed training. With 1 node and 8 GPUs the training works well. However, with 4 nodes and 32 GPUs, the training freezes after N epochs (N ranges from 1 to 3). The debugger shows that rank0 is stuck in the dataloader with the stack below, while all other 31 ranks are waiting for rank0.
This is 100% reproducible; I ran the same test about 20 times.
- Each time it gets stuck at the same place in the code: rank0 hangs in the dataloader, and all other ranks wait because of all_reduce() (see the sketch below). When the hang happens, rank0 has already finished several steps (e.g. X is 5).
- It is always rank0 that gets stuck.
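As I understand it, the other ranks block there because all_reduce() is a collective: it only returns once every rank has called it, so if rank0 never reaches its all_reduce() while it is blocked in the dataloader, the remaining 31 ranks wait forever. A minimal sketch of that behavior (plain torch.distributed, not fairseq code; the gloo backend, world size, rendezvous address, and the sleep are placeholders for illustration):

```python
# Minimal illustration (not fairseq code): if one rank never calls all_reduce(),
# every other rank blocks inside the collective indefinitely.
import os
import time

import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def worker(rank, world_size):
    os.environ["MASTER_ADDR"] = "127.0.0.1"  # placeholder rendezvous address
    os.environ["MASTER_PORT"] = "29500"      # placeholder port
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    if rank == 0:
        # Simulate rank0 being stuck in the dataloader: it never joins the collective.
        time.sleep(3600)
    else:
        t = torch.ones(1)
        dist.all_reduce(t)  # blocks forever, like the _all_reduce_dict frames below


if __name__ == "__main__":
    mp.spawn(worker, args=(4,), nprocs=4)
```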
Note: the exact line numbers may differ from the latest GitHub code.
The stack for rank0:
Thread 0x7FA6E7C01740 (idle): "MainThread"
wait (threading.py:302)
get (queue.py:170)
__next__ (fairseq/data/iterators.py:648)
__iter__ (fairseq/data/iterators.py:59)
_chunk_iterator (fairseq/data/iterators.py:528)
__iter__ (fairseq/data/iterators.py:59)
__iter__ (fairseq/logging/progress_bar.py:256)
train (fairseq_cli/train.py:274)
inner (contextlib.py:75)
main (fairseq_cli/train.py:165)
distributed_main (fairseq/distributed/utils.py:326)
call_main (fairseq/distributed/utils.py:352)
cli_main (fairseq_cli/train.py:499)
<module> (train.py:14)
The stack for other ranks:
Thread 0x7F6F32481740 (active): "MainThread"
_all_reduce_dict (fairseq/distributed/utils.py:661)
all_reduce_dict (fairseq/distributed/utils.py:667)
_fast_stat_sync_sum (fairseq/trainer.py:1195)
_aggregate_logging_outputs (fairseq/trainer.py:1135)
train_step (fairseq/trainer.py:708)
inner (contextlib.py:75)
train (fairseq_cli/train.py:278)
inner (contextlib.py:75)
main (fairseq_cli/train.py:165)
distributed_main (fairseq/distributed/utils.py:326)
call_main (fairseq/distributed/utils.py:352)
cli_main (fairseq_cli/train.py:499)
<module> (train.py:14)
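In case it helps with reproducing the diagnosis: a similar per-thread dump can also be obtained without attaching an external debugger by registering Python's standard faulthandler near the top of train.py. This is just a sketch; the choice of SIGUSR1 is arbitrary:

```python
# Sketch: dump every thread's stack when the process receives SIGUSR1,
# so a hung rank can be inspected with `kill -USR1 <pid>`.
import faulthandler
import signal

faulthandler.register(signal.SIGUSR1, all_threads=True)
```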
To Reproduce
I am using my own dataset. The command I ran is:
torchrun --nnodes=4 --nproc_per_node=8 --rdzv_id=fabcd202-79ed-42cd-a0f0-78a5c145e733 --rdzv_backend=c10d --rdzv_endpoint=node-0:6105 train.py dataset_name --num-workers 8 --ddp-backend=c10d ...
Code sample
Expected behavior
No hang at the dataloader.
Environment
- fairseq: 1.0.0a0
- PyTorch: 1.11
- OS: Linux (Ubuntu 20.04)
- fairseq install: source
- Python version: 3.8
- CUDA: 11.5
- GPU: A100 80GB
Ping. Any comments on this issue? Thanks.
The issue is not resolved yet. I found that with --num-workers 0 there is no hang. With --num-workers 2 it is much less likely to hang (it still hangs after a long run).
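For context on why --num-workers may matter: with num_workers=0 everything is loaded in the main process, while with num_workers > 0 batches are produced by background workers and passed through queues (the rank0 stack above is blocked in a queue get()). A minimal sketch of the two modes with a plain PyTorch DataLoader (not fairseq internals; the dataset is a dummy placeholder, not my real data):

```python
# Sketch comparing num_workers=0 (in-process loading) with num_workers>0
# (worker processes handing batches over through an internal queue).
import torch
from torch.utils.data import DataLoader, Dataset


class DummyDataset(Dataset):
    def __len__(self):
        return 1024

    def __getitem__(self, idx):
        return torch.tensor([idx], dtype=torch.float32)


if __name__ == "__main__":
    # num_workers=0: __getitem__ runs in the main process, no worker queue involved.
    in_process_loader = DataLoader(DummyDataset(), batch_size=32, num_workers=0)

    # num_workers=8: batches are produced by 8 worker processes and fetched from
    # a queue, the same kind of hand-off the rank0 stack is blocked on.
    worker_loader = DataLoader(DummyDataset(), batch_size=32, num_workers=8)

    for loader in (in_process_loader, worker_loader):
        for _batch in loader:
            pass  # just consume the batches
```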