[BUG]: Using mpirun to launch multi-node training gets stuck in colossalai.launch_from_openmpi

Open pluiez opened this issue 1 year ago • 6 comments

🐛 Describe the bug

Running mpirun to launch distributed training on 2 nodes (2x8 GPUs) gets stuck in the colossalai.launch_from_openmpi() function. All 16 processes are visible with the top command on the 2 nodes.

Launch command: mpirun --allow-run-as-root -np 16 -hostfile hosts python train.py --config configs/config.py --host 10.80.210.83 --port 29500

The hosts file contains the following content:

10.80.210.83 slots=8
10.80.209.79 slots=8
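
As a quick sanity check on the setup above, the slot counts in the hosts file should add up to the value passed to -np (8 + 8 = 16 here). The sketch below is only an illustration: the helper name total_slots is hypothetical, and treating a host without a slots= entry as one slot is a simplifying assumption, not necessarily what Open MPI does.

```python
def total_slots(hostfile_text):
    """Sum the slots=N values found in hostfile-style text."""
    total = 0
    for line in hostfile_text.splitlines():
        # Drop trailing comments and skip blank lines.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        fields = line.split()
        # Assumption for this sketch: a host listed without slots= counts once.
        slots = 1
        for field in fields[1:]:
            if field.startswith("slots="):
                slots = int(field.split("=", 1)[1])
        total += slots
    return total


hosts = "10.80.210.83 slots=8\n10.80.209.79 slots=8"
print(total_slots(hosts))  # 16, matching -np 16 in the launch command
```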

Environment

No response

pluiez avatar Feb 16 '23 10:02 pluiez

Try mpirun --allow-run-as-root -np 16 -hostfile hosts python train.py --config configs/config.py --host 10.80.210.83:8,10.80.209.79:8 --port 29500?

JThh avatar Feb 16 '23 16:02 JThh

No, they are not recognized as valid addresses. [image]

pluiez avatar Feb 17 '23 12:02 pluiez

Can @FrankLeeeee help take a look at this issue? Thanks!

JThh avatar Feb 17 '23 13:02 JThh

My intuition is that something in this function caused the hang.

JThh avatar Feb 17 '23 13:02 JThh

Same problem here. Has this bug been fixed?

SamaelChen avatar Apr 04 '23 02:04 SamaelChen

Hi @pluiez @SamaelChen, we use --host and --port as the master address for PyTorch distributed communication, so --host is supposed to be a single address.

kurisusnowdeng avatar Apr 19 '23 03:04 kurisusnowdeng
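
To illustrate the point about a single master address, here is a minimal sketch of a check a training script could run on --host before launching. The function name check_master_address is hypothetical, and rejecting every colon assumes an IPv4 address or plain hostname (an IPv6 address would need different handling).

```python
def check_master_address(host):
    """Return host unchanged if it looks like one address, else raise ValueError."""
    host = host.strip()
    if not host:
        raise ValueError("--host must not be empty")
    if "," in host or " " in host:
        raise ValueError(f"--host must be a single address, got a list: {host!r}")
    # Assumption for this sketch: IPv4 or hostname only, so a colon means a
    # slot suffix such as 10.80.210.83:8, which belongs in the hostfile.
    if ":" in host:
        raise ValueError(f"--host must not carry a slot or port suffix: {host!r}")
    return host


print(check_master_address("10.80.210.83"))
# The checked value would then be passed on together with the port, e.g.
#   colossalai.launch_from_openmpi(config=args.config, host=host, port=args.port)
```

With this check, the value suggested earlier in the thread (10.80.210.83:8,10.80.209.79:8) would be rejected up front, while the per-node slot counts stay in the hosts file given to mpirun.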

When launching with mpirun, I noticed that ColossalAI initializes the task's local rank from the RANK environment variable, which mpirun doesn't set. It could use OMPI_COMM_WORLD_LOCAL_RANK instead.

RuntimeError: Could not find 'RANK' in the torch environment, visit https://www.colossalai.org/ for more information on launching with torch
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/colossalai/initialize.py", line 208, in launch_from_torch
    rank = int(os.environ['RANK'])

How do you handle this?

gfiameni avatar May 10 '23 09:05 gfiameni
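
One possible workaround, sketched here under assumptions rather than taken from ColossalAI itself, is to copy the variables that Open MPI sets for each process into the torch-style names before calling a torch-based launcher. Note that the global rank corresponds to OMPI_COMM_WORLD_RANK, while OMPI_COMM_WORLD_LOCAL_RANK is the rank within a node. The helper name torch_env_from_openmpi is hypothetical; RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR, and MASTER_PORT are the names torchrun normally provides.

```python
import os

# Open MPI variable -> torch-style variable.
OMPI_TO_TORCH = {
    "OMPI_COMM_WORLD_RANK": "RANK",              # global rank across all nodes
    "OMPI_COMM_WORLD_LOCAL_RANK": "LOCAL_RANK",  # rank within the current node
    "OMPI_COMM_WORLD_SIZE": "WORLD_SIZE",        # total number of processes
}


def torch_env_from_openmpi(env, master_addr, master_port):
    """Build torch-style launch variables from an Open MPI environment mapping."""
    missing = [name for name in OMPI_TO_TORCH if name not in env]
    if missing:
        raise RuntimeError(f"not launched by Open MPI? missing: {missing}")
    result = {torch: env[ompi] for ompi, torch in OMPI_TO_TORCH.items()}
    result["MASTER_ADDR"] = master_addr
    result["MASTER_PORT"] = str(master_port)
    return result


if __name__ == "__main__" and "OMPI_COMM_WORLD_RANK" in os.environ:
    # Example values from this issue; run under mpirun for this to take effect.
    os.environ.update(torch_env_from_openmpi(os.environ, "10.80.210.83", 29500))
```

After the update, code that reads os.environ['RANK'], as in the traceback above, would find a value. Whether this is preferable to calling colossalai.launch_from_openmpi directly depends on the installed version.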