[BUG]: Using mpirun to launch multi-node training hangs in colossalai.launch_from_openmpi
🐛 Describe the bug
Running mpirun to launch distributed training on 2 nodes (2x8 GPUs) hangs in the colossalai.launch_from_openmpi()
function. The 16 processes can be seen with the top command on both nodes.
Launch command:
mpirun --allow-run-as-root -np 16 -hostfile hosts python train.py --config configs/config.py --host 10.80.210.83 --port 29500
The hosts file contains the following content:
10.80.210.83 slots=8
10.80.209.79 slots=8
Environment
No response
Try mpirun --allow-run-as-root -np 16 -hostfile hosts python train.py --config configs/config.py --host 10.80.210.83:8,10.80.209.79:8 --port 29500 ?
No, the comma-separated hosts are not recognized as valid addresses.
Can @FrankLeeeee help take a look at this issue? Thanks!
My intuition is that something in this function caused the hang.
Same problem here. Has this bug been fixed?
Hi @pluiez @SamaelChen, we use --host and --port as the master address for PyTorch distributed communication, so it must be a single address.
Using mpirun, I noticed that ColossalAI initializes the task's local rank from the RANK environment variable, which mpirun does not set. It could use OMPI_COMM_WORLD_LOCAL_RANK instead.
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/colossalai/initialize.py", line 208, in launch_from_torch
    rank = int(os.environ['RANK'])
RuntimeError: Could not find 'RANK' in the torch environment, visit https://www.colossalai.org/ for more information on launching with torch
How do you handle this?
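One possible workaround, sketched below: before handing control to the torch-style launcher, copy Open MPI's OMPI_COMM_WORLD_* variables into the names torch.distributed expects. This is a hypothetical helper (map_openmpi_env_to_torch is my own name, not a ColossalAI API), shown only to illustrate the mapping; the actual fix would belong inside launch_from_openmpi.

```python
import os

# Hypothetical helper: translate Open MPI's environment variables into the
# RANK / LOCAL_RANK / WORLD_SIZE variables that torch-style launchers read.
# Open MPI sets OMPI_COMM_WORLD_RANK etc. for every process it spawns, but
# it does not set the torch.distributed names, hence the KeyError on 'RANK'.
def map_openmpi_env_to_torch() -> None:
    mapping = {
        "RANK": "OMPI_COMM_WORLD_RANK",
        "LOCAL_RANK": "OMPI_COMM_WORLD_LOCAL_RANK",
        "WORLD_SIZE": "OMPI_COMM_WORLD_SIZE",
    }
    for torch_var, ompi_var in mapping.items():
        # Only fill in names that are missing, so explicit settings win.
        if torch_var not in os.environ and ompi_var in os.environ:
            os.environ[torch_var] = os.environ[ompi_var]
```

Calling this at the top of train.py (before colossalai initialization) should let the torch-style launch path find RANK under mpirun, assuming the Open MPI variables are present.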