hostfile configuration in multi-node training
According to the README, if I want to run distributed training on multiple machines, I need to launch it with Colossal-AI like this:
colossalai run --nproc_per_node 8 --hostfile hostfile scripts/train.py \
configs/opensora-v1-2/train/stage1.py --data-path YOUR_CSV_PATH --ckpt-path YOUR_PRETRAINED_CKPT
But is there any documentation on how to write the hostfile?
A text file with the host IPs, one per line, would be enough.
**.***.**.**
**.***.**.**
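For illustration, here is a minimal sketch of creating such a file and sanity-checking it before launching. The addresses are placeholders from the documentation range 192.0.2.0/24, not real machines, and `list_hosts` is a helper name made up for this example. Whether the launcher tolerates blank lines is not something verified here, so the helper simply skips them.

```shell
#!/usr/bin/env bash
# Write a hostfile: one address per line.
# 192.0.2.x are placeholder documentation addresses, not real machines.
cat > hostfile <<'EOF'
192.0.2.10
192.0.2.11
EOF

# Hypothetical helper: print the non-empty lines of a hostfile,
# with leading and trailing whitespace trimmed.
list_hosts() {
    sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' "$1" | grep -v '^$'
}

# The number of hosts should match the number of machines you expect to use.
list_hosts hostfile | wc -l
```

The resulting file is what gets passed to `--hostfile` in the command above.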
Thank you very much! But I get an error when I start training:
/bin/bash: SSH_CONNECTION: readonly variable
/bin/bash: SSH_REMOTE_USER: readonly variable
Error: failed to run torchrun --nproc_per_node=2 --nnodes=2 --node_rank=0 --master_addr=x.x.x.x --master_port=29500 scripts/train.py configs/opensora-v1-2/train/stage3.py --data-path mydata.csv --ckpt-path OpenSora-STDiT-v2-stage3 on x.x.x.x, is localhost: False, exception: No authentication methods available
Error: failed to run torchrun --nproc_per_node=2 --nnodes=2 --node_rank=0 --master_addr=x.x.x.x --master_port=29500 scripts/train.py configs/opensora-v1-2/train/stage3.py --data-path mydata.csv --ckpt-path OpenSora-STDiT-v2-stage3 on x.x.x.x, is localhost: True, exception: Encountered a bad command exit code!
Does this mean I need to enable an SSH service on my training machines to use colossalai?
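That seems likely, though it is not confirmed against the launcher's source here: "No authentication methods available" reads like an SSH client error, which would mean the launching node cannot log in to the other host without a prompt. A rough way to check this is sketched below. `check_host` and `check_hostfile` are hypothetical helpers, and `SSH_BIN` is only there so the check can be dry-run with a stand-in command.

```shell
#!/usr/bin/env bash
# Command used to connect; overridable for a dry run.
SSH_BIN="${SSH_BIN:-ssh}"

# Hypothetical helper: succeed only if ssh can log in to the host without
# prompting. BatchMode=yes disables password prompts, so a host that is not
# set up for key-based login fails quickly instead of waiting for input.
check_host() {
    "$SSH_BIN" -o BatchMode=yes -o ConnectTimeout=5 "$1" true
}

# Check every host listed in a hostfile and report the ones that fail.
# Returns non-zero if at least one host could not be reached.
check_hostfile() {
    local failed=0 host
    while IFS= read -r host || [ -n "$host" ]; do
        [ -z "$host" ] && continue
        # </dev/null keeps ssh from swallowing the rest of the hostfile.
        if check_host "$host" </dev/null; then
            echo "ok: $host"
        else
            echo "FAILED: $host"
            failed=1
        fi
    done < "$1"
    return "$failed"
}
```

Running `check_hostfile hostfile` on the machine that launches training should print `ok:` for every line. If a host fails, setting up key-based SSH login from the launching node to that host is probably the first thing to try.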
I used a hostfile too, but the program got stuck. Passing the hosts with --host works well for me; with --hostfile it hangs forever.
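As a possible workaround while --hostfile hangs, the hosts from an existing hostfile can be passed inline instead. The sketch below assumes that --host takes a comma-separated list and that --master_addr should name the first host; both are assumptions to check against `colossalai run --help`. `hosts_csv` and `print_launch_cmd` are hypothetical helpers, and the command is only printed, not run.

```shell
#!/usr/bin/env bash
# Hypothetical helper: join the non-blank lines of a hostfile with commas.
hosts_csv() {
    grep -v '^[[:space:]]*$' "$1" | paste -sd, -
}

# Print (not run) a launch command that passes the hosts inline.
# Assumption: --host accepts a comma-separated list of addresses, and the
# first host in the file acts as the master node.
print_launch_cmd() {
    local hosts
    hosts="$(hosts_csv "$1")"
    echo "colossalai run --nproc_per_node 8 --host $hosts --master_addr ${hosts%%,*} scripts/train.py configs/opensora-v1-2/train/stage1.py --data-path YOUR_CSV_PATH --ckpt-path YOUR_PRETRAINED_CKPT"
}
```

Calling `print_launch_cmd hostfile` shows the command to review before running it by hand.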