
hostfile configuration in multi-node training

Open zhenbuxianggaimingzi opened this issue 1 year ago • 4 comments

According to the README, if I want to run distributed training on multiple machines, I need to launch it with Colossal-AI like this:

colossalai run --nproc_per_node 8 --hostfile hostfile scripts/train.py \
    configs/opensora-v1-2/train/stage1.py --data-path YOUR_CSV_PATH --ckpt-path YOUR_PRETRAINED_CKPT

Is there any documentation on how to write the hostfile?

zhenbuxianggaimingzi avatar Aug 05 '24 06:08 zhenbuxianggaimingzi

A text file with the host IPs, one per line, is enough.

**.***.**.**
**.***.**.**

yjhong89 avatar Aug 05 '24 06:08 yjhong89
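For readers who want to generate the hostfile rather than type it, here is a minimal sketch of the format described above: one host per line. The helper name `render_hostfile` and the `192.0.2.x` addresses are placeholders invented for illustration, not part of Open-Sora or Colossal-AI.

```python
from typing import Iterable


def render_hostfile(hosts: Iterable[str]) -> str:
    """Return hostfile text: one host per line, order kept, duplicates dropped."""
    unique_hosts = []
    for host in hosts:
        host = host.strip()
        # Skip empty entries and hosts that were already listed.
        if host and host not in unique_hosts:
            unique_hosts.append(host)
    if not unique_hosts:
        raise ValueError("at least one host is required")
    return "\n".join(unique_hosts) + "\n"


if __name__ == "__main__":
    # Placeholder addresses from the documentation range; replace with your nodes.
    print(render_hostfile(["192.0.2.10", "192.0.2.11"]), end="")
```

Saving the printed text to a file named hostfile would give something that can be passed to `--hostfile` in the command from the first post.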

Thank you very much! But I get an error when I start training:

/bin/bash: SSH_CONNECTION: readonly variable
/bin/bash: SSH_REMOTE_USER: readonly variable
Error: failed to run torchrun --nproc_per_node=2 --nnodes=2 --node_rank=0 --master_addr=x.x.x.x --master_port=29500 scripts/train.py configs/opensora-v1-2/train/stage3.py --data-path mydata.csv --ckpt-path OpenSora-STDiT-v2-stage3 on x.x.x.x, is localhost: False, exception: No authentication methods available
Error: failed to run torchrun --nproc_per_node=2 --nnodes=2 --node_rank=0 --master_addr=x.x.x.x --master_port=29500 scripts/train.py configs/opensora-v1-2/train/stage3.py --data-path mydata.csv --ckpt-path OpenSora-STDiT-v2-stage3 on x.x.x.x, is localhost: True, exception: Encountered a bad command exit code!

Does this mean I need to enable an SSH service on my training machines to use colossalai?

zhenbuxianggaimingzi avatar Aug 05 '24 07:08 zhenbuxianggaimingzi
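The "No authentication methods available" message suggests the launcher tried to open an SSH session to the remote host and had no key to offer. The following is a rough sketch, assuming the launcher needs non-interactive (key-based) SSH to every host in the hostfile, which the thread does not confirm. It checks each host with the standard OpenSSH client; the function names are hypothetical.

```python
import subprocess
from typing import List


def parse_hostfile(text: str) -> List[str]:
    """Return the non-empty lines of a hostfile, stripped of whitespace."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def build_ssh_check(host: str, timeout_s: int = 5) -> List[str]:
    """Build an ssh command that runs `true` on the host without prompting."""
    # BatchMode=yes makes ssh fail instead of asking for a password,
    # so a success here means key-based login works.
    return [
        "ssh",
        "-o", "BatchMode=yes",
        "-o", f"ConnectTimeout={timeout_s}",
        host,
        "true",
    ]


def can_login_without_password(host: str) -> bool:
    """Return True if the ssh check exits with status 0."""
    try:
        result = subprocess.run(build_ssh_check(host), capture_output=True)
    except FileNotFoundError:
        # The ssh client is not installed on this machine.
        return False
    return result.returncode == 0
```

Calling `can_login_without_password` for each entry returned by `parse_hostfile` on the launching node shows which hosts still need an authorized key before retrying the training command.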

I used --hostfile, but the program got stuck there forever. --host works well for me.

tyz1994 avatar Aug 05 '24 10:08 tyz1994
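For anyone trying the `--host` route instead, this is a small sketch of how the same host list could be turned into a launch command. It assumes `--host` accepts a comma-separated list of hosts, as in the Colossal-AI launcher documentation; the helper name and addresses are made up for illustration, and a real multi-node run may need further options that this thread does not show.

```python
from typing import List, Sequence


def build_host_launch(
    hosts: Sequence[str],
    nproc_per_node: int,
    script_args: Sequence[str],
) -> List[str]:
    """Build a `colossalai run` argument list that uses --host, not --hostfile."""
    if not hosts:
        raise ValueError("at least one host is required")
    return [
        "colossalai", "run",
        "--nproc_per_node", str(nproc_per_node),
        # Assumption: hosts are passed as a single comma-separated value.
        "--host", ",".join(hosts),
        *script_args,
    ]


if __name__ == "__main__":
    cmd = build_host_launch(
        ["192.0.2.10", "192.0.2.11"],  # placeholder addresses
        8,
        ["scripts/train.py", "configs/opensora-v1-2/train/stage1.py"],
    )
    print(" ".join(cmd))
```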

This issue is stale because it has been open for 7 days with no activity.

github-actions[bot] avatar Aug 15 '24 01:08 github-actions[bot]

This issue was closed because it has been inactive for 7 days since being marked as stale.

github-actions[bot] avatar Aug 22 '24 01:08 github-actions[bot]