ColossalAI
[BUG]:
🐛 Describe the bug
Error: failed to run torchrun --nproc_per_node=4 --nnodes=2 --node_rank=1 --rdzv_backend=c10d --rdzv_endpoint=ip:29501 --rdzv_id=colossalai-default-job train.py --strategy colossalai_zero2 on gpu-1648, is localhost: False, exception: No authentication methods available
I hit this error when training a model across multiple nodes with "colossalai run". I don't know why gpu-1648 is used as the hostname on the master node.
Environment
CUDA 11.3, Python 3.8, PyTorch 1.12
Hi, do you have Slurm or OpenMPI installed on your machines? If so, you may choose to launch from them instead of using torch.distributed. Refer to this code file for details.
@JThh Sorry, I don't have Slurm or OpenMPI installed on my machines. I am testing the demo on two machines (A and B), and they can reach each other over SSH.
Here is the shell script I use in my experiments:
colossalai run --nproc_per_node=4 --master_addr="*.*.*.*" --master_port="****" \
--hostfile=hostfile \
train_dummy.py --strategy colossalai_zero2
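The message "No authentication methods available" reads like an SSH client that has no key or agent credential to offer, so it may be worth confirming that login works non-interactively and not only with a typed password. The following is a minimal sketch of such a check, assuming the OpenSSH `ssh` client is on the PATH; the function names are hypothetical and are not part of ColossalAI.

```python
import subprocess
from typing import List


def ssh_probe_command(host: str, timeout_s: int = 5) -> List[str]:
    """Build an ssh command that fails instead of prompting for a password."""
    return [
        "ssh",
        "-o", "BatchMode=yes",                # never prompt; fail if no key or agent works
        "-o", f"ConnectTimeout={timeout_s}",  # give up quickly on unreachable hosts
        host,
        "true",                               # trivial remote command
    ]


def can_ssh_without_password(host: str, timeout_s: int = 5) -> bool:
    """Return True if a non-interactive SSH login to `host` succeeds."""
    try:
        result = subprocess.run(
            ssh_probe_command(host, timeout_s),
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s + 5,
        )
    except (OSError, subprocess.TimeoutExpired):
        # ssh binary missing, or the connection hung past the timeout
        return False
    return result.returncode == 0
```

Calling `can_ssh_without_password(host)` from the master for every entry in the hostfile should return True before a multi-node launch is attempted. If it returns False while interactive `ssh` works, key-based login is probably not set up for that host.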
I guess it is some issue with colossalai run. Would you please try torchrun directly by referring to this?
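To run torchrun directly, the same command has to be started by hand on every node, changing only `--node_rank`. The sketch below rebuilds the command line that appears in the error message, one per node. The helper name `torchrun_command` is hypothetical, and the default rendezvous id is copied from the error text, not from any documented requirement.

```python
import shlex
from typing import List


def torchrun_command(
    node_rank: int,
    nnodes: int,
    nproc_per_node: int,
    master_addr: str,
    master_port: int,
    script_args: List[str],
    rdzv_id: str = "colossalai-default-job",  # value seen in the error message
) -> str:
    """Return the torchrun command line to run on the node with this rank."""
    parts = [
        "torchrun",
        f"--nproc_per_node={nproc_per_node}",
        f"--nnodes={nnodes}",
        f"--node_rank={node_rank}",
        "--rdzv_backend=c10d",
        f"--rdzv_endpoint={master_addr}:{master_port}",
        f"--rdzv_id={rdzv_id}",
        *script_args,
    ]
    # shlex.join quotes any argument that contains shell-special characters
    return shlex.join(parts)


if __name__ == "__main__":
    # Print one command per node; run each on its own machine.
    for rank in range(2):
        print(torchrun_command(rank, 2, 4, "ip", 29501,
                               ["train.py", "--strategy", "colossalai_zero2"]))
```

Because each command is started locally on its node, this path does not need the launcher to open SSH sessions, which is why it can help separate an SSH problem from a training problem.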
Has this issue been resolved?
Yes, it was solved.
Would you please close it? Thanks!
OK, I will close it later.
@lyzKF Hello, I got the same error as you when I tried to run:
colossalai run --nproc_per_node 8 --host 10.90.5.14,10.90.8.153 --master_addr 10.90.5.14 auto_parallel_with_gpt.py
I got this error:
Error: failed to run torchrun --nproc_per_node=8 --nnodes=2 --node_rank=1 --rdzv_backend=c10d --rdzv_endpoint=10.90.5.14:29500 --rdzv_id=colossalai-default-job auto_parallel_with_gpt.py on 10.90.8.153, is localhost: False, exception: No authentication methods available
Could you please share the solution? Thank you so much!
In my case, node01 could not communicate with node02.
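A quick way to test whether one node can reach another at the network level is to attempt a plain TCP connection to the ports involved. This is only a sketch: the function name `can_reach` is hypothetical, port 22 is assumed to be the SSH port, and a successful connection says nothing about whether authentication will succeed afterwards.

```python
import socket


def can_reach(host: str, port: int, timeout_s: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # create_connection resolves the host and connects; the context
        # manager closes the socket again immediately.
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False
```

For the setup reported in this thread, one would run `can_reach("10.90.8.153", 22)` from the master to check SSH reachability. Checking the rendezvous endpoint with `can_reach("10.90.5.14", 29500)` from the worker is only meaningful while the master's torchrun process is running and listening on that port.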
I have the same problem. I am training on k8s. Could you please share the solution? Thank you so much!
Sorry, our Ops team helped me resolve this issue.