[BUG]:

Open lyzKF opened this issue 1 year ago • 5 comments

🐛 Describe the bug

Error: failed to run torchrun --nproc_per_node=4 --nnodes=2 --node_rank=1 --rdzv_backend=c10d --rdzv_endpoint=ip:29501 --rdzv_id=colossalai-default-job train.py --strategy colossalai_zero2 on gpu-1648, is localhost: False, exception: No authentication methods available

When I train a model on multiple nodes with "colossalai run", I get the error above. I don't know why gpu-1648 is used as the hostname on the master.

Environment

cuda11.3 python==3.8 pytorch==1.12

lyzKF avatar Mar 09 '23 04:03 lyzKF

Bot detected the issue body's language is not English, translate it automatically.


Title: [BUG]:

Issues-translate-bot avatar Mar 09 '23 04:03 Issues-translate-bot

Hi, do you have slurm or openmpi libs installed on your machines? If so, you may choose to launch from them instead of using torch.distributed. Refer to this code file for details.

JThh avatar Mar 10 '23 06:03 JThh

@JThh Sorry, I don't have the "slurm" or "openmpi" libs on my machines. I am testing the demo on 2 machines (A and B), which can access each other with "ssh".

Here is the shell script used in my experiments:

colossalai run --nproc_per_node=4 --master_addr="*.*.*.*" --master_port="****" \
       --hostfile=hostfile \
       train_dummy.py --strategy colossalai_zero2

lyzKF avatar Mar 10 '23 07:03 lyzKF

I guess it is some issue with colossalai run. Would you please try torchrun directly by referring to this?

JThh avatar Mar 10 '23 09:03 JThh
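
For reference, the per-node torchrun command can be read off the error message in the original report. The following is a minimal sketch, not the launcher's actual code: it assumes two nodes with 4 processes each, and `torchrun_cmd` and the master address argument are hypothetical placeholders (the report elides the address as `ip`). It only prints the command, so it can be checked before being run by hand on each node with that node's own rank.

```shell
# Sketch only: print the torchrun command for one node, using the flags
# shown in the error message of the original report.
# $1 = node rank (0 on the master node, 1 on the second node)
# $2 = master node address (placeholder; elided as "ip" in the report)
torchrun_cmd() {
  node_rank="$1"
  master_ip="$2"
  printf 'torchrun --nproc_per_node=4 --nnodes=2 --node_rank=%s --rdzv_backend=c10d --rdzv_endpoint=%s:29501 --rdzv_id=colossalai-default-job train.py --strategy colossalai_zero2\n' \
    "$node_rank" "$master_ip"
}

# Usage (uncomment and substitute the real master address):
# torchrun_cmd 0 "$MASTER_IP"   # run the printed command on the master node
# torchrun_cmd 1 "$MASTER_IP"   # run the printed command on the second node
```

Starting torchrun manually on each node this way does not require the launching machine to log in to the other nodes, which is why it can help separate a launcher problem from a training problem.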

Has this issue been resolved?

JThh avatar Apr 24 '23 15:04 JThh

Yes, it was solved.

lyzKF avatar Apr 26 '23 09:04 lyzKF

Would you please close it? Thanks!

JThh avatar Apr 26 '23 10:04 JThh

OK, I will close it later.

lyzKF avatar Apr 26 '23 10:04 lyzKF

@lyzKF Hello, I get the same error as you when I try to run:

colossalai run --nproc_per_node 8 --host 10.90.5.14,10.90.8.153 --master_addr 10.90.5.14 auto_parallel_with_gpt.py

I got the error: Error: failed to run torchrun --nproc_per_node=8 --nnodes=2 --node_rank=1 --rdzv_backend=c10d --rdzv_endpoint=10.90.5.14:29500 --rdzv_id=colossalai-default-job auto_parallel_with_gpt.py on 10.90.8.153, is localhost: False, exception: No authentication methods available

Could you please share the solution? Thank you so much!

wangbluo avatar Aug 10 '23 13:08 wangbluo

In my case, node01 could not communicate with node02.

lyzKF avatar Aug 11 '23 01:08 lyzKF
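
One way to check for this kind of problem is to test non-interactive SSH from the launching node to every other node before starting the job. The following is a minimal sketch under an assumption: that the launcher needs password-less SSH to reach the remote hosts (the thread only says the machines reach each other with "ssh"). `check_ssh` is a hypothetical helper name, and the addresses in the usage comment are the ones from the command quoted above.

```shell
# Sketch only: return 0 if the host accepts a non-interactive SSH login,
# 1 otherwise. BatchMode=yes makes ssh fail instead of prompting for a
# password, which is closer to what an automated launcher can do.
check_ssh() {
  host="$1"
  if ssh -o BatchMode=yes -o ConnectTimeout=5 "$host" true >/dev/null 2>&1; then
    echo "ok: $host"
  else
    echo "FAILED: $host (no non-interactive SSH access)"
    return 1
  fi
}

# Usage (uncomment on the launching node):
# for h in 10.90.5.14 10.90.8.153; do check_ssh "$h"; done
```

If a host fails this check, setting up key-based login to it (or fixing the network path between the nodes) is a reasonable first thing to try before rerunning "colossalai run".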

I have the same question. I am training on k8s. Could you please share the solution? Thank you so much!

1099692150 avatar May 29 '24 11:05 1099692150

Sorry, our Ops team helped me with this issue.

lyzKF avatar May 30 '24 03:05 lyzKF