[BUG]: colossalai multi-node multi-GPU training: the machines cannot communicate with each other, or service

Open Cloopen-ReLiNK opened this issue 1 year ago • 18 comments

🐛 Describe the bug

image

Environment

CUDA = 10.2 Python = 3.8.13 PyTorch = 1.12.1

Cloopen-ReLiNK avatar Mar 01 '23 08:03 Cloopen-ReLiNK

That error is no longer reported, but training still does not run.

Cloopen-ReLiNK avatar Mar 01 '23 11:03 Cloopen-ReLiNK

How did you configure it?

joan126 avatar Mar 02 '23 02:03 joan126

That error is no longer reported, but training still does not run.

How did you solve the socket problem? I ran into it too.

joan126 avatar Mar 03 '23 01:03 joan126

Hi @Cloopen-ReLiNK @joan126, you can use colossalai run instead of torchrun. colossalai run supports launching both single-node and multi-node training.

YuliangLiu0306 avatar Mar 03 '23 02:03 YuliangLiu0306
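For readers following the suggestion to launch with colossalai run, here is a minimal sketch of what a two-node launch command might look like. The hostnames (host1, host2), the GPU count, and the script name train.py are placeholders that do not come from this thread, and the flags are the ones I recall from the ColossalAI launcher documentation, so check them against `colossalai run --help` for your installed version. The sketch only assembles and prints the command so it can be reviewed before running it on the master node.

```shell
# Placeholder values (assumptions, not from the thread): edit for your cluster.
HOSTS="host1,host2"     # comma-separated hostnames of all nodes
MASTER_ADDR="host1"     # the node that coordinates the rendezvous
NPROC_PER_NODE=4        # number of GPUs (processes) per node
TRAIN_SCRIPT="train.py" # your training entry point

# Assemble the launch command as a string so it can be inspected first.
LAUNCH_CMD="colossalai run --nproc_per_node $NPROC_PER_NODE --host $HOSTS --master_addr $MASTER_ADDR $TRAIN_SCRIPT"

# Print the command; run it on the master node once it looks right.
echo "$LAUNCH_CMD"
```

As far as I know, the launcher reaches the other nodes over SSH, so passwordless SSH from the master node to every host in the list is likely a prerequisite.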

I ran into this too. Someone mentioned earlier that you need to configure /etc/hosts.

JingxinLee avatar Mar 03 '23 03:03 JingxinLee

I ran into this too. Someone mentioned earlier that you need to configure /etc/hosts.

Did it work after you configured it? I configured it and it still does not work, whether I use torchrun or colossalai run.

joan126 avatar Mar 03 '23 06:03 joan126

I ran into this too. Someone mentioned earlier that you need to configure /etc/hosts.

Did it work after you configured it? I configured it and it still does not work, whether I use torchrun or colossalai run.

It worked for me after I configured it.

  1. The code uses backend="nccl", so make sure every machine/node has the same NCCL version.
  2. In the /etc/hosts file on the master machine, add the IP and the hostname of the corresponding machine, e.g. 1.2.3.4 dell-Tower-1

Try those two steps first. If it still does not work, also set the following:

export NCCL_DEBUG=INFO
export NCCL_SOCKET_IFNAME='en0'

Note: change en0 to the interface name that ifconfig shows on your machine; mine is enp0s31f6.

JingxinLee avatar Mar 03 '23 07:03 JingxinLee
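The checks described in the comment above can be sketched as a small pre-flight script to run on each node before launching. This is only an illustration under assumptions: the peer name dell-Tower-1 is the example from the comment, the IFNAME variable is a helper introduced here, and the NCCL version query relies on PyTorch being installed in the active Python environment (it prints a fallback message otherwise).

```shell
# Peer hostnames to check (example name from the comment; replace with your own).
PEERS="dell-Tower-1"

# Step 1: report the NCCL version PyTorch was built with; compare it across nodes.
python3 -c "import torch; print('NCCL version:', torch.cuda.nccl.version())" 2>/dev/null \
  || echo "Could not query the NCCL version (is PyTorch installed here?)"

# Step 2: confirm each peer hostname resolves, e.g. through /etc/hosts.
for host in $PEERS; do
  if getent hosts "$host" >/dev/null 2>&1; then
    echo "$host resolves"
  else
    echo "$host does NOT resolve; add '<ip> $host' to /etc/hosts"
  fi
done

# Step 3 (only if needed): list the interfaces, then pin NCCL to one of them.
if [ -d /sys/class/net ]; then
  echo "Available interfaces:"
  ls /sys/class/net
fi

# IFNAME is a helper variable introduced here; en0 is only the placeholder
# from the comment, so override it, e.g. IFNAME=enp0s31f6.
IFNAME="${IFNAME:-en0}"
export NCCL_DEBUG=INFO
export NCCL_SOCKET_IFNAME="$IFNAME"
echo "NCCL_SOCKET_IFNAME=$NCCL_SOCKET_IFNAME"
```

The exports only affect processes started from the same shell, so source the script (or set the variables in the launch environment) on every node rather than running it as a separate process.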

After trying the methods above, the error is unchanged. Is there any other way to solve it?

fearless1007 avatar Apr 17 '23 11:04 fearless1007

After trying the methods above, the error is unchanged. Is there any other way to solve it?

Hi @fearless1007, if you have further questions, please open a new issue and provide details, because the specifics of each person's problem may differ. Thanks.

binmakeswell avatar Apr 27 '23 08:04 binmakeswell