ColossalAI
[BUG]: ColossalAI multi-node, multi-GPU training: the machines cannot communicate with each other, or the service
🐛 Describe the bug
Environment
CUDA = 10.2, Python = 3.8.13, PyTorch = 1.12.1
That error is no longer reported, but it still won't run.
How did you configure it?
How did you solve the socket problem? I ran into it too.
Hi @Cloopen-ReLiNK @joan126, you can use colossalai run instead of torchrun. colossalai run supports launching both single-node and multi-node training.
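As a rough illustration of the suggestion above, here is a small Python sketch that assembles a multi-node colossalai run command line. The host names node1 and node2, the script name train.py, and the helper build_colossalai_run_cmd are hypothetical. The flags (--nproc_per_node, --host, --master_addr, --master_port) are the ones described in the ColossalAI launcher documentation as far as I know, so check them against colossalai run --help for your installed version.

```python
import shlex
from typing import List, Sequence


def build_colossalai_run_cmd(
    script: str,
    hosts: Sequence[str],
    nproc_per_node: int,
    master_addr: str,
    master_port: int = 29500,
) -> List[str]:
    """Assemble the argv for a multi-node `colossalai run` launch.

    This helper is hypothetical; it only builds the command and does not run it.
    """
    if not hosts:
        raise ValueError("at least one host is required")
    if nproc_per_node < 1:
        raise ValueError("nproc_per_node must be at least 1")
    return [
        "colossalai", "run",
        "--nproc_per_node", str(nproc_per_node),  # GPUs (processes) per machine
        "--host", ",".join(hosts),                # comma-separated node list
        "--master_addr", master_addr,             # node that hosts the rendezvous
        "--master_port", str(master_port),
        script,
    ]


if __name__ == "__main__":
    # Hypothetical two-node setup with 4 GPUs each.
    cmd = build_colossalai_run_cmd("train.py", ["node1", "node2"], 4, "node1")
    print(shlex.join(cmd))
```

Every name passed to --host has to be resolvable and reachable from the launching machine, which is why the /etc/hosts and network-interface questions later in this thread matter.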
I ran into this too. Someone mentioned earlier that you need to configure /etc/hosts.
Did it work after you configured it? I configured it too and it still fails, whether I use torchrun or colossalai run.
I configured it and it works for me.
- The code uses backend="nccl", so make sure every machine/node has the same NCCL version.
- In the master machine's /etc/hosts file, add the IP and the hostname of each corresponding machine, e.g. 1.2.3.4 dell-Tower-1
Try those two steps first. If it still doesn't work, also set the following:
export NCCL_DEBUG=INFO
export NCCL_SOCKET_IFNAME='en0'
Note: change en0 to the interface name that ifconfig shows on your machine; mine is enp0s31f6.
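A quick way to sanity-check the steps above before launching is a small preflight script. This is only a sketch under assumptions: the helper names resolve_hosts and nccl_env are made up for illustration, and dell-Tower-1 and enp0s31f6 are just the example values from this comment. It checks whether each hostname resolves (which is what the /etc/hosts entry is for), lists the local network interfaces, and sets the two NCCL variables in the current process.

```python
import os
import socket
from typing import Dict, Iterable, Optional


def resolve_hosts(hostnames: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each hostname to its IPv4 address, or None if it does not resolve."""
    resolved: Dict[str, Optional[str]] = {}
    for name in hostnames:
        try:
            resolved[name] = socket.gethostbyname(name)
        except OSError:  # socket.gaierror is a subclass of OSError
            resolved[name] = None
    return resolved


def nccl_env(ifname: str, debug: bool = True) -> Dict[str, str]:
    """Build the NCCL environment variables mentioned in this thread."""
    env = {"NCCL_SOCKET_IFNAME": ifname}
    if debug:
        env["NCCL_DEBUG"] = "INFO"
    return env


if __name__ == "__main__":
    # Example hostname from this thread; replace with your own nodes.
    for name, ip in resolve_hosts(["dell-Tower-1"]).items():
        status = ip if ip is not None else "NOT RESOLVED (check /etc/hosts)"
        print(f"{name}: {status}")

    # Show local interface names so you can pick the right one.
    print("interfaces:", [name for _, name in socket.if_nameindex()])

    # Example interface from this thread; must be set before NCCL initializes.
    os.environ.update(nccl_env("enp0s31f6"))
```

The variables only affect processes that inherit them, so they need to be set on every node before the process group is initialized. Exporting them in the shell, as in the comment above, has the same effect.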
I tried the methods above and the error is unchanged. Is there any other way to solve this?
Hi @fearless1007, if you have further questions, please open a new issue and provide details, because the specifics of each setup may differ. Thanks.