
[BUG]: Multi-machine parallel problem

Open Cloopen-ReLiNK opened this issue 2 years ago • 5 comments

🐛 Describe the bug

(ColossalAI) root@VM-48-4-centos:/ytx-data/apps/algserver/llm/ColossalAI/examples/language/gpt/gemini# colossalai run --host VM-48-4-centos,VM-48-13-centos --master_addr=VM-48-4-centos --master_port=23456 --nproc_per_node 2 train_multi_node.py

Error: failed to run torchrun --nproc_per_node=2 --nnodes=2 --node_rank=1 --rdzv_backend=c10d --rdzv_endpoint=VM-48-4-centos:23456 --rdzv_id=colossalai-default-job train_multi_node.py on VM-48-13-centos, is localhost: False, exception: No authentication methods available

WARNING:torch.distributed.run:
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.


[19213] Initializing process group with: {'MASTER_ADDR': 'VM-48-13-centos', 'MASTER_PORT': '60734', 'WORLD_SIZE': '4', 'LOCAL_WORLD_SIZE': '2'}
[19214] Initializing process group with: {'MASTER_ADDR': 'VM-48-13-centos', 'MASTER_PORT': '60734', 'WORLD_SIZE': '4', 'LOCAL_WORLD_SIZE': '2'}
[19214] (rank = 3, local_rank = 1) training...
[19213] (rank = 2, local_rank = 0) training...
Traceback (most recent call last):
  File "train_multi_node.py", line 53, in <module>
    run()
  File "train_multi_node.py", line 48, in run
    train()
  File "train_multi_node.py", line 26, in train
    ddp_model = DDP(model, [local_rank])
  File "/root/anaconda3/envs/ColossalAI/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 646, in __init__
    _verify_param_shape_across_processes(self.process_group, parameters)
  File "/root/anaconda3/envs/ColossalAI/lib/python3.8/site-packages/torch/distributed/utils.py", line 89, in _verify_param_shape_across_processes
    return dist._verify_params_across_processes(process_group, tensors, logger)
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1659484657607/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1191, unhandled system error, NCCL version 2.10.3
ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error. It can be also caused by unexpected exit of a remote peer, you can check NCCL warnings for failure reason and see if there is connection closure by a peer.
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 19213 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 19214) of binary: /root/anaconda3/envs/ColossalAI/bin/python
Traceback (most recent call last):
  File "/root/anaconda3/envs/ColossalAI/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==1.12.1', 'console_scripts', 'torchrun')())
  File "/root/anaconda3/envs/ColossalAI/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/root/anaconda3/envs/ColossalAI/lib/python3.8/site-packages/torch/distributed/run.py", line 761, in main
    run(args)
  File "/root/anaconda3/envs/ColossalAI/lib/python3.8/site-packages/torch/distributed/run.py", line 752, in run
    elastic_launch(
  File "/root/anaconda3/envs/ColossalAI/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/anaconda3/envs/ColossalAI/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

train_multi_node.py FAILED

Failures: <NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
  time       : 2023-03-01_11:21:25
  host       : VM-48-4-centos
  rank       : 3 (local_rank: 1)
  exitcode   : 1 (pid: 19214)
  error_file : <N/A>
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Error: failed to run torchrun --nproc_per_node=2 --nnodes=2 --node_rank=0 --rdzv_backend=c10d --rdzv_endpoint=VM-48-4-centos:23456 --rdzv_id=colossalai-default-job train_multi_node.py on VM-48-4-centos, is localhost: True, exception: Encountered a bad command exit code!

Environment

CUDA = 10.2, Python = 3.8.13, PyTorch = 1.12.1
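For what it's worth, the ncclSystemError above is often a host-level networking or interface-selection problem rather than a CUDA/PyTorch version issue. Below is a minimal sketch of NCCL debug settings that usually make the failing call visible; the interface name eth0 is only a placeholder, and the variables must be visible to the training processes on both nodes (if the launcher does not forward them, set them in each node's shell profile or inside train_multi_node.py).

```bash
# Re-run with NCCL debug logging so the failing system call is reported explicitly.
export NCCL_DEBUG=INFO

# Pin NCCL to the network interface that actually connects the two nodes.
# "eth0" is only a placeholder; check the real name with `ip addr`.
export NCCL_SOCKET_IFNAME=eth0

# If there is no working InfiniBand between the nodes, force the socket transport.
export NCCL_IB_DISABLE=1

colossalai run --host VM-48-4-centos,VM-48-13-centos --master_addr=VM-48-4-centos \
    --master_port=23456 --nproc_per_node 2 train_multi_node.py
```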

Cloopen-ReLiNK avatar Mar 01 '23 11:03 Cloopen-ReLiNK

Have you managed to solve this problem? We also hit this error with multi-machine, multi-GPU training, and our internal guess is that it is because passwordless SSH login is not set up. I don't know whether you have solved it; I'd like to ask for your advice.

Vvvvvvsysy avatar Mar 02 '23 03:03 Vvvvvvsysy
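For reference, the exception in the log above ("No authentication methods available") is consistent with this guess: colossalai run connects to the other hosts over SSH, so key-based, non-interactive login from the launcher node to every worker node has to work. A minimal sketch, assuming root is the user on both machines as in the log:

```bash
# On the launcher node (VM-48-4-centos): generate a key pair if one does not exist yet.
ssh-keygen -t rsa -b 4096 -N "" -f ~/.ssh/id_rsa

# Install the public key on the worker node so key-based login works.
ssh-copy-id root@VM-48-13-centos

# Verify that a non-interactive login succeeds; this is what the launcher needs.
ssh -o BatchMode=yes root@VM-48-13-centos hostname
```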

Which tutorial did you follow?

chingfeng2021 avatar Mar 02 '23 10:03 chingfeng2021

This issue could be merged with #2958

YuliangLiu0306 avatar Mar 03 '23 09:03 YuliangLiu0306

@Cloopen-ReLiNK Hi, how did you end up solving this? I've run into the same problem.

lyzKF avatar Mar 09 '23 04:03 lyzKF

How can this problem be solved? Has the original poster solved it yet?

sc-lj avatar Apr 12 '23 05:04 sc-lj

Hi @Vvvvvvsysy @chingfeng2021 @sc-lj @lyzKF Can you each open a new issue describing the specific details of the errors encountered to help reproduce them? Similar errors may be caused by completely different reasons. Thanks.

We have tested multiple servers and GPUs in several different environments with no problems.

binmakeswell avatar Apr 18 '23 07:04 binmakeswell

Hello, I've run into the same problem. I'd like to ask what the prerequisites are for a multi-node launch. For example, with host1 and host2, do both sides need the same training script prepared in the same directory, or can the target machine's GPUs be used directly without any preparation?

AntyRia avatar Aug 11 '23 01:08 AntyRia
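On the question above: a sketch under the assumption that, as with torchrun, every node executes the training script locally, so the code and the Python environment must exist at the same paths on each host. The paths come from the log at the top of this issue, and rsync is just one way to keep them in sync.

```bash
# Copy the project to the same path on the worker node (paths taken from the log above).
rsync -a /ytx-data/apps/algserver/llm/ColossalAI/ \
    root@VM-48-13-centos:/ytx-data/apps/algserver/llm/ColossalAI/

# Confirm the training script is present at the same location on the worker node.
ssh root@VM-48-13-centos \
    "ls /ytx-data/apps/algserver/llm/ColossalAI/examples/language/gpt/gemini/train_multi_node.py"

# Confirm the same conda environment exists there and can import the required packages.
ssh root@VM-48-13-centos \
    "/root/anaconda3/envs/ColossalAI/bin/python -c 'import torch, colossalai'"
```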

In my case it was because the nodes could not reach each other over the network.

lyzKF avatar Aug 11 '23 01:08 lyzKF
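On the networking point in the last comment, a small sketch of checks (run from the worker node unless noted) that cover the usual causes: hostname resolution, basic reachability, and a firewall blocking the rendezvous port. The host names and port come from the command at the top of the issue.

```bash
# Both hostnames should resolve on every node, and not to 127.0.0.1.
getent hosts VM-48-4-centos VM-48-13-centos

# Basic reachability from the worker node to the master node.
ping -c 3 VM-48-4-centos

# While the job is starting on the master, the rendezvous port must be reachable
# from the worker node; firewalld on CentOS can block it by default.
nc -zv VM-48-4-centos 23456

# Check whether a firewall is running on either node.
systemctl status firewalld
```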