ColossalAI
[BUG]: Multi-machine parallel training problem
🐛 Describe the bug
(ColossalAI) root@VM-48-4-centos:/ytx-data/apps/algserver/llm/ColossalAI/examples/language/gpt/gemini# colossalai run --host VM-48-4-centos,VM-48-13-centos --master_addr=VM-48-4-centos --master_port=23456 --nproc_per_node 2 train_multi_node.py
Error: failed to run torchrun --nproc_per_node=2 --nnodes=2 --node_rank=1 --rdzv_backend=c10d --rdzv_endpoint=VM-48-4-centos:23456 --rdzv_id=colossalai-default-job train_multi_node.py on VM-48-13-centos, is localhost: False, exception: No authentication methods available
WARNING:torch.distributed.run:
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[19213] Initializing process group with: {'MASTER_ADDR': 'VM-48-13-centos', 'MASTER_PORT': '60734', 'WORLD_SIZE': '4', 'LOCAL_WORLD_SIZE': '2'}
[19214] Initializing process group with: {'MASTER_ADDR': 'VM-48-13-centos', 'MASTER_PORT': '60734', 'WORLD_SIZE': '4', 'LOCAL_WORLD_SIZE': '2'}
[19214] (rank = 3, local_rank = 1) training...
[19213] (rank = 2, local_rank = 0) training...
Traceback (most recent call last):
File "train_multi_node.py", line 53, in
run()
File "train_multi_node.py", line 48, in run
train()
File "train_multi_node.py", line 26, in train
ddp_model = DDP(model, [local_rank])
File "/root/anaconda3/envs/ColossalAI/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 646, in init
_verify_param_shape_across_processes(self.process_group, parameters)
File "/root/anaconda3/envs/ColossalAI/lib/python3.8/site-packages/torch/distributed/utils.py", line 89, in _verify_param_shape_across_processes
return dist._verify_params_across_processes(process_group, tensors, logger)
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1659484657607/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1191, unhandled system error, NCCL version 2.10.3
ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error. It can be also caused by unexpected exit of a remote peer, you can check NCCL warnings for failure reason and see if there is connection closure by a peer.
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 19213 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 19214) of binary: /root/anaconda3/envs/ColossalAI/bin/python
Traceback (most recent call last):
File "/root/anaconda3/envs/ColossalAI/bin/torchrun", line 33, in
sys.exit(load_entry_point('torch==1.12.1', 'console_scripts', 'torchrun')())
File "/root/anaconda3/envs/ColossalAI/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 345, in wrapper
return f(*args, **kwargs)
File "/root/anaconda3/envs/ColossalAI/lib/python3.8/site-packages/torch/distributed/run.py", line 761, in main
run(args)
File "/root/anaconda3/envs/ColossalAI/lib/python3.8/site-packages/torch/distributed/run.py", line 752, in run
elastic_launch(
File "/root/anaconda3/envs/ColossalAI/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/root/anaconda3/envs/ColossalAI/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
train_multi_node.py FAILED
Failures:
  <NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
  time       : 2023-03-01_11:21:25
  host       : VM-48-4-centos
  rank       : 3 (local_rank: 1)
  exitcode   : 1 (pid: 19214)
  error_file : <N/A>
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
Error: failed to run torchrun --nproc_per_node=2 --nnodes=2 --node_rank=0 --rdzv_backend=c10d --rdzv_endpoint=VM-48-4-centos:23456 --rdzv_id=colossalai-default-job train_multi_node.py on VM-48-4-centos, is localhost: True, exception: Encountered a bad command exit code!
Environment
CUDA = 10.2, Python = 3.8.13, PyTorch = 1.12.1
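The ncclSystemError in the log is often a host-to-host networking problem rather than a GPU problem. A minimal debugging sketch, assuming the two hosts share a network interface named eth0 and that falling back to plain TCP sockets is acceptable (both the interface name and the InfiniBand opt-out are assumptions, adjust them to your cluster):

    # Print detailed NCCL logs so the failing socket or interface shows up in the output.
    export NCCL_DEBUG=INFO
    export NCCL_DEBUG_SUBSYS=INIT,NET
    # Force NCCL onto the interface that both hosts can actually reach (assumed: eth0).
    export NCCL_SOCKET_IFNAME=eth0
    # Skip InfiniBand if the nodes have no working IB fabric (assumption).
    export NCCL_IB_DISABLE=1
    colossalai run --host VM-48-4-centos,VM-48-13-centos --master_addr=VM-48-4-centos --master_port=23456 --nproc_per_node 2 train_multi_node.py

Note that these variables may also need to be set on the remote node (for example in its shell profile) if the launcher does not forward the local environment.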
Have you managed to solve this problem? We hit the same error with multiple machines and multiple GPUs. Our internal guess is that it happens because passwordless SSH login was not set up. I'd like to ask whether you have resolved it.
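For what it's worth, the "No authentication methods available" exception in the log above appears to come from the SSH connection the launcher opens to each entry in --host, so the launching node does need non-interactive (key-based) SSH access to every listed host. A minimal sketch of setting that up, reusing the root account and hostnames from the log (adjust to your own users and hosts):

    # On the launching node: generate a key pair if one does not exist yet.
    ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
    # Install the public key on every host listed in --host, including the local one.
    ssh-copy-id root@VM-48-4-centos
    ssh-copy-id root@VM-48-13-centos
    # Verify that the remote host can be reached without a password prompt.
    ssh root@VM-48-13-centos hostname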
Which tutorial did you follow?
This issue could be merged with #2958
@Cloopen-ReLiNK Hi, how did you end up solving this on your side? I ran into the same problem.
How can this problem be solved? Has the original poster resolved it yet?
Hi @Vvvvvvsysy @chingfeng2021 @sc-lj @lyzKF Can you each open a new issue describing the specific details of the errors encountered to help reproduce them? Similar errors may be caused by completely different reasons. Thanks.
We have tested multiple servers and GPUs in several different environments with no problems.
Hi, I ran into the same problem. I'd like to ask what the prerequisites for a multi-node launch are. For example, with host1 and host2, do both machines need the same training script prepared in the same directory, or can the target machine's GPUs be used directly without any preparation?
In my case it was because the network between the nodes was not reachable.
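On the prerequisites question above: since the launcher runs torchrun on each host over SSH (as the log shows), each node generally needs the same Python environment and the same training script available at the same path, and every node must be able to reach the master's rendezvous port. A rough connectivity check, reusing the hostnames and --master_port from the original log (replace them with your own):

    # Basic reachability between the nodes.
    ping -c 3 VM-48-4-centos
    # The rendezvous port passed via --master_port must be open on the master node.
    nc -zv VM-48-4-centos 23456
    # Passwordless SSH from the launching node must also work (see the sketch above).
    ssh root@VM-48-4-centos hostname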