DeepSpeedExamples
How to train DeepSpeed-Chat using NCCL with multiple nodes?
My GPU machines do not have OpenMPI or any other launcher installed. I want to use plain torch.distributed to train across multiple nodes, but I always get an error like this:
[2023-04-27 10:29:22,235] [INFO] [launch.py:222:main] 0 NCCL_DEBUG=INFO
[2023-04-27 10:29:22,235] [INFO] [launch.py:222:main] 0 NCCL_SOCKET_IFNAME=eth1
[2023-04-27 10:29:22,235] [INFO] [launch.py:222:main] 0 NCCL_IB_GID_INDEX=3
[2023-04-27 10:29:22,235] [INFO] [launch.py:222:main] 0 NCCL_IB_SL=3
[2023-04-27 10:29:22,236] [INFO] [launch.py:222:main] 0 NCCL_P2P_DISABLE=0
[2023-04-27 10:29:22,236] [INFO] [launch.py:222:main] 0 NCCL_IB_HCA=mlx5_2:1,mlx5_2:1
[2023-04-27 10:29:22,236] [INFO] [launch.py:222:main] 0 NCCL_LL_THRESHOLD=16384
[2023-04-27 10:29:22,236] [INFO] [launch.py:222:main] 0 NCCL_CHECK_DISABLE=1
[2023-04-27 10:29:22,236] [INFO] [launch.py:222:main] 0 NCCL_IB_CUDA_SUPPORT=1
[2023-04-27 10:29:22,236] [INFO] [launch.py:229:main] WORLD INFO DICT: {'11.214.114.64': [0, 1, 2, 3], '11.214.129.163': [0, 1, 2, 3]}
[2023-04-27 10:29:22,236] [INFO] [launch.py:235:main] nnodes=2, num_local_procs=4, node_rank=0
[2023-04-27 10:29:22,236] [INFO] [launch.py:246:main] global_rank_mapping=defaultdict(<class 'list'>, {'11.214.114.64': [0, 1, 2, 3], '11.214.129.163': [4, 5, 6, 7]})
[2023-04-27 10:29:22,236] [INFO] [launch.py:247:main] dist_world_size=8
[2023-04-27 10:29:22,236] [INFO] [launch.py:249:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3
[2023-04-27 10:29:25,898] [INFO] [comm.py:586:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
mmnewyardnodesz52663:93:93 [0] NCCL INFO Bootstrap : Using eth1:11.214.114.64<0>
mmnewyardnodesz52663:93:93 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
mmnewyardnodesz52663:93:93 [0] NCCL INFO NET/IB : Using [0]mlx5_2:1/RoCE [RO]; OOB eth1:11.214.114.64<0>
mmnewyardnodesz52663:93:93 [0] NCCL INFO Using network IB
NCCL version 2.12.12+cuda11.3
mmnewyardnodesz52663:96:96 [3] NCCL INFO Bootstrap : Using eth1:11.214.114.64<0>
mmnewyardnodesz52663:96:96 [3] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
mmnewyardnodesz52663:95:95 [2] NCCL INFO Bootstrap : Using eth1:11.214.114.64<0>
mmnewyardnodesz52663:95:95 [2] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
mmnewyardnodesz52663:94:94 [1] NCCL INFO Bootstrap : Using eth1:11.214.114.64<0>
mmnewyardnodesz52663:94:94 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
mmnewyardnodesz52663:95:95 [2] NCCL INFO NET/IB : Using [0]mlx5_2:1/RoCE [RO]; OOB eth1:11.214.114.64<0>
mmnewyardnodesz52663:95:95 [2] NCCL INFO Using network IB
mmnewyardnodesz52663:96:96 [3] NCCL INFO NET/IB : Using [0]mlx5_2:1/RoCE [RO]; OOB eth1:11.214.114.64<0>
mmnewyardnodesz52663:96:96 [3] NCCL INFO Using network IB
mmnewyardnodesz52663:94:94 [1] NCCL INFO NET/IB : Using [0]mlx5_2:1/RoCE [RO]; OOB eth1:11.214.114.64<0>
mmnewyardnodesz52663:94:94 [1] NCCL INFO Using network IB
mmnewyardnodesz52663:96:249 [3] NCCL INFO Setting affinity for GPU 3 to ff,ffff0000,00ffffff
mmnewyardnodesz52663:93:244 [0] NCCL INFO Setting affinity for GPU 0 to ff,ffff0000,00ffffff
mmnewyardnodesz52663:94:250 [1] NCCL INFO Setting affinity for GPU 1 to ff,ffff0000,00ffffff
mmnewyardnodesz52663:95:248 [2] NCCL INFO Setting affinity for GPU 2 to ff,ffff0000,00ffffff
mmnewyardnodesz52663:95:248 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1
mmnewyardnodesz52663:96:249 [3] NCCL INFO Trees [0] -1/-1/-1->3->2 [1] -1/-1/-1->3->2
mmnewyardnodesz52663:94:250 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0
mmnewyardnodesz52663:93:244 [0] NCCL INFO Channel 00/02 : 0 1 2 3 4 5 6 7
mmnewyardnodesz52663:93:244 [0] NCCL INFO Channel 01/02 : 0 1 2 3 4 5 6 7
mmnewyardnodesz52663:93:244 [0] NCCL INFO Trees [0] 1/4/-1->0->-1 [1] 1/-1/-1->0->4
mmnewyardnodesz52663:95:248 [2] NCCL INFO Channel 00 : 2[3d000] -> 3[3e000] via P2P/IPC
mmnewyardnodesz52663:94:250 [1] NCCL INFO Channel 00 : 1[1b000] -> 2[3d000] via P2P/IPC
mmnewyardnodesz52663:95:248 [2] NCCL INFO Channel 01 : 2[3d000] -> 3[3e000] via P2P/IPC
mmnewyardnodesz52663:94:250 [1] NCCL INFO Channel 01 : 1[1b000] -> 2[3d000] via P2P/IPC
mmnewyardnodesz52663:93:244 [0] NCCL INFO Channel 00/0 : 7[3e000] -> 0[1a000] [receive] via NET/IB/0
mmnewyardnodesz52663:93:244 [0] NCCL INFO Channel 01/0 : 7[3e000] -> 0[1a000] [receive] via NET/IB/0
mmnewyardnodesz52663:93:244 [0] NCCL INFO Channel 00 : 0[1a000] -> 1[1b000] via P2P/IPC
mmnewyardnodesz52663:96:249 [3] NCCL INFO Channel 00/0 : 3[3e000] -> 4[1a000] [send] via NET/IB/0
mmnewyardnodesz52663:93:244 [0] NCCL INFO Channel 01 : 0[1a000] -> 1[1b000] via P2P/IPC
mmnewyardnodesz52663:94:250 [1] NCCL INFO Connected all rings
mmnewyardnodesz52663:94:250 [1] NCCL INFO Channel 00 : 1[1b000] -> 0[1a000] via P2P/IPC
mmnewyardnodesz52663:94:250 [1] NCCL INFO Channel 01 : 1[1b000] -> 0[1a000] via P2P/IPC
mmnewyardnodesz52663:96:249 [3] NCCL INFO Channel 01/0 : 3[3e000] -> 4[1a000] [send] via NET/IB/0
mmnewyardnodesz52663:96:251 [3] NCCL INFO NCCL_IB_GID_INDEX set by environment to 3.
mmnewyardnodesz52663:95:248 [2] NCCL INFO Connected all rings
mmnewyardnodesz52663:93:252 [0] NCCL INFO NCCL_IB_GID_INDEX set by environment to 3.
mmnewyardnodesz52663:95:248 [2] NCCL INFO Channel 00 : 2[3d000] -> 1[1b000] via P2P/IPC
mmnewyardnodesz52663:93:252 [0] NCCL INFO NCCL_IB_SL set by environment to 3.
mmnewyardnodesz52663:93:252 [0] misc/ibvwrap.cc:302 NCCL WARN Call to ibv_modify_qp failed with error Invalid argument
mmnewyardnodesz52663:93:252 [0] NCCL INFO transport/net_ib.cc:525 -> 2
mmnewyardnodesz52663:93:252 [0] NCCL INFO transport/net_ib.cc:703 -> 2
mmnewyardnodesz52663:93:252 [0] NCCL INFO include/net.h:26 -> 2
mmnewyardnodesz52663:93:252 [0] NCCL INFO transport/net.cc:631 -> 2
mmnewyardnodesz52663:93:252 [0] NCCL INFO proxy.cc:914 -> 2
mmnewyardnodesz52663:93:252 [0] NCCL INFO proxy.cc:942 -> 2
mmnewyardnodesz52663:93:252 [0] proxy.cc:1040 NCCL WARN [Proxy Service 0] Failed to execute operation Connect from rank 0, retcode 2
mmnewyardnodesz52663:95:248 [2] NCCL INFO Channel 01 : 2[3d000] -> 1[1b000] via P2P/IPC
mmnewyardnodesz52663:93:244 [0] misc/socket.cc:523 NCCL WARN Net : Connection closed by remote peer mmnewyardnodesz52663_in<55451>
mmnewyardnodesz52663:93:244 [0] NCCL INFO misc/socket.cc:531 -> 2
mmnewyardnodesz52663:93:244 [0] NCCL INFO misc/socket.cc:543 -> 2
mmnewyardnodesz52663:93:244 [0] NCCL INFO proxy.cc:805 -> 2
mmnewyardnodesz52663:93:244 [0] proxy.cc:808 NCCL WARN Proxy Call to rank 0 failed (Connect)
mmnewyardnodesz52663:93:244 [0] NCCL INFO transport/net.cc:319 -> 2
mmnewyardnodesz52663:93:244 [0] NCCL INFO transport.cc:137 -> 2
mmnewyardnodesz52663:93:244 [0] NCCL INFO init.cc:730 -> 2
mmnewyardnodesz52663:93:244 [0] NCCL INFO init.cc:914 -> 2
mmnewyardnodesz52663:93:244 [0] NCCL INFO group.cc:58 -> 2 [Async thread]
mmnewyardnodesz52663:93:252 [0] misc/ibvwrap.cc:302 NCCL WARN Call to ibv_modify_qp failed with error Invalid argument
mmnewyardnodesz52663:93:252 [0] NCCL INFO transport/net_ib.cc:525 -> 2
mmnewyardnodesz52663:93:252 [0] NCCL INFO transport/net_ib.cc:703 -> 2
mmnewyardnodesz52663:93:252 [0] NCCL INFO include/net.h:26 -> 2
mmnewyardnodesz52663:93:252 [0] NCCL INFO transport/net.cc:631 -> 2
mmnewyardnodesz52663:93:252 [0] NCCL INFO proxy.cc:914 -> 2
mmnewyardnodesz52663:93:252 [0] proxy.cc:1040 NCCL WARN [Proxy Service 0] Failed to execute operation Connect from rank 0, retcode 2
Traceback (most recent call last):
File "/hf/DeepSpeedExamples/applications/DeepSpeedChat/training/step1_supervised_finetuning/training_scripts/multi_node/../../main.py", line 344, in <module>
main()
File "/hf/DeepSpeedExamples/applications/DeepSpeedChat/training/step1_supervised_finetuning/training_scripts/multi_node/../../main.py", line 203, in main
torch.distributed.barrier()
File "/usr/local/python/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 2776, in barrier
work = default_pg.barrier(opts=opts)
RuntimeError: NCCL error in: /root/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1169, unhandled system error, NCCL version 21.2.12
ncclSystemError: System call (socket, malloc, munmap, etc) failed.
mmnewyardnodesz52663:96:249 [3] NCCL INFO Connected all rings
mmnewyardnodesz52663:96:249 [3] NCCL INFO Channel 00 : 3[3e000] -> 2[3d000] via P2P/IPC
mmnewyardnodesz52663:96:249 [3] NCCL INFO Channel 01 : 3[3e000] -> 2[3d000] via P2P/IPC
mmnewyardnodesz52663:96:249 [3] NCCL INFO Connected all trees
mmnewyardnodesz52663:96:249 [3] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 8/8/512
mmnewyardnodesz52663:96:249 [3] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
mmnewyardnodesz52663:95:248 [2] NCCL INFO Connected all trees
mmnewyardnodesz52663:95:248 [2] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 8/8/512
mmnewyardnodesz52663:95:248 [2] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
mmnewyardnodesz52663:96:249 [3] misc/socket.cc:503 NCCL WARN Net : Call to recv from 11.214.114.64<45613> failed : Connection refused
mmnewyardnodesz52663:96:249 [3] NCCL INFO misc/socket.cc:520 -> 2
mmnewyardnodesz52663:96:249 [3] NCCL INFO misc/socket.cc:531 -> 2
mmnewyardnodesz52663:96:249 [3] NCCL INFO misc/socket.cc:537 -> 2
mmnewyardnodesz52663:96:249 [3] NCCL INFO bootstrap.cc:60 -> 2
mmnewyardnodesz52663:96:249 [3] NCCL INFO bootstrap.cc:321 -> 2
mmnewyardnodesz52663:96:249 [3] NCCL INFO bootstrap.cc:341 -> 2
mmnewyardnodesz52663:96:249 [3] NCCL INFO init.cc:885 -> 2
mmnewyardnodesz52663:96:249 [3] NCCL INFO init.cc:914 -> 2
mmnewyardnodesz52663:96:249 [3] NCCL INFO group.cc:58 -> 2 [Async thread]
Traceback (most recent call last):
File "/hf/DeepSpeedExamples/applications/DeepSpeedChat/training/step1_supervised_finetuning/training_scripts/multi_node/../../main.py", line 344, in <module>
main()
File "/hf/DeepSpeedExamples/applications/DeepSpeedChat/training/step1_supervised_finetuning/training_scripts/multi_node/../../main.py", line 203, in main
torch.distributed.barrier()
File "/usr/local/python/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 2776, in barrier
work = default_pg.barrier(opts=opts)
RuntimeError: NCCL error in: /root/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1169, unhandled system error, NCCL version 21.2.12
ncclSystemError: System call (socket, malloc, munmap, etc) failed.
[2023-04-27 10:33:12,457] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 93
[2023-04-27 10:33:12,458] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 94
[2023-04-27 10:33:12,674] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 95
[2023-04-27 10:33:12,967] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 96
RuntimeError: NCCL error in: /root/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1169, unhandled system error, NCCL version 21.2.12
I have forced DeepSpeed to use deepspeed.launcher.launch instead of other launchers such as OpenMPI, but I still get this error.
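The `ibv_modify_qp failed with error Invalid argument` warning in the log usually points at a RoCE/InfiniBand configuration mismatch rather than at DeepSpeed itself (for example, a `NCCL_IB_GID_INDEX` that is wrong for the adapter, or the same HCA listed twice as in `NCCL_IB_HCA=mlx5_2:1,mlx5_2:1` above). A quick way to confirm this is to force NCCL onto plain TCP sockets first; this is a debugging sketch, and the interface name `eth1` is taken from the log and may differ on your machines:

```shell
# Sketch: rule out the RoCE/IB path by falling back to TCP sockets.
# If training then starts, the problem is in the IB/RoCE settings,
# not in DeepSpeed or torch.distributed.
export NCCL_IB_DISABLE=1          # skip the InfiniBand/RoCE transport entirely
export NCCL_SOCKET_IFNAME=eth1    # NIC carrying the 11.214.x.x addresses (from the log)
export NCCL_DEBUG=INFO            # keep verbose NCCL logs while debugging

# If TCP works, re-enable IB and fix the IB settings instead, e.g.:
#   export NCCL_IB_DISABLE=0
#   export NCCL_IB_HCA=mlx5_2:1   # list each HCA only once
# and use show_gids (Mellanox tools) to pick a valid NCCL_IB_GID_INDEX.
```

If the TCP run succeeds, the fix belongs in the IB/RoCE environment (GID index, HCA list), not in the training script.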
Launch it directly with the deepspeed CLI.
How does deepspeed know which nodes to use? Do I need to prepare a file with a list of IPs?
@LarryZhangy By default, deepspeed reads /job/hostfile, with one node per line in the format `$worker slots=${slots}`.
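For example, a hostfile matching the two 4-GPU nodes in the log above might look like this (the IPs and slot counts are placeholders for your cluster, and this sketch writes to a local `hostfile` rather than the default /job/hostfile path):

```shell
# A hostfile the deepspeed CLI can consume (its default path is /job/hostfile).
# One line per node: "<hostname-or-ip> slots=<num_gpus>". Values below mirror
# the two 4-GPU nodes from the log and are placeholders.
cat > hostfile <<'EOF'
11.214.114.64 slots=4
11.214.129.163 slots=4
EOF

# Then launch from any one node. Without OpenMPI, deepspeed falls back to its
# default pdsh launcher, which only needs passwordless SSH between the nodes:
#   deepspeed --hostfile=hostfile main.py <training args...>
```

Note that although no OpenMPI is required, the default multi-node path still needs pdsh and passwordless SSH from the launching node to every worker.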
I have the same problem, does anyone know how to solve this?