oneflow
Disable IB when there are no active IB devices
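When any node is detected to have no active IB port, disable IB to avoid errors. This needs to be tested:
- [ ] If any node has no active port and the user calls init_rdma, warn the user and ignore init_rdma
- [ ] Otherwise, run init_rdma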
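The two checklist items above amount to a cluster-wide "all nodes or nothing" decision. A minimal Python sketch of that rule, assuming each node has already reported whether it has an active IB port; the function names and the boolean-per-node input are hypothetical and are not OneFlow's actual API:

```python
from typing import Sequence


def should_init_rdma(node_has_active_ib_port: Sequence[bool]) -> bool:
    """Return True only if every node reports at least one active IB port."""
    # An empty report is treated as "RDMA unavailable" (an assumption of this sketch).
    return len(node_has_active_ib_port) > 0 and all(node_has_active_ib_port)


def handle_init_rdma_request(node_has_active_ib_port: Sequence[bool]) -> str:
    """Decide what happens when the user calls init_rdma (illustrative only)."""
    if should_init_rdma(node_has_active_ib_port):
        return "init_rdma"
    # Same wording as the warning that appears in the logs of this PR.
    return "Skip init RDMA because RDMA is unavailable!"


print(handle_init_rdma_request([True, True]))
print(handle_init_rdma_request([True, False]))
```

A single node without an active port is enough to skip RDMA for the whole cluster, which matches the intent stated in the description.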
oneflow16+oneflow15
- [x] Single machine on oneflow16, normal case (IB service not stopped), init rdma enabled:
  state: PORT_ACTIVE (4)
- [x] IB service stopped on oneflow16 with /etc/init.d/openibd stop, user enables init rdma:
  W20220920 10:44:56.579414 1919887 env_global_objects_scope.cpp:279] Skip init RDMA because RDMA is unavailable!
  W20220920 10:44:56.632112 1919886 env_global_objects_scope.cpp:279] Skip init RDMA because RDMA is unavailable!
  W20220920 10:44:56.632431 1919885 env_global_objects_scope.cpp:279] Skip init RDMA because RDMA is unavailable!
- [x] Two machines (oneflow16 + oneflow15), normal case (IB service not stopped), user enables init rdma (2 machines, 4 GPUs in total)
- [x] Two machines (oneflow16 + oneflow15), IB service stopped on one of them (oneflow16), user enables init rdma; tested twice, switching the master node IP address (2 machines, 4 GPUs in total)
  - [x] With oneflow15 as the master node:
    W20220920 10:58:48.086185 3492934 env_global_objects_scope.cpp:279] Skip init RDMA because RDMA is unavailable!
    W20220920 10:58:48.113924 3492933 env_global_objects_scope.cpp:279] Skip init RDMA because RDMA is unavailable!
    The master node then reports an error:
    F20220920 10:58:52.682257 3493300 eager_nccl_comm_manager.cpp:77] Check failed: ncclCommInitRank(comm, device_vec.size(), nccl_unique_id, rank) : unhandled system error (2). To see more detail, please run OneFlow with system variable NCCL_DEBUG=INFO
    *** Check failure stack trace: ***
    @ 0x7f59eab050fa google::LogMessage::Fail()
    @ 0x7f59eab053e2 google::LogMessage::SendToLog()
    @ 0x7f59eab04c67 google::LogMessage::Flush()
    @ 0x7f59eab077d9 google::LogMessageFatal::~LogMessageFatal()
    @ 0x7f59e2e408bd oneflow::(anonymous namespace)::CreateNcclComm()
    @ 0x7f59e2e42541 oneflow::EagerNcclCommMgr::GetCommForDevice()
    @ 0x7f59e419551b oneflow::ccl::CudaCommunicationContext::Init()
    @ 0x7f59e4452c2a oneflow::(anonymous namespace)::EagerCclOpKernelCache::Init()
    @ 0x7f59e4455e37 oneflow::EagerCclBroadcastKernel::InitOpKernelCacheWithFlags()
    @ 0x7f59e5687f60 oneflow::one::StatefulOpKernel::TryInitOpKernelStateAndCache()
    @ 0x7f59e3e13f7d oneflow::vm::OpCallInstructionPolicy::Compute()
    @ 0x7f59e3e11c1f oneflow::vm::EpStreamPolicyBase::Run()
    @ 0x7f59e3e19f6a oneflow::vm::StreamPolicy::RunIf()
    @ 0x7f59e3e2110e oneflow::vm::ThreadCtx::TryReceiveAndRun()
    @ 0x7f59e3e23680 oneflow::(anonymous namespace)::WorkerLoop()
    @ 0x7f59e3e23a4f _ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJPFvPN7oneflow2vm9ThreadCtxERKSt8functionIFvS6_EEES6_ZNS3_14VirtualMachine15CreateThreadCtxENS3_6SymbolINS3_6DeviceEEENS3_10StreamTypeEmEUlS6_E3_EEEEE6_M_runEv
    @ 0x7f59eab19b3f execute_native_thread_routine
    @ 0x7f5abaca0609 start_thread
    @ 0x7f5ababc5133 clone
  - [x] With oneflow16 as the master node:
    The same error as above is reported; it appears on the node whose IB service is still running.
- [x] IB stopped on both oneflow15 and oneflow16, user enables init rdma:
  W20220920 11:03:06.144380 3501785 env_global_objects_scope.cpp:279] Skip init RDMA because RDMA is unavailable!
  W20220920 11:03:06.160089 3501786 env_global_objects_scope.cpp:279] Skip init RDMA because RDMA is unavailable!
  F20220920 11:03:10.630368 3502308 eager_nccl_comm_manager.cpp:77] Check failed: ncclCommInitRank(comm, device_vec.size(), nccl_unique_id, rank) : unhandled system error (2). To see more detail, please run OneFlow with system variable NCCL_DEBUG=INFO
  The error appears on both machines at the same time.
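Summary
If IB is stopped on one machine or on all machines, init rdma fails with an error.
After IB is stopped with /etc/init.d/openibd stop, the error is reported whether or not the user calls init rdma.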
ibv_devinfo
Failed to get IB devices list: Function not implemented
ibstatus
Fatal error: No devices
/usr/sbin/ibstatus: 21: exit: Illegal number: -1
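The outputs above show what the tools print once the driver is unloaded, while a healthy port prints `state: PORT_ACTIVE (4)`. As a rough sketch, one way to turn `ibv_devinfo` text into a per-node "has an active port" signal is to count the `PORT_ACTIVE` state lines. The helper name and the sample strings below are illustrative assumptions (including the `mlx5_0` device name), not captured output from these machines:

```python
import re

# Matches lines such as "        state: PORT_ACTIVE (4)" in ibv_devinfo-style text.
_PORT_STATE = re.compile(r"^\s*state:\s*(PORT_\w+)", re.MULTILINE)


def active_port_count(ibv_devinfo_output: str) -> int:
    """Count ports reported as PORT_ACTIVE in ibv_devinfo-style text."""
    states = _PORT_STATE.findall(ibv_devinfo_output)
    return sum(1 for state in states if state == "PORT_ACTIVE")


# Illustrative inputs: one healthy port, and the failure message shown above.
healthy = "hca_id: mlx5_0\n    port: 1\n        state: PORT_ACTIVE (4)\n"
stopped = "Failed to get IB devices list: Function not implemented\n"

print(active_port_count(healthy))  # 1
print(active_port_count(stopped))  # 0
```

In practice the text would come from running `ibv_devinfo` on each node; a count of zero corresponds to the "no active port" case in the checklist.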
- The NCCL debug output is as follows:
oneflow-15:3907318:3907604 [1] NCCL INFO Bootstrap : Using eno1:192.168.1.15<0>
oneflow-15:3907317:3907562 [0] NCCL INFO Bootstrap : Using eno1:192.168.1.15<0>
oneflow-15:3907318:3907604 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
oneflow-15:3907317:3907562 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
oneflow-15:3907318:3907604 [1] NCCL INFO NET/IB : No device found.
oneflow-15:3907317:3907562 [0] NCCL INFO NET/IB : No device found.
oneflow-15:3907318:3907604 [1] NCCL INFO NET/Socket : Using [0]eno1:192.168.1.15<0> [1]veth2ea2288:fe80::78d8:4eff:fe15:8fed%veth2ea2288<0> [2]vethae2223a:fe80::b43d:18ff:fee2:b08f%vethae2223a<0>
oneflow-15:3907318:3907604 [1] NCCL INFO Using network Socket
oneflow-15:3907317:3907562 [0] NCCL INFO NET/Socket : Using [0]eno1:192.168.1.15<0> [1]veth2ea2288:fe80::78d8:4eff:fe15:8fed%veth2ea2288<0> [2]vethae2223a:fe80::b43d:18ff:fee2:b08f%vethae2223a<0>
oneflow-15:3907317:3907562 [0] NCCL INFO Using network Socket
oneflow-15:3907318:3907604 [1] NCCL INFO Setting affinity for GPU 1 to 0f,ff000fff
oneflow-15:3907317:3907562 [0] NCCL INFO Setting affinity for GPU 0 to 0f,ff000fff
oneflow-15:3907318:3907604 [1] NCCL INFO Trees [0] -1/-1/-1->3->2 [1] -1/-1/-1->3->2
oneflow-15:3907317:3907562 [0] NCCL INFO Trees [0] 3/-1/-1->2->0 [1] 3/0/-1->2->-1
oneflow-15:3907317:3907562 [0] NCCL INFO Channel 00/0 : 1[3000] -> 2[2000] [receive] via NET/Socket/1
oneflow-15:3907317:3907562 [0] NCCL INFO Channel 01/0 : 1[3000] -> 2[2000] [receive] via NET/Socket/1
oneflow-15:3907317:3907562 [0] NCCL INFO Channel 00 : 2[2000] -> 3[3000] via direct shared memory
oneflow-15:3907317:3907562 [0] NCCL INFO Channel 01 : 2[2000] -> 3[3000] via direct shared memory
oneflow-15:3907318:3907604 [1] NCCL INFO Channel 00/0 : 3[3000] -> 0[2000] [send] via NET/Socket/1
oneflow-15:3907318:3907604 [1] NCCL INFO Channel 01/0 : 3[3000] -> 0[2000] [send] via NET/Socket/1
oneflow-15:3907318:3907722 [1] misc/socket.cc:450 NCCL WARN Net : Connect to fe80::acbf:6cff:fea5:7120%7<60793> failed : Network is unreachable
oneflow-15:3907318:3907722 [1] NCCL INFO transport/net_socket.cc:354 -> 2
oneflow-15:3907318:3907722 [1] NCCL INFO include/net.h:25 -> 2
oneflow-15:3907318:3907722 [1] NCCL INFO transport/net.cc:515 -> 2
oneflow-15:3907318:3907722 [1] NCCL INFO proxy.cc:914 -> 2
oneflow-15:3907318:3907722 [1] NCCL INFO proxy.cc:942 -> 2
oneflow-15:3907318:3907722 [1] proxy.cc:1042 NCCL WARN [Proxy Service 3] Failed to execute operation Connect from rank 3, retcode 2
oneflow-15:3907318:3907604 [1] misc/socket.cc:523 NCCL WARN Net : Connection closed by remote peer oneflow-15<48415>
oneflow-15:3907318:3907604 [1] NCCL INFO misc/socket.cc:531 -> 2
oneflow-15:3907318:3907604 [1] NCCL INFO misc/socket.cc:543 -> 2
oneflow-15:3907318:3907604 [1] NCCL INFO proxy.cc:805 -> 2
oneflow-15:3907318:3907604 [1] proxy.cc:808 NCCL WARN Proxy Call to rank 3 failed (Connect)
oneflow-15:3907318:3907604 [1] NCCL INFO transport/net.cc:269 -> 2
oneflow-15:3907318:3907604 [1] NCCL INFO transport.cc:127 -> 2
oneflow-15:3907318:3907604 [1] NCCL INFO init.cc:730 -> 2
oneflow-15:3907318:3907604 [1] NCCL INFO init.cc:915 -> 2
oneflow-15:3907318:3907604 [1] NCCL INFO init.cc:951 -> 2
oneflow-15:3907318:3907604 [1] NCCL INFO init.cc:964 -> 2
F20220921 02:33:25.630995 3907604 eager_nccl_comm_manager.cpp:77] Check failed: ncclCommInitRank(comm, device_vec.size(), nccl_unique_id, rank) : unhandled system error (2). To see more detail, please run OneFlow with system variable NCCL_DEBUG=INFO
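I tested the issue above further:
After pinning the network interface with export NCCL_SOCKET_IFNAME=eno1 (but this IP is also the IB NIC's IP; there is no other IP):
With IB stopped on one machine and running on the other, the NCCL log shows one machine Using network IB, while the machine whose IB driver is stopped hangs, so the multi-node run eventually hangs as well.
With IB stopped on both machines, the run succeeds once the interface is specified. I wonder whether this PR is really necessary, because NCCL by default tries to use IB on each machine.
Of course, the case with IB stopped on only one machine could be retried on other machines where the IB NIC has its own separate IP.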
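The hang described above comes from the two machines picking different NCCL transports. A small sketch for spotting that mismatch in collected `NCCL_DEBUG=INFO` output, assuming the `host:pid:tid [n] NCCL INFO Using network X` line format seen in the log above; the function names are hypothetical, and the `oneflow-16` line is a made-up example of the IB case:

```python
import re
from collections import defaultdict
from typing import Dict, Set

# Matches e.g. "oneflow-15:3907318:3907604 [1] NCCL INFO Using network Socket".
_USING_NETWORK = re.compile(
    r"^(\S+?):\d+:\d+ \[\d+\] NCCL INFO Using network (\S+)", re.MULTILINE
)


def networks_by_host(nccl_log: str) -> Dict[str, Set[str]]:
    """Map each host name to the set of transports NCCL reported selecting."""
    result: Dict[str, Set[str]] = defaultdict(set)
    for host, network in _USING_NETWORK.findall(nccl_log):
        result[host].add(network)
    return dict(result)


def transports_disagree(nccl_log: str) -> bool:
    """Return True if more than one transport was selected across all ranks."""
    chosen: Set[str] = set()
    for networks in networks_by_host(nccl_log).values():
        chosen |= networks
    return len(chosen) > 1


sample = (
    "oneflow-15:3907318:3907604 [1] NCCL INFO Using network Socket\n"
    "oneflow-16:1111:2222 [0] NCCL INFO Using network IB\n"  # hypothetical line
)
print(networks_by_host(sample))
print(transports_disagree(sample))
```

This only reads logs after the fact; it does not change which transport NCCL selects.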
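"but this IP is also the IB NIC's IP; there is no other IP"
Was this experiment run on oneflow15 and oneflow16? The 192.x address should be the Ethernet NIC's IP; it looks like oneflow15 and oneflow16 have no IPoIB (an IP on the IB NIC) configured.
Judging from the results above, NCCL does not seem to handle the case where the IB service is up on some machines and down on others; this can be verified with nccl-tests. If NCCL really does not handle it, we can tell users to disable RDMA through NCCL's environment variable.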
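The comment above does not name the variable. A minimal sketch, assuming it means NCCL's documented `NCCL_IB_DISABLE` setting, which makes NCCL fall back to sockets, combined with the `NCCL_SOCKET_IFNAME` pinning used earlier in this thread. The helper name is hypothetical, and the variables must be set in every rank's environment before NCCL initializes:

```python
import os
from typing import Dict, Mapping, Optional


def nccl_env_without_ib(
    base_env: Mapping[str, str], socket_ifname: Optional[str] = None
) -> Dict[str, str]:
    """Return a copy of base_env that asks NCCL not to use the IB transport."""
    env = dict(base_env)
    # Assumption: NCCL_IB_DISABLE=1 is the setting the reviewer has in mind.
    env["NCCL_IB_DISABLE"] = "1"
    if socket_ifname is not None:
        # Optionally pin the socket interface, as done with eno1 in this thread.
        env["NCCL_SOCKET_IFNAME"] = socket_ifname
    return env


env = nccl_env_without_ib(os.environ, socket_ifname="eno1")
print(env["NCCL_IB_DISABLE"], env["NCCL_SOCKET_IFNAME"])
```

The resulting mapping can be passed as `env=` when launching each training process, for example via `subprocess.Popen`.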
W20220920 11:03:06.144380 3501785 env_global_objects_scope.cpp:279] Skip init RDMA because RDMA is unavailable!
W20220920 11:03:06.160089 3501786 env_global_objects_scope.cpp:279] Skip init RDMA because RDMA is unavailable!
F20220920 11:03:10.630368 3502308 eager_nccl_comm_manager.cpp:77] Check failed: ncclCommInitRank(comm, device_vec.size(), nccl_unique_id, rank) : unhandled system error (2). To see more detail, please run OneFlow with system variable NCCL_DEBUG=INFO
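It looks like this PR's logic is taking effect; the error is reported by NCCL.
"I wonder whether this PR is really necessary, because NCCL by default tries to use IB on each machine."
This PR targets a case users hit before: an IB card is present but has no link, and an error is raised. Some of those cases are single-machine. If any node in the cluster has no port that can establish an IB link, disabling IB for the whole cluster is reasonable. The NCCL problem is independent of this one.
The current tests do not cover the problem this PR is meant to solve.
openibd is the service that loads the NIC driver; stopping it makes the driver unavailable and the NIC invisible.
The code returns before it ever queries port availability, so it does not achieve the goal that needs testing: when any node has no active port and the user calls init_rdma, warn the user and ignore init_rdma.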
CI failed when running job: Build cu102. PR label automerge has been removed
View latest API docs preview at: https://staging.oneflow.info/docs/Oneflow-Inc/oneflow/pr/9115/
Speed stats:
GPU Name: GeForce GTX 1080
❌ OneFlow resnet50 time: 140.0ms (= 14003.9ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 162.4ms (= 16236.1ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.16 (= 162.4ms / 140.0ms)
OneFlow resnet50 time: 85.8ms (= 8580.1ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 108.6ms (= 10857.0ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.27 (= 108.6ms / 85.8ms)
OneFlow resnet50 time: 58.4ms (= 11689.6ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 78.7ms (= 15747.5ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.35 (= 78.7ms / 58.4ms)
OneFlow resnet50 time: 44.4ms (= 8885.4ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 71.8ms (= 14363.6ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.62 (= 71.8ms / 44.4ms)
OneFlow resnet50 time: 40.9ms (= 8189.5ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 69.1ms (= 13825.9ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.69 (= 69.1ms / 40.9ms)
CI failed when running job: cpu-module. PR label automerge has been removed
Speed stats:
GPU Name: GeForce GTX 1080
❌ OneFlow resnet50 time: 139.8ms (= 13975.0ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 160.5ms (= 16049.6ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.15 (= 160.5ms / 139.8ms)
OneFlow resnet50 time: 85.7ms (= 8569.3ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 102.1ms (= 10206.9ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.19 (= 102.1ms / 85.7ms)
OneFlow resnet50 time: 57.9ms (= 11588.5ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 78.3ms (= 15661.9ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.35 (= 78.3ms / 57.9ms)
OneFlow resnet50 time: 45.1ms (= 9028.8ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 70.3ms (= 14064.9ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.56 (= 70.3ms / 45.1ms)
OneFlow resnet50 time: 39.8ms (= 7967.4ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 76.6ms (= 15328.6ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.92 (= 76.6ms / 39.8ms)