
Disable IB when there are no active IB devices

Open liujuncheng opened this issue 3 years ago • 7 comments

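When any node is detected to have no active IB port, disable IB to avoid errors. Needs testing:

  • [ ] If any node has no active port and the user calls init_rdma, warn the user and ignore init_rdma
  • [ ] Otherwise, execute init_rdma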
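
The two checklist items above can be sketched as a small decision function. This is only an illustration of the intended behavior, not OneFlow's implementation: `decide_rdma` and its arguments are hypothetical names, and the per-node flags are assumed to have been gathered by some other means.

```python
from typing import Dict, Optional, Tuple


def decide_rdma(
    user_called_init_rdma: bool,
    node_has_active_port: Dict[str, bool],
) -> Tuple[bool, Optional[str]]:
    """Return (enable_rdma, warning_message).

    Hypothetical helper: RDMA is enabled only if the user asked for it
    and every node reported at least one active IB port.
    """
    if not user_called_init_rdma:
        # Nothing was requested, so there is nothing to warn about.
        return False, None

    inactive = sorted(n for n, ok in node_has_active_port.items() if not ok)
    if inactive:
        # Warn the user and ignore init_rdma for the whole cluster.
        return False, "Skip init RDMA: no active IB port on " + ", ".join(inactive)

    return True, None
```

For example, `decide_rdma(True, {"oneflow15": True, "oneflow16": False})` disables RDMA and returns a warning that names `oneflow16`.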

liujuncheng avatar Sep 20 '22 08:09 liujuncheng

oneflow16+oneflow15

  • [x] On oneflow16, single machine, normal case (IB service not stopped), init rdma

    state:			PORT_ACTIVE (4)
    
  • [x] On oneflow16 with the IB service stopped (/etc/init.d/openibd stop), user enables init rdma

    W20220920 10:44:56.579414 1919887 env_global_objects_scope.cpp:279] Skip init RDMA because RDMA is unavailable!
    W20220920 10:44:56.632112 1919886 env_global_objects_scope.cpp:279] Skip init RDMA because RDMA is unavailable!
    W20220920 10:44:56.632431 1919885 env_global_objects_scope.cpp:279] Skip init RDMA because RDMA is unavailable!
    
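  • [x] On oneflow16 + oneflow15, two machines, normal case (IB service not stopped), user enables init rdma (2 machines, 4 GPUs in total)

  • [x] On oneflow16 + oneflow15, IB service stopped on one of the two machines (oneflow16), user enables init rdma; tested twice, switching the master node IP address (2 machines, 4 GPUs in total)

    • [x] With oneflow15 as the master node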

      W20220920 10:58:48.086185 3492934 env_global_objects_scope.cpp:279] Skip init RDMA because RDMA is unavailable!
      W20220920 10:58:48.113924 3492933 env_global_objects_scope.cpp:279] Skip init RDMA because RDMA is unavailable!
      
      The master node reports an error:
      F20220920 10:58:52.682257 3493300 eager_nccl_comm_manager.cpp:77] Check failed: ncclCommInitRank(comm, device_vec.size(), nccl_unique_id, rank) : unhandled system error (2). To see more detail, please run OneFlow with system variable NCCL_DEBUG=INFO
      *** Check failure stack trace: ***
          @     0x7f59eab050fa  google::LogMessage::Fail()
          @     0x7f59eab053e2  google::LogMessage::SendToLog()
          @     0x7f59eab04c67  google::LogMessage::Flush()
          @     0x7f59eab077d9  google::LogMessageFatal::~LogMessageFatal()
          @     0x7f59e2e408bd  oneflow::(anonymous namespace)::CreateNcclComm()
          @     0x7f59e2e42541  oneflow::EagerNcclCommMgr::GetCommForDevice()
          @     0x7f59e419551b  oneflow::ccl::CudaCommunicationContext::Init()
          @     0x7f59e4452c2a  oneflow::(anonymous namespace)::EagerCclOpKernelCache::Init()
          @     0x7f59e4455e37  oneflow::EagerCclBroadcastKernel::InitOpKernelCacheWithFlags()
          @     0x7f59e5687f60  oneflow::one::StatefulOpKernel::TryInitOpKernelStateAndCache()
          @     0x7f59e3e13f7d  oneflow::vm::OpCallInstructionPolicy::Compute()
          @     0x7f59e3e11c1f  oneflow::vm::EpStreamPolicyBase::Run()
          @     0x7f59e3e19f6a  oneflow::vm::StreamPolicy::RunIf()
          @     0x7f59e3e2110e  oneflow::vm::ThreadCtx::TryReceiveAndRun()
          @     0x7f59e3e23680  oneflow::(anonymous namespace)::WorkerLoop()
          @     0x7f59e3e23a4f  _ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJPFvPN7oneflow2vm9ThreadCtxERKSt8functionIFvS6_EEES6_ZNS3_14VirtualMachine15CreateThreadCtxENS3_6SymbolINS3_6DeviceEEENS3_10StreamTypeEmEUlS6_E3_EEEEE6_M_runEv
          @     0x7f59eab19b3f  execute_native_thread_routine
          @     0x7f5abaca0609  start_thread
          @     0x7f5ababc5133  clone
      
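    • [x] With oneflow16 as the master node

      The same error as above is reported; it occurs on the node where the IB service is still running.
      
  • [x] IB stopped on both oneflow15 and oneflow16, user enables init rdma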

    W20220920 11:03:06.144380 3501785 env_global_objects_scope.cpp:279] Skip init RDMA because RDMA is unavailable!
    W20220920 11:03:06.160089 3501786 env_global_objects_scope.cpp:279] Skip init RDMA because RDMA is unavailable!
    F20220920 11:03:10.630368 3502308 eager_nccl_comm_manager.cpp:77] Check failed: ncclCommInitRank(comm, device_vec.size(), nccl_unique_id, rank) : unhandled system error (2). To see more detail, please run OneFlow with system variable NCCL_DEBUG=INFO
    
    The error occurs on both machines at the same time.
    

Summary

With IB stopped on one machine or on both, enabling init rdma ends in an error.

ouyangyu avatar Sep 20 '22 11:09 ouyangyu

After stopping IB with /etc/init.d/openibd stop, the error occurs whether or not the user calls init rdma.

ibv_devinfo
Failed to get IB devices list: Function not implemented
ibstatus
Fatal error:  No devices
/usr/sbin/ibstatus: 21: exit: Illegal number: -1
  • The NCCL debug output follows:
oneflow-15:3907318:3907604 [1] NCCL INFO Bootstrap : Using eno1:192.168.1.15<0>
oneflow-15:3907317:3907562 [0] NCCL INFO Bootstrap : Using eno1:192.168.1.15<0>
oneflow-15:3907318:3907604 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
oneflow-15:3907317:3907562 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
oneflow-15:3907318:3907604 [1] NCCL INFO NET/IB : No device found.
oneflow-15:3907317:3907562 [0] NCCL INFO NET/IB : No device found.
oneflow-15:3907318:3907604 [1] NCCL INFO NET/Socket : Using [0]eno1:192.168.1.15<0> [1]veth2ea2288:fe80::78d8:4eff:fe15:8fed%veth2ea2288<0> [2]vethae2223a:fe80::b43d:18ff:fee2:b08f%vethae2223a<0>
oneflow-15:3907318:3907604 [1] NCCL INFO Using network Socket
oneflow-15:3907317:3907562 [0] NCCL INFO NET/Socket : Using [0]eno1:192.168.1.15<0> [1]veth2ea2288:fe80::78d8:4eff:fe15:8fed%veth2ea2288<0> [2]vethae2223a:fe80::b43d:18ff:fee2:b08f%vethae2223a<0>
oneflow-15:3907317:3907562 [0] NCCL INFO Using network Socket
oneflow-15:3907318:3907604 [1] NCCL INFO Setting affinity for GPU 1 to 0f,ff000fff
oneflow-15:3907317:3907562 [0] NCCL INFO Setting affinity for GPU 0 to 0f,ff000fff
oneflow-15:3907318:3907604 [1] NCCL INFO Trees [0] -1/-1/-1->3->2 [1] -1/-1/-1->3->2
oneflow-15:3907317:3907562 [0] NCCL INFO Trees [0] 3/-1/-1->2->0 [1] 3/0/-1->2->-1
oneflow-15:3907317:3907562 [0] NCCL INFO Channel 00/0 : 1[3000] -> 2[2000] [receive] via NET/Socket/1
oneflow-15:3907317:3907562 [0] NCCL INFO Channel 01/0 : 1[3000] -> 2[2000] [receive] via NET/Socket/1
oneflow-15:3907317:3907562 [0] NCCL INFO Channel 00 : 2[2000] -> 3[3000] via direct shared memory
oneflow-15:3907317:3907562 [0] NCCL INFO Channel 01 : 2[2000] -> 3[3000] via direct shared memory
oneflow-15:3907318:3907604 [1] NCCL INFO Channel 00/0 : 3[3000] -> 0[2000] [send] via NET/Socket/1
oneflow-15:3907318:3907604 [1] NCCL INFO Channel 01/0 : 3[3000] -> 0[2000] [send] via NET/Socket/1

oneflow-15:3907318:3907722 [1] misc/socket.cc:450 NCCL WARN Net : Connect to fe80::acbf:6cff:fea5:7120%7<60793> failed : Network is unreachable
oneflow-15:3907318:3907722 [1] NCCL INFO transport/net_socket.cc:354 -> 2
oneflow-15:3907318:3907722 [1] NCCL INFO include/net.h:25 -> 2
oneflow-15:3907318:3907722 [1] NCCL INFO transport/net.cc:515 -> 2
oneflow-15:3907318:3907722 [1] NCCL INFO proxy.cc:914 -> 2
oneflow-15:3907318:3907722 [1] NCCL INFO proxy.cc:942 -> 2

oneflow-15:3907318:3907722 [1] proxy.cc:1042 NCCL WARN [Proxy Service 3] Failed to execute operation Connect from rank 3, retcode 2

oneflow-15:3907318:3907604 [1] misc/socket.cc:523 NCCL WARN Net : Connection closed by remote peer oneflow-15<48415>
oneflow-15:3907318:3907604 [1] NCCL INFO misc/socket.cc:531 -> 2
oneflow-15:3907318:3907604 [1] NCCL INFO misc/socket.cc:543 -> 2
oneflow-15:3907318:3907604 [1] NCCL INFO proxy.cc:805 -> 2

oneflow-15:3907318:3907604 [1] proxy.cc:808 NCCL WARN Proxy Call to rank 3 failed (Connect)
oneflow-15:3907318:3907604 [1] NCCL INFO transport/net.cc:269 -> 2
oneflow-15:3907318:3907604 [1] NCCL INFO transport.cc:127 -> 2
oneflow-15:3907318:3907604 [1] NCCL INFO init.cc:730 -> 2
oneflow-15:3907318:3907604 [1] NCCL INFO init.cc:915 -> 2
oneflow-15:3907318:3907604 [1] NCCL INFO init.cc:951 -> 2
oneflow-15:3907318:3907604 [1] NCCL INFO init.cc:964 -> 2
F20220921 02:33:25.630995 3907604 eager_nccl_comm_manager.cpp:77] Check failed: ncclCommInitRank(comm, device_vec.size(), nccl_unique_id, rank) : unhandled system error (2). To see more detail, please run OneFlow with system variable NCCL_DEBUG=INFO

ouyangyu avatar Sep 21 '22 02:09 ouyangyu

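I did some further testing on the problem above.

I pinned the network interface with export NCCL_SOCKET_IFNAME=eno1 (but this IP is also the IB NIC's IP; there is no other IP).

With IB stopped on one machine and running on the other, the NCCL log shows one machine reporting "Using network IB", while the machine whose IB driver is stopped hangs, which eventually stalls the whole multi-machine run.

With IB stopped on both machines, the job runs once the interface is specified. I wonder whether this PR is really necessary, since NCCL on each machine tries to use IB by default.

Of course, the case with IB stopped on only one machine could also be tried on other machines where the IB NIC has its own separate IP.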
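
To spot the mixed situation described above (one machine on IB, the other on sockets), the `Using network ...` lines in the NCCL_DEBUG=INFO output can be compared across processes. The parser below is a rough sketch that assumes the line layout seen in the logs in this thread; `networks_by_process` and `has_mixed_transports` are hypothetical helper names.

```python
import re
from typing import Dict, Iterable, Tuple

# Matches lines such as:
#   oneflow-15:3907318:3907604 [1] NCCL INFO Using network Socket
_USING_NETWORK = re.compile(
    r"^(?P<host>[^:\s]+):(?P<pid>\d+):\d+ \[\d+\] NCCL INFO Using network (?P<net>\S+)"
)


def networks_by_process(lines: Iterable[str]) -> Dict[Tuple[str, int], str]:
    """Map (hostname, pid) to the transport NCCL reported for that process."""
    result: Dict[Tuple[str, int], str] = {}
    for line in lines:
        match = _USING_NETWORK.match(line.strip())
        if match:
            result[(match["host"], int(match["pid"]))] = match["net"]
    return result


def has_mixed_transports(lines: Iterable[str]) -> bool:
    """True if different processes reported different transports (e.g. IB and Socket)."""
    return len(set(networks_by_process(lines).values())) > 1
```

Feeding it the concatenated logs of all ranks gives a quick check before digging into the proxy and socket errors.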

ouyangyu avatar Sep 22 '22 02:09 ouyangyu

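But this IP is also the IB NIC's IP; there is no other IP

Was this experiment run on 15 and 16? The 192.x IP should belong to the Ethernet NIC; it looks like IPoIB (an IP for the IB NIC) is not configured on 15 and 16.

From the results above, it looks like NCCL does not handle the case where the IB service is up on some machines and down on others; this can be verified with nccl-tests. If NCCL really does not handle it, we can prompt users to disable RDMA through NCCL's environment variable.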
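
As a sketch of that workaround: NCCL reads `NCCL_IB_DISABLE`, and setting it to `1` makes NCCL skip its IB transport and fall back to sockets. The helper below only builds an environment for the launched processes; the function name and the idea of passing it to a launcher are assumptions, not part of OneFlow.

```python
import os
from typing import Dict, Mapping, Optional


def nccl_env_without_ib(
    base_env: Optional[Mapping[str, str]] = None,
    socket_ifname: Optional[str] = None,
) -> Dict[str, str]:
    """Return a copy of the environment with NCCL's IB transport disabled."""
    env = dict(os.environ if base_env is None else base_env)
    env["NCCL_IB_DISABLE"] = "1"  # tell NCCL not to use IB/RDMA
    if socket_ifname:
        # Pin the socket transport to one interface, e.g. "eno1" as in this thread.
        env["NCCL_SOCKET_IFNAME"] = socket_ifname
    env.setdefault("NCCL_DEBUG", "INFO")  # keep transport selection visible in logs
    return env
```

The result can be passed as `env=` to `subprocess.run` when starting each rank; it has to be set on every node, since the failures above came from nodes disagreeing about the transport.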

shangguanshiyuan avatar Sep 22 '22 04:09 shangguanshiyuan

W20220920 11:03:06.144380 3501785 env_global_objects_scope.cpp:279] Skip init RDMA because RDMA is unavailable!
W20220920 11:03:06.160089 3501786 env_global_objects_scope.cpp:279] Skip init RDMA because RDMA is unavailable!
F20220920 11:03:10.630368 3502308 eager_nccl_comm_manager.cpp:77] Check failed: ncclCommInitRank(comm, device_vec.size(), nccl_unique_id, rank) : unhandled system error (2). To see more detail, please run OneFlow with system variable NCCL_DEBUG=INFO

It looks like this PR's logic is taking effect; the error is reported by NCCL.

shangguanshiyuan avatar Sep 22 '22 04:09 shangguanshiyuan

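I wonder whether this PR is really necessary, since NCCL on each machine tries to use IB by default.

This PR targets a case users ran into earlier: an IB card is present but has no link, which causes an error. Some of those cases are single-machine. If any node in the cluster has no port that can establish an IB link, disabling IB for the whole cluster is reasonable. The NCCL problem is a separate issue.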

liujuncheng avatar Sep 22 '22 05:09 liujuncheng

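The current tests do not cover the problem this PR is meant to solve. openibd is the service that loads the NIC driver; stopping it makes the driver unavailable and the NIC invisible. The code returns before it ever queries port availability, so these tests do not exercise the intended case: if any node has no active port and the user calls init_rdma, warn the user and ignore init_rdma.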
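
The distinction drawn here (no visible device versus a visible device with no active port) can be illustrated with the Linux sysfs layout, where each port exposes a `state` file such as `/sys/class/infiniband/<device>/ports/<port>/state` containing text like `4: ACTIVE`. This is a minimal sketch for illustration only; it is not how OneFlow queries ports, and `ib_status` is a hypothetical name.

```python
from pathlib import Path
from typing import Union


def ib_status(sysfs_root: Union[str, Path] = "/sys/class/infiniband") -> str:
    """Classify the local IB state as 'no_device', 'no_active_port' or 'active'."""
    root = Path(sysfs_root)
    devices = list(root.iterdir()) if root.is_dir() else []
    if not devices:
        # Driver not loaded or no HCA visible, e.g. after stopping openibd.
        return "no_device"
    for device in devices:
        for state_file in device.glob("ports/*/state"):
            # The file holds "<number>: <NAME>"; 4 corresponds to PORT_ACTIVE.
            if state_file.read_text().strip().startswith("4:"):
                return "active"
    # Devices exist, but none of their ports is active (e.g. cable unplugged).
    return "no_active_port"
```

A test for the PR's target scenario would need to reach the `no_active_port` branch, for example by bringing a port down rather than unloading the driver.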

shangguanshiyuan avatar Sep 22 '22 09:09 shangguanshiyuan

CI failed when running job: Build cu102. PR label automerge has been removed

github-actions[bot] avatar Sep 27 '22 01:09 github-actions[bot]

View latest API docs preview at: https://staging.oneflow.info/docs/Oneflow-Inc/oneflow/pr/9115/

github-actions[bot] avatar Sep 27 '22 06:09 github-actions[bot]

Speed stats:
GPU Name: GeForce GTX 1080

❌ OneFlow resnet50 time: 140.0ms (= 14003.9ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 162.4ms (= 16236.1ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.16 (= 162.4ms / 140.0ms)

OneFlow resnet50 time: 85.8ms (= 8580.1ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 108.6ms (= 10857.0ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.27 (= 108.6ms / 85.8ms)

OneFlow resnet50 time: 58.4ms (= 11689.6ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 78.7ms (= 15747.5ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.35 (= 78.7ms / 58.4ms)

OneFlow resnet50 time: 44.4ms (= 8885.4ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 71.8ms (= 14363.6ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.62 (= 71.8ms / 44.4ms)

OneFlow resnet50 time: 40.9ms (= 8189.5ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 69.1ms (= 13825.9ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.69 (= 69.1ms / 40.9ms)

github-actions[bot] avatar Sep 27 '22 06:09 github-actions[bot]

CI failed when running job: cpu-module. PR label automerge has been removed

github-actions[bot] avatar Sep 27 '22 06:09 github-actions[bot]

Speed stats:
GPU Name: GeForce GTX 1080

❌ OneFlow resnet50 time: 139.8ms (= 13975.0ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 160.5ms (= 16049.6ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.15 (= 160.5ms / 139.8ms)

OneFlow resnet50 time: 85.7ms (= 8569.3ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 102.1ms (= 10206.9ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.19 (= 102.1ms / 85.7ms)

OneFlow resnet50 time: 57.9ms (= 11588.5ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 78.3ms (= 15661.9ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.35 (= 78.3ms / 57.9ms)

OneFlow resnet50 time: 45.1ms (= 9028.8ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 70.3ms (= 14064.9ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.56 (= 70.3ms / 45.1ms)

OneFlow resnet50 time: 39.8ms (= 7967.4ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 76.6ms (= 15328.6ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.92 (= 76.6ms / 39.8ms)

github-actions[bot] avatar Sep 28 '22 00:09 github-actions[bot]

View latest API docs preview at: https://staging.oneflow.info/docs/Oneflow-Inc/oneflow/pr/9115/

github-actions[bot] avatar Sep 28 '22 00:09 github-actions[bot]