Error: "transport retry counter exceeded" when torch.distributed.rpc.init_rpc between different pods in k8s
Hello, my code runs in a k8s environment. I started PyTorch in two pods and tried to use torch RPC, but I ran into an error in torch.distributed.rpc.init_rpc. I hope to get some advice or inspiration.
Code
if __name__ == "__main__":
parser = argparse.ArgumentParser(description='Test torch rpc')
parser.add_argument('--rank', type=int)
parser.add_argument('--init_method', type=str)
rank = args.rank
init_method = args.init_method
rpc_backend_options = rpc.TensorPipeRpcBackendOptions(num_worker_threads=8, init_method=init_method)
if rank == 1:
rpc_backend_options.set_device_map("B", {0: 0})
name = "A"
elif rank == 0:
rpc_backend_options.set_device_map("A", {0: 0})
name = "B"
logging.debug(torch.__version__)
logging.debug("name:{}, rank:{}, init_method:{}".format(name, rank, init_method))
rpc.init_rpc(name, rank=rank, world_size=world_size, rpc_backend_options=rpc_backend_options)
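Not part of the original script, but a minimal hedged smoke test could be appended after init_rpc to confirm the device map (the torch.add call and the tensor shape are just illustrative assumptions):

    # Hypothetical check (assumption, not in the original script): with the device
    # map above, CUDA tensors in args are delivered to the callee's cuda:0 and the
    # result is mapped back.
    if rank == 0:
        x = torch.ones(2, device="cuda:0")
        ret = rpc.rpc_sync("A", torch.add, args=(x, x))
        logging.debug(ret)
    rpc.shutdown()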
Error
I set TP_VERBOSE_LOGGING so that tensorpipe prints its debug info. The fatal error is IBV_WC_RETRY_EXC_ERR: Transport Retry Counter Exceeded, which means the remote side didn't send any Ack or Nack. If this happens after sending the first message, it usually means the remote QP is no longer available. But in fact the other process does not report an error; it just waits until init_rpc times out.
Here is the rank 0 log, name is B:
[09-21 07:24:27] DEBUG 1.12.1 func.py:67
[09-21 07:24:27] DEBUG name:B, rank:0, init_method:tcp://k8snccltest-serial-0.di-k8snccltest.xlab.svc.k8s.cluster:50516 func.py:67
V0921 07:24:29.916767 89 tensorpipe/core/context_impl.cc:53] Context 89:c0 created
V0921 07:24:29.916805 89 tensorpipe/core/context_impl.cc:55] Context 89:c0 aliased as B
V0921 07:24:29.930452 89 tensorpipe/core/context_impl.cc:81] Context B is registering transport ibv
V0921 07:24:29.930539 89 tensorpipe/core/context_impl.cc:81] Context B is registering transport uv
V0921 07:24:29.932656 89 tensorpipe/core/context_impl.cc:81] Context B is registering transport shm
V0921 07:24:29.932819 89 tensorpipe/core/context_impl.cc:104] Context B is registering channel cuda_basic
V0921 07:24:29.932829 89 tensorpipe/channel/context_impl_boilerplate.h:191] Channel context N/A was renamed to B.ch_cuda_basic
V0921 07:24:29.932836 89 tensorpipe/channel/context_impl_boilerplate.h:191] Channel context N/A was renamed to B.ch_cuda_basic.cpu
V0921 07:24:29.935908 89 tensorpipe/core/context_impl.cc:104] Context B is registering channel cuda_xth
V0921 07:24:29.935921 89 tensorpipe/channel/context_impl_boilerplate.h:191] Channel context N/A was renamed to B.ch_cuda_xth
V0921 07:24:29.935981 89 tensorpipe/channel/cma/context_impl.cc:225] Couldn't detect the active Linux Security Modules
V0921 07:24:29.936031 89 tensorpipe/channel/cma/context_impl.cc:253] YAMA ptrace scope set to classic ptrace permissions
V0921 07:24:29.936044 89 tensorpipe/channel/cma/context_impl.cc:296] The domain descriptor for CMA is cma:6783c3f0-b9c5-408b-80b7-9fd2287768d8_1000d9_f0001d4c_1000d9_effffffd_0_0_a80425fb
V0921 07:24:29.936110 89 tensorpipe/core/context_impl.cc:104] Context B is registering channel cma
V0921 07:24:29.936117 89 tensorpipe/channel/context_impl_boilerplate.h:191] Channel context N/A was renamed to B.ch_cma
V0921 07:24:29.964395 89 tensorpipe/channel/cuda_ipc/context_impl.cc:254] The UUIDs of all the GPUs found by the CUDA IPC channel are 53dd857c-914a-4f99-7bcc-0af7e3edc566
V0921 07:24:29.964409 89 tensorpipe/channel/cuda_ipc/context_impl.cc:256] The peer-to-peer support found by the CUDA IPC channel is {{1}}
V0921 07:24:29.964435 89 tensorpipe/core/context_impl.cc:104] Context B is registering channel cuda_ipc
V0921 07:24:29.964445 89 tensorpipe/channel/context_impl_boilerplate.h:191] Channel context N/A was renamed to B.ch_cuda_ipc
V0921 07:24:29.964457 89 tensorpipe/core/context_impl.cc:104] Context B is registering channel basic
V0921 07:24:29.964463 89 tensorpipe/channel/context_impl_boilerplate.h:191] Channel context N/A was renamed to B.ch_basic
V0921 07:24:29.995572 89 tensorpipe/core/context_impl.cc:104] Context B is registering channel mpt_uv
V0921 07:24:29.995585 89 tensorpipe/channel/context_impl_boilerplate.h:191] Channel context N/A was renamed to B.ch_mpt_uv
V0921 07:24:29.995705 89 tensorpipe/channel/cuda_gdr/context_impl.cc:479] CUDA GDR channel is not viable because the nv_peer_mem kernel module isn't active
V0921 07:24:29.995721 89 tensorpipe/core/context_impl.cc:116] Context B is opening listener B[l0]
V0921 07:24:29.996051 89 tensorpipe/core/listener_impl.cc:269] Listener B[l0] is accepting connection on transport shm
V0921 07:24:29.996061 89 tensorpipe/core/listener_impl.cc:269] Listener B[l0] is accepting connection on transport ibv
V0921 07:24:29.996065 89 tensorpipe/core/listener_impl.cc:269] Listener B[l0] is accepting connection on transport uv
V0921 07:24:30.330784 89 tensorpipe/core/listener_impl.cc:102] Listener B[l0] received an accept request (#0)
V0921 07:24:30.331219 89 tensorpipe/core/context_impl.cc:128] Context B is opening pipe B.p0
V0921 07:24:30.331241 89 tensorpipe/core/context_impl.cc:132] Pipe B.p0 aliased as B_to_A
V0921 07:24:30.331300 89 tensorpipe/core/pipe_impl.cc:286] Pipe B_to_A is writing nop object (spontaneous connection)
V0921 07:24:30.331342 89 tensorpipe/core/pipe_impl.cc:311] Pipe B_to_A is writing nop object (brochure)
V0921 07:24:30.331349 89 tensorpipe/core/pipe_impl.cc:319] Pipe B_to_A is reading nop object (brochure answer)
V0921 07:24:30.331715 89 tensorpipe/core/pipe_impl.cc:290] Pipe B_to_A done writing nop object (spontaneous connection)
V0921 07:24:30.331771 89 tensorpipe/core/pipe_impl.cc:314] Pipe B_to_A done writing nop object (brochure)
V0921 07:24:30.332378 89 tensorpipe/core/pipe_impl.cc:322] Pipe B_to_A done reading nop object (brochure answer)
V0921 07:24:30.332393 89 tensorpipe/core/pipe_impl.cc:1113] Pipe B_to_A is opening connection (descriptor, as replacement)
V0921 07:24:30.332434 89 tensorpipe/core/pipe_impl.cc:1217] Pipe B_to_A is writing nop object (requested connection)
V0921 07:24:30.332448 89 tensorpipe/core/pipe_impl.cc:1131] Pipe B_to_A is opening connection (descriptor_reply)
V0921 07:24:30.332456 89 tensorpipe/core/pipe_impl.cc:1217] Pipe B_to_A is writing nop object (requested connection)
V0921 07:24:30.332527 89 tensorpipe/core/pipe_impl.cc:1186] Pipe B_to_A is opening connection 0/2 (for channel cuda_basic)
V0921 07:24:30.332538 89 tensorpipe/core/pipe_impl.cc:1217] Pipe B_to_A is writing nop object (requested connection)
V0921 07:24:30.332542 89 tensorpipe/core/pipe_impl.cc:1186] Pipe B_to_A is opening connection 1/2 (for channel cuda_basic)
V0921 07:24:30.332550 89 tensorpipe/core/pipe_impl.cc:1217] Pipe B_to_A is writing nop object (requested connection)
V0921 07:24:30.332565 89 tensorpipe/channel/context_impl_boilerplate.h:125] Channel context B.ch_cuda_basic.cpu is opening channel B.ch_cuda_basic.cpu.c0
V0921 07:24:30.332585 89 tensorpipe/channel/context_impl_boilerplate.h:125] Channel context B.ch_cuda_basic is opening channel B.ch_cuda_basic.c0
V0921 07:24:30.332608 89 tensorpipe/channel/channel_impl_boilerplate.h:250] Channel B.ch_cuda_basic.c0 was renamed to B_to_A.ch_cuda_basic
V0921 07:24:30.332622 89 tensorpipe/channel/channel_impl_boilerplate.h:250] Channel B.ch_cuda_basic.cpu.c0 was renamed to B_to_A.ch_cuda_basic.cpu
V0921 07:24:30.332631 89 tensorpipe/core/pipe_impl.cc:1186] Pipe B_to_A is opening connection 0/1 (for channel mpt_uv)
V0921 07:24:30.332638 89 tensorpipe/core/pipe_impl.cc:1217] Pipe B_to_A is writing nop object (requested connection)
V0921 07:24:30.332649 89 tensorpipe/channel/context_impl_boilerplate.h:125] Channel context B.ch_mpt_uv is opening channel B.ch_mpt_uv.c0
V0921 07:24:30.332680 89 tensorpipe/channel/channel_impl_boilerplate.h:250] Channel B.ch_mpt_uv.c0 was renamed to B_to_A.ch_mpt_uv
V0921 07:24:30.343321 89 tensorpipe/core/pipe_impl.cc:1221] Pipe B_to_A done writing nop object (requested connection)
V0921 07:24:30.343622 89 tensorpipe/core/pipe_impl.cc:1221] Pipe B_to_A done writing nop object (requested connection)
V0921 07:24:30.343782 89 tensorpipe/core/pipe_impl.cc:1221] Pipe B_to_A done writing nop object (requested connection)
V0921 07:24:30.343939 89 tensorpipe/core/pipe_impl.cc:1221] Pipe B_to_A done writing nop object (requested connection)
V0921 07:24:30.345877 89 tensorpipe/core/pipe_impl.cc:1221] Pipe B_to_A done writing nop object (requested connection)
V0921 07:24:31.525355 89 tensorpipe/core/pipe_impl.cc:550] Pipe B_to_A received a write request (#0, contaning 4 payloads and 0 tensors)
V0921 07:24:31.525398 89 tensorpipe/core/pipe_impl.cc:925] Pipe B_to_A is writing nop object (message descriptor #0)
V0921 07:24:31.525412 89 tensorpipe/core/pipe_impl.cc:942] Pipe B_to_A is writing payloads of message #0
V0921 07:24:31.525415 89 tensorpipe/core/pipe_impl.cc:948] Pipe B_to_A is writing payload #0.0
V0921 07:24:31.525421 89 tensorpipe/core/pipe_impl.cc:948] Pipe B_to_A is writing payload #0.1
V0921 07:24:31.525425 89 tensorpipe/core/pipe_impl.cc:948] Pipe B_to_A is writing payload #0.2
V0921 07:24:31.525428 89 tensorpipe/core/pipe_impl.cc:948] Pipe B_to_A is writing payload #0.3
V0921 07:24:31.525434 89 tensorpipe/core/pipe_impl.cc:880] Pipe B_to_A is sending tensors of message #0
V0921 07:24:31.525495 89 tensorpipe/core/pipe_impl.cc:931] Pipe B_to_A done writing nop object (message descriptor #0)
V0921 07:24:31.525575 89 tensorpipe/core/pipe_impl.cc:954] Pipe B_to_A done writing payload #0.0
V0921 07:24:31.525598 89 tensorpipe/core/pipe_impl.cc:954] Pipe B_to_A done writing payload #0.1
V0921 07:24:31.525622 89 tensorpipe/core/pipe_impl.cc:954] Pipe B_to_A done writing payload #0.2
V0921 07:24:31.525634 89 tensorpipe/core/pipe_impl.cc:954] Pipe B_to_A done writing payload #0.3
V0921 07:24:31.525640 89 tensorpipe/core/pipe_impl.cc:557] Pipe B_to_A is calling a write callback (#0)
V0921 07:24:31.525671 89 tensorpipe/core/pipe_impl.cc:560] Pipe B_to_A done calling a write callback (#0)
V0921 07:24:31.525692 89 tensorpipe/core/pipe_impl.cc:376] Pipe B_to_A received a readDescriptor request (#0)
V0921 07:24:31.525707 89 tensorpipe/core/pipe_impl.cc:845] Pipe B_to_A is reading nop object (message descriptor #0)
V0921 07:24:34.575529 89 tensorpipe/channel/channel_impl_boilerplate.h:283] Channel B_to_A.ch_mpt_uv is handling error transport retry counter exceeded (this error originated at tensorpipe/transport/ibv/connection_impl.cc:478)
V0921 07:24:34.576221 89 tensorpipe/core/pipe_impl.cc:634] Pipe B_to_A is handling error transport retry counter exceeded (this error originated at tensorpipe/transport/ibv/connection_impl.cc:478)
V0921 07:24:34.576272 89 tensorpipe/channel/channel_impl_boilerplate.h:264] Channel B_to_A.ch_mpt_uv is closing
V0921 07:24:34.576319 89 tensorpipe/channel/channel_impl_boilerplate.h:264] Channel B_to_A.ch_cuda_basic is closing
V0921 07:24:34.576328 89 tensorpipe/channel/channel_impl_boilerplate.h:283] Channel B_to_A.ch_cuda_basic is handling error channel closed (this error originated at tensorpipe/channel/channel_impl_boilerplate.h:265)
V0921 07:24:34.576354 89 tensorpipe/channel/channel_impl_boilerplate.h:264] Channel B_to_A.ch_cuda_basic.cpu is closing
V0921 07:24:34.576362 89 tensorpipe/channel/channel_impl_boilerplate.h:283] Channel B_to_A.ch_cuda_basic.cpu is handling error channel closed (this error originated at tensorpipe/channel/channel_impl_boilerplate.h:265)
V0921 07:24:34.576381 89 tensorpipe/core/pipe_impl.cc:849] Pipe B_to_A done reading nop object (message descriptor #0)
V0921 07:24:34.576391 89 tensorpipe/core/pipe_impl.cc:382] Pipe B_to_A is calling a readDescriptor callback (#0)
[W tensorpipe_agent.cpp:940] RPC agent for B encountered error when reading incoming response from A: transport retry counter exceeded (this error originated at tensorpipe/transport/ibv/connection_impl.cc:478)
V0921 07:24:34.576442 89 tensorpipe/core/pipe_impl.cc:385] Pipe B_to_A done calling a readDescriptor callback (#0)
Traceback (most recent call last):
File "test_dist.py", line 470, in <module>
setup(int(args.rank), 2, False, str(args.init_method))
File "test_dist.py", line 425, in setup
rpc.init_rpc(name, rank=rank, world_size=world_size, rpc_backend_options=rpc_backend_options)
File "/opt/conda/lib/python3.7/site-packages/torch/distributed/rpc/__init__.py", line 196, in init_rpc
_init_rpc_backend(backend, store, name, rank, world_size, rpc_backend_options)
File "/opt/conda/lib/python3.7/site-packages/torch/distributed/rpc/__init__.py", line 237, in _init_rpc_backend
rpc_backend_options=rpc_backend_options,
File "/opt/conda/lib/python3.7/site-packages/torch/distributed/rpc/backend_registry.py", line 101, in init_backend
return backend.value.init_backend_handler(*args, **kwargs)
File "/opt/conda/lib/python3.7/site-packages/torch/distributed/rpc/backend_registry.py", line 359, in _tensorpipe_init_backend_handler
api._all_gather(None, timeout=rpc_backend_options.rpc_timeout)
File "/opt/conda/lib/python3.7/site-packages/torch/distributed/rpc/api.py", line 83, in wrapper
return func(*args, **kwargs)
File "/opt/conda/lib/python3.7/site-packages/torch/distributed/rpc/api.py", line 226, in _all_gather
timeout=rpc_timeout,
File "/opt/conda/lib/python3.7/site-packages/torch/distributed/rpc/api.py", line 83, in wrapper
return func(*args, **kwargs)
File "/opt/conda/lib/python3.7/site-packages/torch/distributed/rpc/api.py", line 800, in rpc_sync
return fut.wait()
RuntimeError: transport retry counter exceeded (this error originated at tensorpipe/transport/ibv/connection_impl.cc:478)
V0921 07:24:34.744256 89 tensorpipe/core/context_impl.cc:220] Context B is closing
V0921 07:24:34.744301 89 tensorpipe/core/context_impl.cc:238] Context B is handling error context closed (this error originated at tensorpipe/core/context_impl.cc:221)
Here is the rank 1 log, name is A:
tcp://k8snccltest-serial-0.di-k8snccltest.xlab.svc.k8s.cluster:50516
[09-21 09:01:44] DEBUG not set NCCL_IB_GID_INDEX func.py:67
[09-21 09:01:44] DEBUG 1.12.1 func.py:67
[09-21 09:01:44] DEBUG name:A, rank:1, init_method:tcp://k8snccltest-serial-0.di-k8snccltest.xlab.svc.k8s.cluster:50516 func.py:67
V0921 09:01:44.481227 89 tensorpipe/core/context_impl.cc:53] Context 89:c0 created
V0921 09:01:44.481273 89 tensorpipe/core/context_impl.cc:55] Context 89:c0 aliased as A
V0921 09:01:44.494768 89 tensorpipe/core/context_impl.cc:81] Context A is registering transport ibv
V0921 09:01:44.494844 89 tensorpipe/core/context_impl.cc:81] Context A is registering transport uv
V0921 09:01:44.497126 89 tensorpipe/core/context_impl.cc:81] Context A is registering transport shm
V0921 09:01:44.497258 89 tensorpipe/core/context_impl.cc:104] Context A is registering channel cuda_basic
V0921 09:01:44.497267 89 tensorpipe/channel/context_impl_boilerplate.h:191] Channel context N/A was renamed to A.ch_cuda_basic
V0921 09:01:44.497273 89 tensorpipe/channel/context_impl_boilerplate.h:191] Channel context N/A was renamed to A.ch_cuda_basic.cpu
V0921 09:01:44.502026 89 tensorpipe/core/context_impl.cc:104] Context A is registering channel cuda_xth
V0921 09:01:44.502036 89 tensorpipe/channel/context_impl_boilerplate.h:191] Channel context N/A was renamed to A.ch_cuda_xth
V0921 09:01:44.502157 89 tensorpipe/channel/cma/context_impl.cc:225] Couldn't detect the active Linux Security Modules
V0921 09:01:44.502223 89 tensorpipe/channel/cma/context_impl.cc:253] YAMA ptrace scope set to classic ptrace permissions
V0921 09:01:44.502234 89 tensorpipe/channel/cma/context_impl.cc:296] The domain descriptor for CMA is cma:3d1e04ad-0931-4975-b822-467e2b3b1cee_100085_f000197b_100085_effffffd_0_0_a80425fb
V0921 09:01:44.502283 89 tensorpipe/core/context_impl.cc:104] Context A is registering channel cma
V0921 09:01:44.502290 89 tensorpipe/channel/context_impl_boilerplate.h:191] Channel context N/A was renamed to A.ch_cma
V0921 09:01:44.538451 89 tensorpipe/channel/cuda_ipc/context_impl.cc:254] The UUIDs of all the GPUs found by the CUDA IPC channel are 19071872-e206-4660-2142-9651e3f148a8
V0921 09:01:44.538464 89 tensorpipe/channel/cuda_ipc/context_impl.cc:256] The peer-to-peer support found by the CUDA IPC channel is {{1}}
V0921 09:01:44.538490 89 tensorpipe/core/context_impl.cc:104] Context A is registering channel cuda_ipc
V0921 09:01:44.538497 89 tensorpipe/channel/context_impl_boilerplate.h:191] Channel context N/A was renamed to A.ch_cuda_ipc
V0921 09:01:44.538509 89 tensorpipe/core/context_impl.cc:104] Context A is registering channel basic
V0921 09:01:44.538515 89 tensorpipe/channel/context_impl_boilerplate.h:191] Channel context N/A was renamed to A.ch_basic
V0921 09:01:44.580634 89 tensorpipe/core/context_impl.cc:104] Context A is registering channel mpt_uv
V0921 09:01:44.580645 89 tensorpipe/channel/context_impl_boilerplate.h:191] Channel context N/A was renamed to A.ch_mpt_uv
V0921 09:01:44.580795 89 tensorpipe/channel/cuda_gdr/context_impl.cc:479] CUDA GDR channel is not viable because the nv_peer_mem kernel module isn't active
V0921 09:01:44.580809 89 tensorpipe/core/context_impl.cc:116] Context A is opening listener A[l0]
V0921 09:01:44.581122 89 tensorpipe/core/listener_impl.cc:269] Listener A[l0] is accepting connection on transport shm
V0921 09:01:44.581132 89 tensorpipe/core/listener_impl.cc:269] Listener A[l0] is accepting connection on transport ibv
V0921 09:01:44.581136 89 tensorpipe/core/listener_impl.cc:269] Listener A[l0] is accepting connection on transport uv
V0921 09:01:44.754984 89 tensorpipe/core/listener_impl.cc:275] Listener A[l0] done accepting connection on transport uv
V0921 09:01:44.755039 89 tensorpipe/core/listener_impl.cc:241] Listener A[l0] is reading nop object (spontaneous or requested connection)
V0921 09:01:44.755054 89 tensorpipe/core/listener_impl.cc:269] Listener A[l0] is accepting connection on transport uv
V0921 09:01:44.755066 89 tensorpipe/core/listener_impl.cc:102] Listener A[l0] received an accept request (#0)
V0921 09:01:44.755142 89 tensorpipe/core/listener_impl.cc:248] Listener A[l0] done reading nop object (spontaneous or requested connection)
V0921 09:01:44.755171 89 tensorpipe/core/listener_impl.cc:293] Listener A[l0] got spontaneous connection
V0921 09:01:44.755259 89 tensorpipe/core/listener_impl.cc:295] Listener A[l0] is opening pipe A[l0].p0
V0921 09:01:44.755266 89 tensorpipe/core/listener_impl.cc:299] Pipe A[l0].p0 aliased as A[l0]_from_B
V0921 09:01:44.755295 89 tensorpipe/core/pipe_impl.cc:332] Pipe A[l0]_from_B is reading nop object (brochure)
V0921 09:01:44.755305 89 tensorpipe/core/listener_impl.cc:108] Listener A[l0] is calling an accept callback (#0)
V0921 09:01:44.755323 89 tensorpipe/core/listener_impl.cc:111] Listener A[l0] done calling an accept callback (#0)
V0921 09:01:44.755331 89 tensorpipe/core/listener_impl.cc:102] Listener A[l0] received an accept request (#1)
V0921 09:01:44.755341 89 tensorpipe/core/pipe_impl.cc:376] Pipe A[l0]_from_B received a readDescriptor request (#0)
V0921 09:01:44.755433 89 tensorpipe/core/pipe_impl.cc:335] Pipe A[l0]_from_B done reading nop object (brochure)
V0921 09:01:44.755483 89 tensorpipe/core/pipe_impl.cc:1046] Pipe A[l0]_from_B is requesting connection (as replacement)
V0921 09:01:44.755557 89 tensorpipe/core/listener_impl.cc:154] Listener A[l0] received a connection request registration (#0)
V0921 09:01:44.755566 89 tensorpipe/core/pipe_impl.cc:1046] Pipe A[l0]_from_B is requesting connection (as replacement)
V0921 09:01:44.755571 89 tensorpipe/core/listener_impl.cc:154] Listener A[l0] received a connection request registration (#1)
V0921 09:01:44.755706 89 tensorpipe/core/pipe_impl.cc:1073] Pipe A[l0]_from_B is requesting connection 0/2 (for channel cuda_basic)
V0921 09:01:44.755761 89 tensorpipe/core/listener_impl.cc:154] Listener A[l0] received a connection request registration (#2)
V0921 09:01:44.755765 89 tensorpipe/core/pipe_impl.cc:1073] Pipe A[l0]_from_B is requesting connection 1/2 (for channel cuda_basic)
V0921 09:01:44.755768 89 tensorpipe/core/listener_impl.cc:154] Listener A[l0] received a connection request registration (#3)
V0921 09:01:44.755774 89 tensorpipe/core/pipe_impl.cc:1073] Pipe A[l0]_from_B is requesting connection 0/1 (for channel mpt_uv)
V0921 09:01:44.755777 89 tensorpipe/core/listener_impl.cc:154] Listener A[l0] received a connection request registration (#4)
V0921 09:01:44.755781 89 tensorpipe/core/pipe_impl.cc:1028] Pipe A[l0]_from_B is writing nop object (brochure answer)
V0921 09:01:44.755965 89 tensorpipe/core/pipe_impl.cc:1031] Pipe A[l0]_from_B done writing nop object (brochure answer)
V0921 09:01:45.056188 89 tensorpipe/core/listener_impl.cc:275] Listener A[l0] done accepting connection on transport ibv
V0921 09:01:45.056226 89 tensorpipe/core/listener_impl.cc:241] Listener A[l0] is reading nop object (spontaneous or requested connection)
V0921 09:01:45.056237 89 tensorpipe/core/listener_impl.cc:269] Listener A[l0] is accepting connection on transport ibv
V0921 09:01:45.256277 89 tensorpipe/core/listener_impl.cc:275] Listener A[l0] done accepting connection on transport ibv
V0921 09:01:45.256319 89 tensorpipe/core/listener_impl.cc:241] Listener A[l0] is reading nop object (spontaneous or requested connection)
V0921 09:01:45.256326 89 tensorpipe/core/listener_impl.cc:269] Listener A[l0] is accepting connection on transport ibv
V0921 09:01:45.259347 89 tensorpipe/core/listener_impl.cc:275] Listener A[l0] done accepting connection on transport ibv
V0921 09:01:45.259359 89 tensorpipe/core/listener_impl.cc:241] Listener A[l0] is reading nop object (spontaneous or requested connection)
V0921 09:01:45.259363 89 tensorpipe/core/listener_impl.cc:269] Listener A[l0] is accepting connection on transport ibv
V0921 09:01:45.261365 89 tensorpipe/core/listener_impl.cc:275] Listener A[l0] done accepting connection on transport ibv
V0921 09:01:45.261375 89 tensorpipe/core/listener_impl.cc:241] Listener A[l0] is reading nop object (spontaneous or requested connection)
V0921 09:01:45.261379 89 tensorpipe/core/listener_impl.cc:269] Listener A[l0] is accepting connection on transport ibv
V0921 09:01:45.263591 89 tensorpipe/core/listener_impl.cc:275] Listener A[l0] done accepting connection on transport ibv
V0921 09:01:45.263600 89 tensorpipe/core/listener_impl.cc:241] Listener A[l0] is reading nop object (spontaneous or requested connection)
V0921 09:01:45.263604 89 tensorpipe/core/listener_impl.cc:269] Listener A[l0] is accepting connection on transport ibv
V0921 09:01:49.328515 89 tensorpipe/core/listener_impl.cc:207] Listener A[l0] is handling error eof (this error originated at tensorpipe/transport/ibv/connection_impl.cc:302)
V0921 09:01:49.328582 89 tensorpipe/core/listener_impl.cc:108] Listener A[l0] is calling an accept callback (#1)
[W tensorpipe_agent.cpp:530] RPC agent for A encountered error when accepting incoming pipe: eof (this error originated at tensorpipe/transport/ibv/connection_impl.cc:302)
V0921 09:01:49.328604 89 tensorpipe/core/listener_impl.cc:111] Listener A[l0] done calling an accept callback (#1)
V0921 09:01:49.328610 89 tensorpipe/core/listener_impl.cc:162] Listener A[l0] is calling a connection request registration callback (#4)
V0921 09:01:49.328617 89 tensorpipe/core/listener_impl.cc:166] Listener A[l0] done calling a connection request registration callback (#4)
V0921 09:01:49.328621 89 tensorpipe/core/listener_impl.cc:162] Listener A[l0] is calling a connection request registration callback (#3)
V0921 09:01:49.328624 89 tensorpipe/core/listener_impl.cc:166] Listener A[l0] done calling a connection request registration callback (#3)
V0921 09:01:49.328628 89 tensorpipe/core/listener_impl.cc:162] Listener A[l0] is calling a connection request registration callback (#2)
V0921 09:01:49.328631 89 tensorpipe/core/listener_impl.cc:166] Listener A[l0] done calling a connection request registration callback (#2)
V0921 09:01:49.328634 89 tensorpipe/core/listener_impl.cc:162] Listener A[l0] is calling a connection request registration callback (#0)
V0921 09:01:49.328639 89 tensorpipe/core/listener_impl.cc:166] Listener A[l0] done calling a connection request registration callback (#0)
V0921 09:01:49.328642 89 tensorpipe/core/listener_impl.cc:162] Listener A[l0] is calling a connection request registration callback (#1)
V0921 09:01:49.328658 89 tensorpipe/core/listener_impl.cc:166] Listener A[l0] done calling a connection request registration callback (#1)
V0921 09:01:49.328689 89 tensorpipe/core/listener_impl.cc:248] Listener A[l0] done reading nop object (spontaneous or requested connection)
V0921 09:01:49.328698 89 tensorpipe/core/pipe_impl.cc:634] Pipe A[l0]_from_B is handling error eof (this error originated at tensorpipe/transport/ibv/connection_impl.cc:302)
V0921 09:01:49.328707 89 tensorpipe/core/listener_impl.cc:183] Listener A[l0] received a connection request de-registration (#1)
V0921 09:01:49.328712 89 tensorpipe/core/listener_impl.cc:183] Listener A[l0] received a connection request de-registration (#0)
V0921 09:01:49.328716 89 tensorpipe/core/listener_impl.cc:183] Listener A[l0] received a connection request de-registration (#4)
V0921 09:01:49.328720 89 tensorpipe/core/listener_impl.cc:183] Listener A[l0] received a connection request de-registration (#2)
V0921 09:01:49.328726 89 tensorpipe/core/listener_impl.cc:183] Listener A[l0] received a connection request de-registration (#3)
V0921 09:01:49.328733 89 tensorpipe/core/pipe_impl.cc:382] Pipe A[l0]_from_B is calling a readDescriptor callback (#0)
[W tensorpipe_agent.cpp:726] RPC agent for A encountered error when reading incoming request from B: eof (this error originated at tensorpipe/transport/ibv/connection_impl.cc:302)
V0921 09:01:49.328756 89 tensorpipe/core/pipe_impl.cc:385] Pipe A[l0]_from_B done calling a readDescriptor callback (#0)
V0921 09:01:49.328765 89 tensorpipe/core/pipe_impl.cc:1081] Pipe A[l0]_from_B done requesting connection 0/1 (for channel mpt_uv)
V0921 09:01:49.328770 89 tensorpipe/core/pipe_impl.cc:1081] Pipe A[l0]_from_B done requesting connection 1/2 (for channel cuda_basic)
V0921 09:01:49.328776 89 tensorpipe/core/pipe_impl.cc:1081] Pipe A[l0]_from_B done requesting connection 0/2 (for channel cuda_basic)
V0921 09:01:49.328780 89 tensorpipe/core/pipe_impl.cc:1052] Pipe A[l0]_from_B done requesting connection (as replacement)
V0921 09:01:49.328787 89 tensorpipe/core/pipe_impl.cc:1052] Pipe A[l0]_from_B done requesting connection (as replacement)
V0921 09:01:49.328790 89 tensorpipe/core/listener_impl.cc:275] Listener A[l0] done accepting connection on transport uv
V0921 09:01:49.328797 89 tensorpipe/core/listener_impl.cc:275] Listener A[l0] done accepting connection on transport shm
V0921 09:01:49.328802 89 tensorpipe/core/pipe_impl.cc:355] Pipe A[l0]_from_B is closing
V0921 09:01:49.329226 89 tensorpipe/core/listener_impl.cc:275] Listener A[l0] done accepting connection on transport ibv
V0921 09:01:49.329238 89 tensorpipe/core/listener_impl.cc:248] Listener A[l0] done reading nop object (spontaneous or requested connection)
V0921 09:01:49.329491 89 tensorpipe/core/listener_impl.cc:248] Listener A[l0] done reading nop object (spontaneous or requested connection)
V0921 09:01:49.329795 89 tensorpipe/core/listener_impl.cc:248] Listener A[l0] done reading nop object (spontaneous or requested connection)
V0921 09:01:49.330076 89 tensorpipe/core/listener_impl.cc:248] Listener A[l0] done reading nop object (spontaneous or requested connection)
I found issue #452, which is very similar to my problem, but my network environment is InfiniBand, not RoCE. I also checked the LID and GID of the running environment, as follows:
//ibstat
CA 'mlx5_0'
CA type: MT4123
Number of ports: 1
Firmware version: 20.28.2006
Hardware version: 0
Node GUID: 0x043f720300fc827c
System image GUID: 0x043f720300fc827c
Port 1:
State: Active
Physical state: LinkUp
Rate: 200
Base lid: 141
LMC: 0
SM lid: 1
Capability mask: 0x2651e848
Port GUID: 0x043f720300fc827c
Link layer: InfiniBand
CA 'mlx5_1'
CA type: MT4123
Number of ports: 1
Firmware version: 20.28.2006
Hardware version: 0
Node GUID: 0x043f720300fc8288
System image GUID: 0x043f720300fc8288
Port 1:
State: Down
Physical state: Disabled
Rate: 10
Base lid: 65535
LMC: 0
SM lid: 0
Capability mask: 0x2651e848
Port GUID: 0x043f720300fc8288
Link layer: InfiniBand
//show_gids
DEV PORT INDEX GID IPv4 VER DEV
--- ---- ----- --- ------------ --- ---
mlx5_0 1 0 fe80:0000:0000:0000:043f:7203:00fc:827c v1
mlx5_1 1 0 fe80:0000:0000:0000:043f:7203:00fc:8288 v1
Because the LID (Local Identifier) is a layer-2 attribute of the InfiniBand protocol stack, it does not need to be set in a RoCE network; querying the port there returns 0 for this field. I also read the tensorpipe source code: if the LID is not 0, the GID index has no effect on qp_attr. So I think my error is probably not caused by kGlobalIdentifierIndex.
tensorpipe/common/ibv.cc:117
// Global routing is only set up as far as needed to support RoCE.
attrMask |= IbvLib::QP_AV;
if (destinationInfo.localIdentifier != 0) {
  attr.ah_attr.is_global = 0;
  attr.ah_attr.dlid = destinationInfo.localIdentifier;
} else {
  attr.ah_attr.is_global = 1;
  attr.ah_attr.grh.dgid = destinationInfo.globalIdentifier;
  attr.ah_attr.grh.sgid_index = selfAddr.globalIdentifierIndex;
  attr.ah_attr.grh.hop_limit = 1;
}
In addition, I used PyTorch's NCCL init_process_group API in the same environment and could not reproduce the error above. And if I start two processes in the same pod and run the same torch RPC code, everything works fine.
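For reference, the NCCL control experiment looked roughly like this (a hedged sketch only, since the exact script isn't shown here; the all_reduce call is just an illustrative payload):

import argparse
import torch
import torch.distributed as dist

parser = argparse.ArgumentParser(description='NCCL control test')
parser.add_argument('--rank', type=int)
parser.add_argument('--init_method', type=str)
args = parser.parse_args()

# Same tcp:// rendezvous URL as in the RPC test above.
dist.init_process_group(backend="nccl", init_method=args.init_method,
                        rank=args.rank, world_size=2)
t = torch.ones(4, device="cuda:0") * (args.rank + 1)
dist.all_reduce(t)  # completes without the retry-counter error
print(t)
dist.destroy_process_group()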
Container Env Versions
pytorch                   1.12.1    py3.8_cuda11.3_cudnn8.3.2_0    pytorch
pytorch-mutex             1.0       cuda                           pytorch
torchaudio                0.12.1    py38_cu113                     pytorch
torchelastic              0.2.1     pypi_0                         pypi
torchvision               0.13.1    py38_cu113                     pytorch
MLNX_OFED_LINUX-4.6-1.0.1.1
Ubuntu 18.04
I suspect the MLNX_OFED version may be too old, and I am looking into this.
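While investigating, a possible stopgap is to keep TensorPipe off the ibv transport altogether. This is only a hedged sketch: _transports (and _channels) are private, undocumented options of TensorPipeRpcBackendOptions, so treat them as an assumption rather than a supported API.

# Hedged workaround sketch: restrict TensorPipe to the uv (TCP) transport so the
# ibv code path is never exercised. _transports is a private, undocumented knob.
rpc_backend_options = rpc.TensorPipeRpcBackendOptions(
    num_worker_threads=8,
    init_method=init_method,
    _transports=["uv"],
)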
After troubleshooting, I finally determined the cause of this problem. On the k8s training cluster in our lab, each compute node is equipped with two IB NICs, but only one of them is usable; the other is not cabled up. The ibstat command shows the following status:
CA 'mlx5_0'
CA type: MT4123
Number of ports: 1
Firmware version: 20.28.2006
Hardware version: 0
Node GUID: 0x043f720300e71090
System image GUID: 0x043f720300e71090
Port 1:
State: Active
Physical state: LinkUp
Rate: 200
Base lid: 147
LMC: 0
SM lid: 1
Capability mask: 0x2651e848
Port GUID: 0x043f720300e71090
Link layer: InfiniBand
CA 'mlx5_1'
CA type: MT4123
Number of ports: 1
Firmware version: 20.28.2006
Hardware version: 0
Node GUID: 0x043f720300e71098
System image GUID: 0x043f720300e71098
Port 1:
State: Down
Physical state: Disabled
Rate: 10
Base lid: 65535
LMC: 0
SM lid: 0
Capability mask: 0x2651e848
Port GUID: 0x043f720300e71098
Link layer: InfiniBand
It can be seen that the port of the NIC "mlx5_1" is down. However, when selecting an IB device, tensorpipe by default picks the first device returned by ibv_get_device_list (the code location is here), and unfortunately on this node that device is "mlx5_1". I confirmed this by debugging tensorpipe and tracing it with gdb.
So I made a slight modification here, so that it traverses the device list and only selects devices whose port state is active, and I was very happy to find that tensorpipe then works perfectly.
However, this strategy may lead to inconsistent device selection on the two sides of the communication. For example, A may have two IB NICs that both work, while B has one good IB NIC and one bad one, which could still cause errors (if the two IB NICs have different network configurations). In addition, the order of devices returned by ibv_get_device_list may be indeterminate. Therefore, it would be a more robust implementation if the two communicating parties could negotiate IB devices and ports in advance, over sockets or some other mechanism.
tensorpipe is the best RDMA-enabled RPC library I've seen, and our projects use it heavily. I very much hope to do my best to contribute to it. I have opened a PR, and feel free to request further changes in review. Thanks :)
Love to hear that you're finding TensorPipe useful (and I would be interested in knowing more about what you're doing with it).
And you're right that the IB device selection of the ibv transport is very primitive, contrary to the one of the gdr channel (which tries to pair GPUs to their nearest NIC). We didn't deem it important because the ibv transport is only used for small control messages thus its performance isn't super critical, but we didn't expect correctness issues like the ones you faced. Some features that could be implemented to improve this are status detection, a device selection flag, potentially opening multiple devices and multiplexing over them, incorporating NUMA awareness, etc.
I'll review your PR and gladly land it. Unfortunately we've stopped actively investing in improving TensorPipe, hence we won't be able to do any of the above changes ourselves. But if you're willing to contribute more fixes I'll happily accept them.
if the two IB NICs have different network configurations
One assumption we made, which I believe will be very hard to relax, is that all the IB cards on all the nodes can talk to each other. This is basically always true for the TCP/IP network, and thus we thought we could impose it also for the IB network. If we wanted to support cases where it isn't so, then we'd need a way to detect if two HCAs are in the same network, and AFAIK this isn't easily doable (except if one has UMAD access, but it tends to require root privileges). Do you have any ideas?
Hi Luca @lw, great to hear your reply.
and I would be interested in knowing more about what you're doing with it
The main scenario where we use tensorpipe is reinforcement learning (RL) training, which mainly involves two kinds of processes: the collector (which may contain multiple subprocesses) interacts with the RL environment (using the CPU) and performs model inference (using the GPU); the learner is responsible for model training (using the GPU).
The data flow of collector and learner is:
- The learner needs to transfer the model located in the GPU memory to the collector's GPU memory.
- The collector needs to transfer the observation data of its own environment (located in the host memory) to the GPU memory of the learner.
In most cases, there is one learner and multiple collectors. The number of collectors may dynamically increase or decrease based on demand.
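To make the flow concrete, here is a minimal hedged sketch (the worker names, the tiny nn.Linear model, and the helper functions are illustrative assumptions rather than our real code; it assumes init_rpc has been called with device maps set in both directions, as in the test script above):

import torch
import torch.nn as nn
import torch.distributed.rpc as rpc

# Each process holds its own copy of a small model; both processes must define the
# same module-level helpers so they can be resolved over RPC.
model = nn.Linear(16, 4).cuda()

def collector_load_weights(state_dict):
    # Runs on the collector: with the device map, the weights arrive on its cuda:0.
    model.load_state_dict(state_dict)

def learner_ingest(obs_batch):
    # Runs on the learner: observations arrive in host memory and are moved to its GPU.
    return obs_batch.to("cuda:0").mean().item()

role = "learner"  # set per process; assumes workers named "learner" and "collector0"
if role == "learner":
    # Learner -> collector: push GPU-resident weights into the collector's GPU memory.
    rpc.rpc_sync("collector0", collector_load_weights, args=(model.state_dict(),))
else:
    # Collector -> learner: ship a batch of CPU observations to the learner.
    rpc.rpc_sync("learner", learner_ingest, args=(torch.randn(32, 16),))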
In our k8s environment, these workers are encapsulated in Pods and scheduled to run on physical nodes. Pods use containers to isolate processes, so even if two Pods are scheduled to the same physical node, communication methods such as shm cannot be used without additional configuration (such as mounting /dev/shm to the container).
So we hope that the ideal communication library has the following capabilities:
- Support a variety of different P2P communication methods (GDR, IB, TCP, shm, cuda-ipc), and choose the best method according to different scenarios.
- Support RPC agents dynamically joining and leaving.
- Container awareness.
I tried to use NCCL's send/recv API to achieve the above goals, but I found to my dismay that NCCL is not a good fit for our scenario:
- PyTorch's NCCL API currently does not support specifying different CUDA streams for NCCL send/recv. It is easy to deadlock if I don't carefully match the calling order of send and recv.
- NCCL send/recv involves CUDA kernels; data located in host memory is not supported, so an additional copy into CUDA memory is required.
- When NCCL establishes a communication group, it needs to build a topology graph over all processes, so a new process joining or leaving becomes troublesome: we have to stop the work of all processes and then create a new communication group.
Finally I discovered tensorpipe, and I was very excited to find it worked well for most of our needs. Hence the above story.
ibv transport is only used for small control messages thus its performance isn't super critical
I would like to ask for your advice. I read issue #405 and learned that the transport transmits control information while the channel transmits the heavy payloads, and that a channel uses one of the transports (uv, ibv, shm) for the actual communication. But in PyTorch's RPC backend code I only found a channel implemented with uv, see here: . I'm curious: IB has better performance than TCP in both bandwidth and latency, so why is there no channel implemented by ibv?
I'll review your PR and gladly land it. Unfortunately we've stopped actively investing in improving TensorPipe, hence we won't be able to do any of the above changes ourselves. But if you're willing to contribute more fixes I'll happily accept them.
It's a pity that PyTorch will no longer keep investing in tensorpipe. The heterogeneous data (CPU/GPU) and the dynamically changing set of communicating entities in RL training are difficult to implement with existing APIs, and tensorpipe solves this problem perfectly. I'd love to continue contributing to tensorpipe and hope this awesome project stays active.
then we'd need a way to detect if two HCAs are in the same network, and AFAIK this isn't easily doable (except if one has UMAD access, but it tends to require root privileges). Do you have any ideas?
Sorry, I don't have any more ideas right now; I think I need to read more documentation. But the good news is that tensorpipe now runs perfectly on our cluster. I'll update here if I come up with anything.
why is there no channel implemented by ibv?
- There is one channel, called cuda_gdr, which also uses IB but is independent from the ibv transport. Currently cuda_gdr only supports CUDA memory, but we designed it so that it can easily start handling CPU memory too (we just need to implement it).
- There is the basic channel, which is simply an adapter that allows any transport to be used as a channel. It automatically selects the best transport, therefore you won't find an explicit instantiation of basic with ibv anywhere in the codebase. We consider this our "fallback" channel because it's a channel that will always work, although it isn't particularly specialized and might perform worse than ad-hoc channels.
- For TCP-based connections, the basic channel isn't able to achieve a very high bandwidth, because the data movement (from RAM to NIC) is operated in software by the kernel on the CPU, and a single thread isn't fast enough to drive the NICs at their full speed. That's why we added a "multiplexing" channel that bundles multiple uv transports, and multiple threads, in order to load-balance the bandwidth across them. In IB the copy is handled by the NIC directly, thus we don't have this limitation, and this is why we don't need a multiplexing version of ibv.