[W tensorpipe_agent.cpp:641] RPC agent for Test0 encountered error when sending outgoing request #0 to Test1: ECONNREFUSED: connection refused
Consider the attached program and run it on two machines, e.g.:
machine1% python hello.py -o machine1 -r 0 -t
machine2% python hello.py -o machine1 -r 1 -t
Omitting "-t" (i.e. not using tensorpipe), everything works fine and rank 1 prints:
hello from the other side
got t = tensor([4.2842e+20, 4.5806e-41, 4.2842e+20, 4.5806e-41, 0.0000e+00])
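(Aside: the arbitrary-looking values in t are expected; torch.Tensor(5) allocates a 5-element tensor without initializing its storage, so the contents are whatever happened to be in memory. A minimal illustration:

import torch

t = torch.Tensor(5)   # uninitialized storage: prints arbitrary values
z = torch.zeros(5)    # initialized: tensor([0., 0., 0., 0., 0.])

So the odd numbers above are not part of the bug.)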
With "-t", (i.e. use tensorpipe) rank 1 hangs and doesn't respond to ^C, and rank 0 prints
dist init r=0, world=2, host=learnfair1228
going to try init_process_group: tcp://learnfair1228:10638
got nccl, now barrier 0
got nccl, now rpc 0
going to try init_rpc tcp://learnfair1228:10639
got rpc 0
[W tensorpipe_agent.cpp:641] RPC agent for Test0 encountered error when sending outgoing request #0 to Test1: ECONNREFUSED: connection refused
Traceback (most recent call last):
  File "hello.py", line 63, in <module>
    torch.distributed.rpc.rpc_sync("Test1", rpc_send, args=(t,))
  File "/private/home/tbirch/.conda/envs/torch160/lib/python3.7/site-packages/torch/distributed/rpc/api.py", line 81, in wrapper
    return func(*args, **kwargs)
  File "/private/home/tbirch/.conda/envs/torch160/lib/python3.7/site-packages/torch/distributed/rpc/api.py", line 752, in rpc_sync
    return fut.wait()
RuntimeError: ECONNREFUSED: connection refused
import torch
import argparse
import os
from torch.distributed import rpc

parser = argparse.ArgumentParser(description="benchmark")
parser.add_argument("--rank", "-r", type=int, help="rank base", default=0)
parser.add_argument("--host", "-o", type=str, default=None, help="hostname")
parser.add_argument("-t", action="store_true", default=False, help="use tensorpipe")
args = parser.parse_args()

hostname = args.host
world_size = 2
rank = args.rank
if hostname is None:
    hostname = "localhost"

print(f"dist init r={rank}, world={world_size}, host={hostname}")
os.environ["MASTER_ADDR"] = hostname
os.environ["MASTER_PORT"] = "10638"
os.environ["WORLD_SIZE"] = str(world_size)
os.environ["RANK"] = str(rank)
os.environ["GLOO_SOCKET_IFNAME"] = "enp1s0f0"

if torch.__version__ == "1.6.0":
    # Rendezvous for the NCCL process group on port 10638.
    init_method = f"tcp://{os.environ['MASTER_ADDR']}:{os.environ['MASTER_PORT']}"
    print(f"going to try init_process_group: {init_method}")
    torch.distributed.init_process_group(
        backend="nccl", rank=rank, world_size=world_size, init_method=init_method
    )
    print(f"got nccl, now barrier {rank}")
    torch.distributed.barrier()
    print(f"got nccl, now rpc {rank}")

    # Separate rendezvous for RPC on port 10639.
    os.environ["MASTER_ADDR"] = hostname
    os.environ["MASTER_PORT"] = "10639"
    init_method = f"tcp://{os.environ['MASTER_ADDR']}:{os.environ['MASTER_PORT']}"
    print(f"going to try init_rpc {init_method}")
    if args.t:
        rpc.init_rpc(
            f"Test{rank}",
            rank=rank,
            world_size=world_size,
            backend=rpc.BackendType.TENSORPIPE,
            rpc_backend_options=rpc.TensorPipeRpcBackendOptions(
                init_method=init_method
            ),
        )
    else:
        rpc.init_rpc(
            f"Test{rank}",
            rank=rank,
            world_size=world_size,
        )
    print(f"got rpc {rank}")
else:
    rpc.init_rpc(f"Test{rank}", rank=rank, world_size=world_size)
    print(f"got rpc {rank}")

def rpc_send(t):
    print("hello from the other side")
    print(f"got t = {t}")

t = torch.Tensor(5)  # uninitialized storage, hence the arbitrary values printed above
if rank == 0:
    torch.distributed.rpc.rpc_sync("Test1", rpc_send, args=(t,))
torch.distributed.barrier()
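(Unrelated to the ECONNREFUSED, but worth noting: the script never tears down the RPC agent. PyTorch's RPC API expects each worker to finish with rpc.shutdown(); a sketch of the script's tail with that added:

if rank == 0:
    torch.distributed.rpc.rpc_sync("Test1", rpc_send, args=(t,))
torch.distributed.barrier()
rpc.shutdown()  # blocks until every worker is done, then destroys the RPC agent

Without it, the agent's background threads can keep the processes from exiting cleanly even when the RPCs themselves succeed.)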
Apparently the solution is to add
os.environ["TP_SOCKET_IFNAME"] = "enp1s0f0"
I had set both GLOO_SOCKET_IFNAME and TP_SOCKET_IFNAME, but the problem still exists.
I also ran into the same problem. Did you find the answer? @boscotsang
@SHu0421 I have no idea, and I don't use RPC any more these days.
oh, thanks!