[W tensorpipe_agent.cpp:641] RPC agent for Test0 encountered error when sending outgoing request #0 to Test1: ECONNREFUSED: connection refused

froody opened this issue on Aug 07 '20 · 5 comments

Consider the attached program and run it on two machines, e.g.:

machine1% python hello.py -o machine1 -r 0 -t
machine2% python hello.py -o machine1 -r 1 -t

Omitting "-t" (i.e. not using tensorpipe), everything works fine and rank 1 prints:

hello from the other side
got t = tensor([4.2842e+20, 4.5806e-41, 4.2842e+20, 4.5806e-41, 0.0000e+00])

With "-t", (i.e. use tensorpipe) rank 1 hangs and doesn't respond to ^C, and rank 0 prints

dist init r=0, world=2, host=learnfair1228
going to try init_process_group: tcp://learnfair1228:10638
got nccl, now barrier 0
got nccl, now rpc 0
going to try init_rpc tcp://learnfair1228:10639
got rpc 0
[W tensorpipe_agent.cpp:641] RPC agent for Test0 encountered error when sending outgoing request #0 to Test1: ECONNREFUSED: connection refused
Traceback (most recent call last):
  File "hello.py", line 63, in <module>
    torch.distributed.rpc.rpc_sync("Test1", rpc_send, args=(t,))
  File "/private/home/tbirch/.conda/envs/torch160/lib/python3.7/site-packages/torch/distributed/rpc/api.py", line 81, in wrapper
    return func(*args, **kwargs)
  File "/private/home/tbirch/.conda/envs/torch160/lib/python3.7/site-packages/torch/distributed/rpc/api.py", line 752, in rpc_sync
    return fut.wait()
RuntimeError: ECONNREFUSED: connection refused
hello.py:

import torch
import argparse
import os
from torch.distributed import rpc

parser = argparse.ArgumentParser(description="benchmark")
parser.add_argument("--rank", "-r", type=int, help="rank base", default=0)
parser.add_argument("--host", "-o", type=str, default=None, help="hostname")
parser.add_argument("-t",  action='store_true', default=False, help="use tensorpipe")

args = parser.parse_args()

hostname = args.host
world_size = 2
rank = args.rank

if hostname is None:
    hostname = "localhost"
print(f"dist init r={rank}, world={world_size}, host={hostname}")
os.environ["MASTER_ADDR"] = hostname
os.environ["MASTER_PORT"] = "10638"
os.environ["WORLD_SIZE"] = str(world_size)
os.environ["RANK"] = str(rank)
os.environ["GLOO_SOCKET_IFNAME"] = "enp1s0f0"
if torch.__version__ == "1.6.0":
    init_method = f"tcp://{os.environ['MASTER_ADDR']}:{os.environ['MASTER_PORT']}"
    print(f"going to try init_process_group: {init_method}")
    torch.distributed.init_process_group(backend="nccl", rank=rank, world_size=world_size, init_method=init_method)
    print(f"got nccl, now barrier {rank}")
    torch.distributed.barrier()
    print(f"got nccl, now rpc {rank}")
    os.environ["MASTER_ADDR"] = hostname
    os.environ["MASTER_PORT"] = "10639"
    init_method = f"tcp://{os.environ['MASTER_ADDR']}:{os.environ['MASTER_PORT']}"
    print(f"going to try init_rpc {init_method}")
    if args.t:
        rpc.init_rpc(
            f"Test{rank}",
            rank=rank,
            world_size=world_size,
            backend=rpc.BackendType.TENSORPIPE,
            rpc_backend_options=rpc.TensorPipeRpcBackendOptions(
                init_method=init_method
            ),
        )
    else:
        rpc.init_rpc(
            f"Test{rank}",
            rank=rank,
            world_size=world_size,
        )
    print(f"got rpc {rank}")
else:
    rpc.init_rpc(f"Test{rank}", rank=rank, world_size=world_size)
    print(f"got rpc {rank}")

# Runs on rank 1 ("Test1") when rank 0 calls rpc_sync below
def rpc_send(t):
    print(f"hello from the other side")
    print(f"got t = {t}")

# torch.Tensor(5) allocates an uninitialized 5-element tensor (values are arbitrary, as in the output above)
t = torch.Tensor(5)
if rank == 0:
    torch.distributed.rpc.rpc_sync("Test1", rpc_send, args=(t,))

torch.distributed.barrier()

froody avatar Aug 07 '20 05:08 froody

Apparently the solution is to add os.environ["TP_SOCKET_IFNAME"] = "enp1s0f0"
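In hello.py above, that would sit next to the existing GLOO_SOCKET_IFNAME line (enp1s0f0 is specific to these machines; substitute whatever interface the two hosts actually reach each other on):

os.environ["GLOO_SOCKET_IFNAME"] = "enp1s0f0"  # interface Gloo uses for the rendezvous / process group
os.environ["TP_SOCKET_IFNAME"] = "enp1s0f0"    # interface the TensorPipe RPC agent's TCP transport binds to
# Both are read during initialization, so set them before init_process_group / init_rpc.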

froody avatar Aug 11 '20 20:08 froody

I had set both GLOO_SOCKET_IFNAME and TP_SOCKET_IFNAME, but the problem still exists.

boscotsang avatar Sep 10 '20 09:09 boscotsang

I ran into the same problem. Did you find an answer? @boscotsang

SHu0421 avatar Jan 07 '22 12:01 SHu0421

@SHu0421 I have no idea; I haven't used RPC recently.

boscotsang avatar Jan 08 '22 07:01 boscotsang


oh, thanks!

SHu0421 avatar Jan 08 '22 08:01 SHu0421