
Connection refused when trying to connect to Distributed worker

Open · BloodWorkXGaming opened this issue 3 years ago · 12 comments

Hello everyone :) I am trying to use multiple workers with Distributed, but it doesn't work as planned. I always get a connection refused error, as shown below:

ERROR: TaskFailedException

    nested task error: IOError: connect: connection refused (ECONNREFUSED)
    Stacktrace:
     [1] worker_from_id(pg::Distributed.ProcessGroup, i::Int64)
       @ Distributed /usr/local/julia/share/julia/stdlib/v1.7/Distributed/src/cluster.jl:1089
     [2] worker_from_id
       @ /usr/local/julia/share/julia/stdlib/v1.7/Distributed/src/cluster.jl:1086 [inlined]
     [3] #remote_do#166
       @ /usr/local/julia/share/julia/stdlib/v1.7/Distributed/src/remotecall.jl:561 [inlined]
     [4] remote_do
       @ /usr/local/julia/share/julia/stdlib/v1.7/Distributed/src/remotecall.jl:561 [inlined]
     [5] kill(manager::Distributed.SSHManager, pid::Int64, config::WorkerConfig)
       @ Distributed /usr/local/julia/share/julia/stdlib/v1.7/Distributed/src/managers.jl:673
     [6] create_worker(manager::Distributed.SSHManager, wconfig::WorkerConfig)
       @ Distributed /usr/local/julia/share/julia/stdlib/v1.7/Distributed/src/cluster.jl:600
     [7] setup_launched_worker(manager::Distributed.SSHManager, wconfig::WorkerConfig, launched_q::Vector{Int64})
       @ Distributed /usr/local/julia/share/julia/stdlib/v1.7/Distributed/src/cluster.jl:541
     [8] (::Distributed.var"#41#44"{Distributed.SSHManager, Vector{Int64}, WorkerConfig})()
       @ Distributed ./task.jl:423

    caused by: IOError: connect: connection refused (ECONNREFUSED)
    Stacktrace:
     [1] wait_connected(x::Sockets.TCPSocket)
       @ Sockets /usr/local/julia/share/julia/stdlib/v1.7/Sockets/src/Sockets.jl:532
     [2] connect
       @ /usr/local/julia/share/julia/stdlib/v1.7/Sockets/src/Sockets.jl:567 [inlined]
     [3] connect_to_worker(host::String, port::Int64)
       @ Distributed /usr/local/julia/share/julia/stdlib/v1.7/Distributed/src/managers.jl:632
     [4] connect(manager::Distributed.SSHManager, pid::Int64, config::WorkerConfig)
       @ Distributed /usr/local/julia/share/julia/stdlib/v1.7/Distributed/src/managers.jl:559
     [5] create_worker(manager::Distributed.SSHManager, wconfig::WorkerConfig)
       @ Distributed /usr/local/julia/share/julia/stdlib/v1.7/Distributed/src/cluster.jl:596
     [6] setup_launched_worker(manager::Distributed.SSHManager, wconfig::WorkerConfig, launched_q::Vector{Int64})
       @ Distributed /usr/local/julia/share/julia/stdlib/v1.7/Distributed/src/cluster.jl:541
     [7] (::Distributed.var"#41#44"{Distributed.SSHManager, Vector{Int64}, WorkerConfig})()
       @ Distributed ./task.jl:423
Stacktrace:
 [1] sync_end(c::Channel{Any})
   @ Base ./task.jl:381
 [2] macro expansion
   @ ./task.jl:400 [inlined]
 [3] addprocs_locked(manager::Distributed.SSHManager; kwargs::Base.Pairs{Symbol, Any, NTuple{4, Symbol}, NamedTuple{(:exename, :dir, :sshflags, :exeflags), Tuple{String, String, Vector{String}, Cmd}}})
   @ Distributed /usr/local/julia/share/julia/stdlib/v1.7/Distributed/src/cluster.jl:487
 [4] addprocs(manager::Distributed.SSHManager; kwargs::Base.Pairs{Symbol, Any, NTuple{4, Symbol}, NamedTuple{(:exename, :dir, :sshflags, :exeflags), Tuple{String, String, Vector{String}, Cmd}}})
   @ Distributed /usr/local/julia/share/julia/stdlib/v1.7/Distributed/src/cluster.jl:447
 [5] #addprocs#249
   @ /usr/local/julia/share/julia/stdlib/v1.7/Distributed/src/managers.jl:143 [inlined]
 [6] top-level scope
   @ REPL[4]:1

The code I am trying to use to connect is as follows:

using Distributed
keyFile = raw"~/.ssh/juliakey"
addprocs([("[email protected]:2222 0.0.0.0:3000", 1)]; exename="/usr/local/julia/bin/julia", dir="/home/docker", sshflags=["-vvv", "-i", keyFile], exeflags=`--threads=auto`)

This error shows that the SSH connection is working perfectly fine; a Julia instance is even started on the correct port (I checked that with different tools), yet the master can't connect afterwards. Testing with two different machines results in this error. It doesn't matter whether the machine hosts a Docker container with Julia and SSH or whether I use it natively on Windows. It has nothing to do with the firewall, as I disabled it for testing.

The strange thing, however, is that it works when I connect to myself on localhost via SSH.

To add to the confusion, I started two Docker containers with Julia and SSH on my machine and tried to connect between them, with the same result. It seems that as soon as we cross over to another machine, it fails to connect.

I added the Dockerfile, in case anyone wants to test it. Steps to reproduce:

  1. generate a key or use an existing one (update the compose file)
  2. docker-compose up
  3. copy the key to one of the containers: docker cp .\juliakey dockerjl-ssh2-1:/home/docker/.ssh/juliakey
  4. add the IP to known hosts: ssh-keyscan -p 2222 -t rsa 192.168.2.220 >> ~/.ssh/known_hosts
  5. start Julia with /usr/local/julia/bin/julia
  6. run the Julia code above in the REPL
  7. receive the error message

DockerJL.zip

Thanks in advance for any help :)

BloodWorkXGaming · May 10 '22 10:05

Is there some network address translation (NAT) going on between you and the worker, i.e. is the worker perhaps not returning the same IP address as the SSH command connected to in the first place?

The way Distributed works is:

  • addprocs makes an SSH connection to worker W
  • on worker W julia is started, calls getipaddr() and returns that IP address A
  • addprocs now makes a TCP connection to address A

If A is an address behind a NAT, connecting to A will obviously fail to reach the same machine that ssh reached via W, so make sure there is no NAT between you and the worker. You can't use NAT with protocols that communicate numerical IP addresses, whether that is FTP, SIP, or Distributed.
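
To check whether that is what is happening, you can ask the worker machine which address Julia would report. A minimal diagnostic sketch, to be run on the worker (e.g. over SSH):

  using Sockets

  # The address the worker will report back to the master
  # (the first non-loopback IPv4 address it finds):
  println(getipaddr())

  # All candidate addresses; if the address ssh connected to is not
  # the first IPv4 entry here, addprocs will try the wrong one:
  foreach(println, getipaddrs())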

mgkuhn · May 10 '22 18:05

Thank you very much for the tip! I am not behind a NAT, but this led me to the issue: I have Docker and other networking services installed, which pollute the address list returned by getipaddrs(). The singular function getipaddr() returns the first IPv4 value in that list, which happens to be some kind of internal Docker IP address. Therefore, the master can connect via SSH, but not via the newly returned IP, as that is only reachable internally.

Is there any way to force it to use the same IP as for SSH? Or even better, is there a way to force Julia to use a specific IP address in getipaddr(), e.g. via a command-line argument?

Thanks!

BloodWorkXGaming · May 10 '22 20:05

Okay, so I made further tests with the bind-to feature. Using that, I can force the worker to use a specific IP address both for binding and for reporting back to the master.
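
For reference, this is the machine-spec syntax I mean (a sketch based on the setup above; the addresses, ports, and key path are placeholders):

  using Distributed

  # "user@host:sshport bind_addr:bind_port" -- everything after the
  # space tells the worker which address (and port) to bind to and
  # to report back to the master:
  addprocs([("docker@192.168.2.220:2222 192.168.2.220:3000", 1)];
           exename = "/usr/local/julia/bin/julia",
           sshflags = ["-i", expanduser("~/.ssh/juliakey")])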

However, this is still a problem in more complex cases, like when running in two Docker containers: inside the container, the worker only knows its internal IP, not the IP reachable from the outside.

Therefore, I would suggest adding two optional environment variables and printing them out on the worker, if available: https://github.com/JuliaLang/julia/blob/bf534986350a991e4a1b29126de0342ffd76205e/stdlib/Distributed/src/cluster.jl#L252-L254

This would also help a lot in other cases, e.g. where the worker sits behind a reverse proxy, or where the ports are mapped to different ports by Docker.
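
A rough sketch of what I mean (the environment variable names are made up for illustration; bind_addr and bind_port stand for whatever the worker actually bound to):

  using Sockets

  # Stand-ins for the address and port the worker actually bound to:
  bind_addr = string(getipaddr())
  bind_port = 9009

  # Made-up override variables: if set, report these to the master
  # instead, e.g. the externally mapped Docker address and port:
  host = get(ENV, "JULIA_WORKER_ADVERTISED_HOST", bind_addr)
  port = get(ENV, "JULIA_WORKER_ADVERTISED_PORT", string(bind_port))
  println("julia_worker:", port, "#", host)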

BloodWorkXGaming · May 11 '22 07:05

I had long planned to submit a PR replacing the use of getipaddr() in Distributed, because that function simply makes no sense: devices can of course have multiple IP addresses. I'd like a worker that has been invoked via SSH to instead bind to the same IP address that SSH used, because that one is known to work, and it can usually be fetched from the third word of the environment variable SSH_CONNECTION, as in

--- a/stdlib/Distributed/src/cluster.jl
+++ b/stdlib/Distributed/src/cluster.jl
@@ -1276,7 +1276,12 @@ function init_bind_addr()
     else
         bind_port = 0
         try
-            bind_addr = string(getipaddr())
+            if haskey(ENV, "SSH_CONNECTION")
+                # reuse the IP address that ssh already used to get here
+                bind_addr = split(ENV["SSH_CONNECTION"], ' ')[3]
+            else
+                bind_addr = string(first(filter(!islinklocaladdr, getipaddrs())))
+            end
         catch
             # All networking is unavailable, initialize bind_addr to the loopback address
             # Will cause an exception to be raised only when used.
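
For reference, SSH_CONNECTION contains "client_ip client_port server_ip server_port", so the third word is the local address that the SSH client connected to. A small illustration (the addresses are placeholders, used only when the variable is unset):

  # SSH_CONNECTION = "client_ip client_port server_ip server_port";
  # word 3 is the address ssh connected to on this machine:
  conn = get(ENV, "SSH_CONNECTION", "203.0.113.7 52311 198.51.100.42 22")
  bind_addr = split(conn, ' ')[3]
  println(bind_addr)   # -> 198.51.100.42 (with the fallback above)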

However, that patch alone will likely break things, because SSH might have used an IPv6 address, and Distributed can't cope with IPv6 addresses yet.

I wrote a half-finished IPv6 patch for Distributed back in the days of Julia 1.3. I'll dig it out and try to finish it into a PR, which I hope will eventually solve your problem as well. Until then, just make sure that your worker has exactly one IPv4 address (apart from loopback).
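
A quick way to verify that on the worker (a sketch using the same islinklocaladdr filter as in the patch above):

  using Sockets

  # getipaddrs() already excludes loopback; after dropping link-local
  # addresses, exactly one IPv4 address should remain:
  addrs = filter(!islinklocaladdr, getipaddrs(IPv4))
  length(addrs) == 1 || @warn "more than one candidate address" addrs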

mgkuhn · May 18 '22 10:05

Hello,

I am having a similar issue trying to launch a Julia program on 2 nodes of a cluster. There is an ECONNREFUSED right at the beginning. I have no administration rights on the cluster, so debugging this is difficult for me.

Do you think this could come from the same problem? What would you suggest to troubleshoot it?

I have already had this problem on two different clusters, so I'm wondering whether this is a common problem or just me.

I'm using the following script to start my program:

  using ClusterManagers
  using Distributed
  using LinearAlgebra

  LinearAlgebra.BLAS.set_num_threads(1)

  ncores = parse(Int, ENV["SLURM_NTASKS"])
  @info "Setting up for SLURM, $ncores tasks detected"; flush(stdout)
  addprocs(SlurmManager(ncores))

and here is the full error:

    nested task error: IOError: connect: connection refused (ECONNREFUSED)
    Stacktrace:
     [1] worker_from_id(pg::Distributed.ProcessGroup, i::Int64)
       @ Distributed /home/share/julia/stdlib/v1.7/Distributed/src/cluster.jl:1089
     [2] worker_from_id
       @ /home/julia-1.7.3/share/julia/stdlib/v1.7/Distributed/src/cluster.jl:1086 [inlined]
     [3] #remote_do#170
       @ /home/julia-1.7.3/share/julia/stdlib/v1.7/Distributed/src/remotecall.jl:561 [inlined]
     [4] remote_do
       @ /home/julia-1.7.3/share/julia/stdlib/v1.7/Distributed/src/remotecall.jl:561 [inlined]
     [5] kill
       @ /home/julia-1.7.3/share/julia/stdlib/v1.7/Distributed/src/managers.jl:668 [inlined]
     [6] create_worker(manager::SlurmManager, wconfig::WorkerConfig)
       @ Distributed /home/julia-1.7.3/share/julia/stdlib/v1.7/Distributed/src/cluster.jl:600
     [7] setup_launched_worker(manager::SlurmManager, wconfig::WorkerConfig, launched_q::Vector{Int64})
       @ Distributed /home/julia-1.7.3/share/julia/stdlib/v1.7/Distributed/src/cluster.jl:541
     [8] (::Distributed.var"#45#48"{SlurmManager, Vector{Int64}, WorkerConfig})()
       @ Distributed ./task.jl:429
    
    caused by: IOError: connect: connection refused (ECONNREFUSED)
    Stacktrace:
     [1] wait_connected(x::Sockets.TCPSocket)
       @ Sockets /home/julia-1.7.3/share/julia/stdlib/v1.7/Sockets/src/Sockets.jl:532
     [2] connect
       @ /home/julia-1.7.3/share/julia/stdlib/v1.7/Sockets/src/Sockets.jl:567 [inlined]
     [3] connect_to_worker(host::String, port::Int64)
       @ Distributed /home/julia-1.7.3/share/julia/stdlib/v1.7/Distributed/src/managers.jl:632
     [4] connect(manager::SlurmManager, pid::Int64, config::WorkerConfig)
       @ Distributed /home/julia-1.7.3/share/julia/stdlib/v1.7/Distributed/src/managers.jl:559
     [5] create_worker(manager::SlurmManager, wconfig::WorkerConfig)
       @ Distributed /home/julia-1.7.3/share/julia/stdlib/v1.7/Distributed/src/cluster.jl:596
     [6] setup_launched_worker(manager::SlurmManager, wconfig::WorkerConfig, launched_q::Vector{Int64})
       @ Distributed /home/julia-1.7.3/share/julia/stdlib/v1.7/Distributed/src/cluster.jl:541
     [7] (::Distributed.var"#45#48"{SlurmManager, Vector{Int64}, WorkerConfig})()
       @ Distributed ./task.jl:429

...and 212 more exceptions.

Please tell me if this is the wrong way to post. Matthias

matthiasbe · Jul 27 '22 09:07

@matthiasbe The IP address issue I mentioned is specific to SSHManager, whereas you seem to use SlurmManager instead, about which I know nothing.

mgkuhn · Jul 27 '22 09:07

Thank you for the quick response. I see the ClusterManagers.jl package has its own behavior (https://github.com/JuliaParallel/ClusterManagers.jl/blob/master/src/slurm.jl).

matthiasbe · Jul 27 '22 09:07

> Is there some network address translation (NAT) going on between you and the worker, i.e. is the worker perhaps not returning the same IP address as the SSH command connected to in the first place?
>
> The way Distributed works is:
>
> • addprocs makes an SSH connection to worker W
> • on worker W julia is started, calls getipaddr() and returns that IP address A
> • addprocs now makes a TCP connection to address A
>
> If A is an address behind a NAT, connecting to A will obviously fail to reach the same machine that ssh reached via W, so make sure there is no NAT between you and the worker. You can't use NAT with protocols that communicate numerical IP addresses, whether that is FTP, SIP, or Distributed.

Hi, thanks for this insight. Just one question regarding this: what ports need to be opened to use Julia like this? I can connect using SSH on port 22, and Julia also starts on the remote machine. However, the TCP connection can't be established afterwards. Can I see/specify the port to be used for this? And is it sufficient to open the ports on the remote machine, or do they have to be open on the client side too?

kevin-kruse · Jan 31 '24 15:01

> What ports need to be opened to use Julia like this? I can connect using SSH on port 22, and Julia also starts on the remote machine. However, the TCP connection can't be established afterwards. Can I see/specify the port to be used for this?

If you are using SSHManager, then by default the master opens an SSH connection to the worker and starts

$ julia --worker
cookie
julia_worker:9880#198.51.100.42

The master supplies a cookie followed by a linefeed via standard input, and the worker then answers with port#ip, i.e. where it binds and thus where it can be contacted via TCP for the actual RPC work outside SSH. By default, the worker lets the kernel pick an arbitrary free TCP port number (which has a high success rate).
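
One way to narrow the problem down is therefore to take the port#ip line the worker printed and try a raw TCP connection to that endpoint from the master. A sketch (using the example endpoint above):

  using Sockets

  # The worker announces its endpoint as "julia_worker:<port>#<ip>":
  line = "julia_worker:9880#198.51.100.42"
  m = match(r"^julia_worker:(\d+)#(.*)$", line)
  port = parse(Int, m.captures[1])
  host = String(m.captures[2])

  # If this raw connect is refused, the address/port is unreachable
  # independently of Distributed:
  sock = connect(host, port)
  close(sock)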

As you can see in Distributed.launch_on_machine, the machine string can have the form [user@]host[:port] bind_addr[:bind_port], and the second half gives you control over the IP address and port number to which the worker binds. But that requires that the port is free, which may not always be the case if it has been used within the last couple of minutes and the TCP session wasn't closed properly.

There is also tunnel mode, which does not use a separate TCP connection outside SSH.
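
A sketch of tunnel mode (the host and julia path are placeholders); with tunnel=true, only the SSH port needs to be reachable from the master:

  using Distributed

  # tunnel=true routes the master-worker connection through an SSH
  # port forward instead of a direct TCP connection:
  addprocs([("user@remote-host", 1)];
           tunnel = true,
           exename = "/usr/local/julia/bin/julia")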

mgkuhn · Jan 31 '24 17:01

This seems to be a usage discussion better suited to the Discourse forum, where there are better controls for message formatting and threading. And if not there, then to the https://github.com/JuliaLang/Distributed.jl/ issue tracker.

vtjnash · Feb 11 '24 00:02

This remains an issue

andreasnoack · Apr 11 '25 09:04

@andreasnoack If you're on SLURM, a lot of the problems may be related to SSH, pam_slurm_adopt, and the cgroup handler. On my cluster, SSH connections were not getting intercepted and added to the cgroup, and so were liable to be killed or run out of memory despite having allocated resources. Or SSH may simply be blocked entirely. I'd stick to SlurmClusterManager, which wraps everything inside an srun call, or roll your own srun-driven solution based on it.
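
A minimal sketch of that approach (assuming the SlurmClusterManager.jl package; the script must itself run inside an sbatch/srun allocation so the SLURM_* variables are set):

  using Distributed
  using SlurmClusterManager

  # SlurmManager() reads the SLURM_* environment variables and starts
  # one worker per allocated task via srun, with no SSH involved:
  addprocs(SlurmManager())
  @everywhere println("worker $(myid()) on $(gethostname())")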

jbphyswx · May 08 '25 18:05