Connection refused when trying to connect to Distributed worker
Hello everyone :) I am trying to use multiple workers with Distributed, but it doesn't work as planned: I always get a connection refused error, shown below:
ERROR: TaskFailedException
nested task error: IOError: connect: connection refused (ECONNREFUSED)
Stacktrace:
[1] worker_from_id(pg::Distributed.ProcessGroup, i::Int64)
@ Distributed /usr/local/julia/share/julia/stdlib/v1.7/Distributed/src/cluster.jl:1089
[2] worker_from_id
@ /usr/local/julia/share/julia/stdlib/v1.7/Distributed/src/cluster.jl:1086 [inlined]
[3] #remote_do#166
@ /usr/local/julia/share/julia/stdlib/v1.7/Distributed/src/remotecall.jl:561 [inlined]
[4] remote_do
@ /usr/local/julia/share/julia/stdlib/v1.7/Distributed/src/remotecall.jl:561 [inlined]
[5] kill(manager::Distributed.SSHManager, pid::Int64, config::WorkerConfig)
@ Distributed /usr/local/julia/share/julia/stdlib/v1.7/Distributed/src/managers.jl:673
[6] create_worker(manager::Distributed.SSHManager, wconfig::WorkerConfig)
@ Distributed /usr/local/julia/share/julia/stdlib/v1.7/Distributed/src/cluster.jl:600
[7] setup_launched_worker(manager::Distributed.SSHManager, wconfig::WorkerConfig, launched_q::Vector{Int64})
@ Distributed /usr/local/julia/share/julia/stdlib/v1.7/Distributed/src/cluster.jl:541
[8] (::Distributed.var"#41#44"{Distributed.SSHManager, Vector{Int64}, WorkerConfig})()
@ Distributed ./task.jl:423
caused by: IOError: connect: connection refused (ECONNREFUSED)
Stacktrace:
[1] wait_connected(x::Sockets.TCPSocket)
@ Sockets /usr/local/julia/share/julia/stdlib/v1.7/Sockets/src/Sockets.jl:532
[2] connect
@ /usr/local/julia/share/julia/stdlib/v1.7/Sockets/src/Sockets.jl:567 [inlined]
[3] connect_to_worker(host::String, port::Int64)
@ Distributed /usr/local/julia/share/julia/stdlib/v1.7/Distributed/src/managers.jl:632
[4] connect(manager::Distributed.SSHManager, pid::Int64, config::WorkerConfig)
@ Distributed /usr/local/julia/share/julia/stdlib/v1.7/Distributed/src/managers.jl:559
[5] create_worker(manager::Distributed.SSHManager, wconfig::WorkerConfig)
@ Distributed /usr/local/julia/share/julia/stdlib/v1.7/Distributed/src/cluster.jl:596
[6] setup_launched_worker(manager::Distributed.SSHManager, wconfig::WorkerConfig, launched_q::Vector{Int64})
@ Distributed /usr/local/julia/share/julia/stdlib/v1.7/Distributed/src/cluster.jl:541
[7] (::Distributed.var"#41#44"{Distributed.SSHManager, Vector{Int64}, WorkerConfig})()
@ Distributed ./task.jl:423
Stacktrace:
[1] sync_end(c::Channel{Any})
@ Base ./task.jl:381
[2] macro expansion
@ ./task.jl:400 [inlined]
[3] addprocs_locked(manager::Distributed.SSHManager; kwargs::Base.Pairs{Symbol, Any, NTuple{4, Symbol}, NamedTuple{(:exename, :dir, :sshflags, :exeflags), Tuple{String, String, Vector{String}, Cmd}}})
@ Distributed /usr/local/julia/share/julia/stdlib/v1.7/Distributed/src/cluster.jl:487
[4] addprocs(manager::Distributed.SSHManager; kwargs::Base.Pairs{Symbol, Any, NTuple{4, Symbol}, NamedTuple{(:exename, :dir, :sshflags, :exeflags), Tuple{String, String, Vector{String}, Cmd}}})
@ Distributed /usr/local/julia/share/julia/stdlib/v1.7/Distributed/src/cluster.jl:447
[5] #addprocs#249
@ /usr/local/julia/share/julia/stdlib/v1.7/Distributed/src/managers.jl:143 [inlined]
[6] top-level scope
@ REPL[4]:1
The code I am trying to use to connect is as follows:
using Distributed
keyFile = raw"~/.ssh/juliakey"
addprocs([("[email protected]:2222 0.0.0.0:3000", 1)]; exename="/usr/local/julia/bin/julia", dir="/home/docker", sshflags=["-vvv", "-i", keyFile], exeflags=`--threads=auto`)
This error shows that the SSH connection is working perfectly fine; a Julia instance is even started on the correct port (I checked that with different tools), yet it can't connect. Testing with two different machines results in this error, and it doesn't matter whether the machine hosts a Docker container with Julia and SSH or whether I try to use it natively on Windows. It has nothing to do with the firewall, as I disabled it for testing.
The strange thing, however, is that it works when I connect to myself on localhost via SSH.
To add to the confusion, I started two Docker containers with Julia and SSH on my machine and tried to connect between them, with the same result. It seems that as soon as we cross the border to another machine, the connection fails.
I added the Dockerfile, if anyone wants to test it. Steps to reproduce:
- generate a key or use an existing key (update the compose file)
- docker-compose up
- copy the key to one of the containers: docker cp .\juliakey dockerjl-ssh2-1:/home/docker/.ssh/juliakey
- add the IP to known_hosts: ssh-keyscan -p 2222 -t rsa 192.168.2.220 >> ~/.ssh/known_hosts
- start Julia with /usr/local/julia/bin/julia
- run the Julia code above in the REPL
- receive the error message
Thanks in advance for any help :)
Is there some network address translation (NAT) going on between you and the worker, i.e. is the worker perhaps not returning the same IP address as the SSH command connected to in the first place?
The way Distributed works is:
- addprocs makes an SSH connection to worker W
- on worker W, julia is started, which calls getipaddr() and returns that IP address A
- addprocs now makes a TCP connection to address A
If A is an address behind a NAT, ssh A will obviously fail to reach the same machine as ssh W, so make sure there is no NAT between you and the worker. You can't use NAT for protocols that communicate numerical IP addresses, whether that is ftp, SIP or Distributed.
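If you want to check that last TCP step by hand, you can attempt the same connection yourself from the master with plain Sockets; the address and port below are placeholders standing in for whatever address A and port the worker actually reported back:

using Sockets

# Placeholder address A and port, as reported back to addprocs by the worker.
sock = connect(ip"192.168.2.220", 9880)
close(sock)
# An ECONNREFUSED here reproduces the addprocs failure outside of Distributed.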
Thank you very much for the tip!
I am not behind a NAT; however, this led to finding the issue:
I have Docker and other networking services installed, which pollutes the address list returned by getipaddrs(). The singular function getipaddr() returns the first IPv4 value in that list, which happens to be some kind of internal Docker IP address.
Therefore, the connection via SSH works, but the connection to the newly returned IP does not, as that address is only relevant internally.
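The difference is easy to see directly on the worker machine (a minimal sketch; the Docker bridge address in the comment is only a typical example):

using Sockets

getipaddr()    # the single address a worker reports back by default
getipaddrs()   # every address on this host; with Docker installed this list
               # typically also contains bridge addresses such as 172.17.0.1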
Is there any way to force it to use the same IP as for SSH? Or even better, is there a way to force Julia to use a specific IP address in getipaddr(), e.g. via a command-line argument?
Thanks!
Okay, so I made further tests with the bind-to feature. Using that, I can force it to use a specific IP address for binding AND for returning to the Master.
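Roughly like this (a sketch of the bind-to form; 192.168.2.220 stands in for the externally reachable address of the worker):

using Distributed

keyFile = raw"~/.ssh/juliakey"
# machine string: "[user@]host[:port] bind_addr[:bind_port]" – the second half
# sets the address (and port) the worker binds to and reports back to the master
addprocs([("docker@192.168.2.220:2222 192.168.2.220:3000", 1)];
         exename="/usr/local/julia/bin/julia", dir="/home/docker",
         sshflags=["-i", keyFile])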
However, this is still a problem in more complex cases, like running in two Docker containers: inside the container, the worker only knows its internal IP, not the IP accessible from the outside.
Therefore, I would suggest adding two optional environment variables and printing them out on the worker, if available. https://github.com/JuliaLang/julia/blob/bf534986350a991e4a1b29126de0342ffd76205e/stdlib/Distributed/src/cluster.jl#L252-L254
This would help a lot in other cases, e.g. where the worker sits behind a reverse proxy or the ports are mapped to different ports by Docker.
I had long planned to submit a PR to replace the use of getipaddr() in Distributed, because that function simply makes no sense: devices can of course have multiple IP addresses. I'd prefer a worker that has been invoked via SSH to bind to the same IP address that SSH used, because that address is known to work, and it can usually be fetched from the third word of the environment variable SSH_CONNECTION, as in
--- a/stdlib/Distributed/src/cluster.jl
+++ b/stdlib/Distributed/src/cluster.jl
@@ -1276,7 +1276,12 @@ function init_bind_addr()
else
bind_port = 0
try
- bind_addr = string(getipaddr())
+ if haskey(ENV, "SSH_CONNECTION")
+ # reuse the IP address that ssh already used to get here
+ bind_addr = split(ENV["SSH_CONNECTION"], ' ')[3]
+ else
+ bind_addr = string(first(filter(!islinklocaladdr, getipaddrs())))
+ end
catch
# All networking is unavailable, initialize bind_addr to the loopback address
# Will cause an exception to be raised only when used.
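For context, sshd sets SSH_CONNECTION to "client_ip client_port server_ip server_port", so the third word is the address on the worker that ssh actually connected to; a small sketch of the extraction the patch relies on:

using Sockets

# SSH_CONNECTION = "client_ip client_port server_ip server_port"
server_addr = split(ENV["SSH_CONNECTION"], ' ')[3]   # the address ssh connected to
parse(IPAddr, server_addr)                           # may be IPv4 or IPv6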
However that patch alone will likely break things, because SSH might have used an IPv6 address, and Distributed can't cope with IPv6 addresses yet.
I wrote a half-finished IPv6 patch for Distributed back in the days of Julia 1.3. I'll dig it out and try to finish it into a PR, and hope that will eventually solve your problem as well. Until then, just make sure that your worker has exactly one IPv4 address (except for loopback).
Hello,
I am having a similar issue trying to launch a Julia program on 2 nodes of a cluster.
There is an ECONNREFUSED at the beginning. I have no administration rights on the cluster, so I have a lot of trouble debugging this.
Do you think this could come from the same problem? What would you suggest to troubleshoot it?
I have already had this problem on two different clusters, so I'm wondering whether this is a common problem or just me.
I'm using the following script to start my program
using ClusterManagers
using Distributed
using LinearAlgebra
LinearAlgebra.BLAS.set_num_threads(1)
ncores = parse(Int, ENV["SLURM_NTASKS"])
@info "Setting up for SLURM, $ncores tasks detected"; flush(stdout)
addprocs(SlurmManager(ncores))
and the full error
nested task error: IOError: connect: connection refused (ECONNREFUSED)
Stacktrace:
[1] worker_from_id(pg::Distributed.ProcessGroup, i::Int64)
@ Distributed /home/share/julia/stdlib/v1.7/Distributed/src/cluster.jl:1089
[2] worker_from_id
@ /home/julia-1.7.3/share/julia/stdlib/v1.7/Distributed/src/cluster.jl:1086 [inlined]
[3] #remote_do#170
@ /home/julia-1.7.3/share/julia/stdlib/v1.7/Distributed/src/remotecall.jl:561 [inlined]
[4] remote_do
@ /home/julia-1.7.3/share/julia/stdlib/v1.7/Distributed/src/remotecall.jl:561 [inlined]
[5] kill
@ /home/julia-1.7.3/share/julia/stdlib/v1.7/Distributed/src/managers.jl:668 [inlined]
[6] create_worker(manager::SlurmManager, wconfig::WorkerConfig)
@ Distributed /home/julia-1.7.3/share/julia/stdlib/v1.7/Distributed/src/cluster.jl:600
[7] setup_launched_worker(manager::SlurmManager, wconfig::WorkerConfig, launched_q::Vector{Int64})
@ Distributed /home/julia-1.7.3/share/julia/stdlib/v1.7/Distributed/src/cluster.jl:541
[8] (::Distributed.var"#45#48"{SlurmManager, Vector{Int64}, WorkerConfig})()
@ Distributed ./task.jl:429
caused by: IOError: connect: connection refused (ECONNREFUSED)
Stacktrace:
[1] wait_connected(x::Sockets.TCPSocket)
@ Sockets /home/julia-1.7.3/share/julia/stdlib/v1.7/Sockets/src/Sockets.jl:532
[2] connect
@ /home/julia-1.7.3/share/julia/stdlib/v1.7/Sockets/src/Sockets.jl:567 [inlined]
[3] connect_to_worker(host::String, port::Int64)
@ Distributed /home/julia-1.7.3/share/julia/stdlib/v1.7/Distributed/src/managers.jl:632
[4] connect(manager::SlurmManager, pid::Int64, config::WorkerConfig)
@ Distributed /home/julia-1.7.3/share/julia/stdlib/v1.7/Distributed/src/managers.jl:559
[5] create_worker(manager::SlurmManager, wconfig::WorkerConfig)
@ Distributed /home/julia-1.7.3/share/julia/stdlib/v1.7/Distributed/src/cluster.jl:596
[6] setup_launched_worker(manager::SlurmManager, wconfig::WorkerConfig, launched_q::Vector{Int64})
@ Distributed /home/julia-1.7.3/share/julia/stdlib/v1.7/Distributed/src/cluster.jl:541
[7] (::Distributed.var"#45#48"{SlurmManager, Vector{Int64}, WorkerConfig})()
@ Distributed ./task.jl:429
...and 212 more exceptions.
Please tell me if this is the wrong way to post. Matthias
@matthiasbe The IP address issue I mentioned is specific to SSHManager, whereas you seem to use SlurmManager instead, about which I know nothing.
Thank you for the quick response. I see the ClusterManagers.jl package has its own behavior (https://github.com/JuliaParallel/ClusterManagers.jl/blob/master/src/slurm.jl).
Is there some network address translation (NAT) going on between you and the worker, i.e. is the worker perhaps not returning the same IP address as the SSH command connected to in the first place?
The way Distributed works is:
- addprocs makes an SSH connection to worker W
- on worker W, julia is started, which calls getipaddr() and returns that IP address A
- addprocs now makes a TCP connection to address A
If A is an address behind a NAT, ssh A will obviously fail to reach the same machine as ssh W, so make sure there is no NAT between you and the worker. You can't use NAT for protocols that communicate numerical IP addresses, whether that is ftp, SIP or Distributed.
Hi, thanks for this insight. Just one question regarding this: what ports need to be opened to use Julia like this? I can connect using SSH on port 22 and Julia also starts on the remote machine. However, the TCP connection can't be established afterwards. Can I see/specify the port to be used for this? And is it sufficient to open the ports on the remote machine, or do they have to be open on the client side too?
What ports need to be opened to use Julia like this? I can connect using SSH on port 22 and Julia also starts on the remote machine. However, the TCP connection can't be established afterwards. Can I see/specify the port to be used for this?
If you are using SSHManager, then by default the master opens an SSH connection to the worker and starts
$ julia --worker
cookie
julia_worker:9880#198.51.100.42
The master supplies via standard input a cookie followed by a linefeed, and the worker then answers with port#ip where it binds and thus can be contacted via TCP to do the actual RPC work outside ssh. By default, the worker lets the kernel pick an arbitrary free TCP port number (which has a high success rate).
As you can see in Distributed.launch_on_machine, the machine string can have the form [user@]host[:port] bind_addr[:bind_port], and the second half should give you control over the IP address and port number to which the worker binds. But that requires that that port is free (which may not always be the case if it has been used within the last couple of minutes, and the TCP session wasn't finished properly).
There is also tunnel mode, which does not use a separate TCP connection outside SSH.
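A minimal sketch of the tunnel variant, reusing the placeholder host from above; with tunnel=true the master reaches the worker through an SSH port forward instead of a direct TCP connection to the reported address:

using Distributed

# tunnel=true routes the worker connection through an ssh port forward,
# so the worker's bind address/port never has to be directly reachable
addprocs([("docker@192.168.2.220:2222", 1)];
         tunnel=true,
         exename="/usr/local/julia/bin/julia",
         sshflags=["-i", raw"~/.ssh/juliakey"])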
This seems to be a usage discussion better suited to the Discourse forum, where there are better controls for message formatting and threading. And if not there, then on the https://github.com/JuliaLang/Distributed.jl/ issue tracker.
This remains an issue
@andreasnoack If you're on SLURM, a lot of the problems may be related to SSH, pam_slurm_adopt, and the cgroup handler. On my cluster, SSH connections were not getting intercepted and added to the cgroup, and so were liable to be killed or to run out of memory despite having allocated resources. Or SSH may simply be blocked entirely. I'd stick to SlurmClusterManager, which wraps everything inside an srun call, or roll your own srun-driven solution based on it.
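For reference, a minimal sketch following the SlurmClusterManager README (note it exports its own SlurmManager, distinct from the one in ClusterManagers.jl):

using Distributed
using SlurmClusterManager   # a different package from ClusterManagers.jl

# Run inside an sbatch/salloc allocation; one worker per Slurm task is
# launched via srun, so the workers stay inside the job's allocated resources.
addprocs(SlurmManager())
@everywhere println("hello from worker $(myid()) on $(gethostname())")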