sshkit
getaddrinfo
Hello,
When deploying to around 100 servers, we run into the following issue once in a while. It happens only occasionally and seemingly at random, for one of the servers being deployed to. This has happened to people on macOS and Linux.
SSHKit::Runner::ExecuteError: Exception while executing as [email protected]: getaddrinfo: nodename nor servname provided, or not known
Are there some known limitations perhaps?
Thanks!
We recently accepted some PRs to deal with large numbers of servers, so your ~100 count isn't exceptional in that regard.
Might I suggest you add a simple ping task and run it in a loop to get a harmless reproduction case? Then you can work through some debugging options, such as clearing your DNS cache beforehand, hard-coding the IPs in your /etc/hosts file, etc.
Your RUBY_VERSION can be significant here too. Older Rubies are, as a rule, less good at networking, but all Rubies have been very good for at least 3-4 years, if not since the 2.0 release.
TLDR: I think I will try to reproduce this issue in Ruby (without sshkit) next.
Thank you for your input so far!
Having the IPs hard-coded in /etc/hosts does help. That has been my workaround for a while. When previously trying to replicate the DNS resolution issue I was not able to do so using other tools.
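For anyone wanting to try the same workaround, here is a minimal sketch of how the /etc/hosts entries could be generated instead of typed by hand. The helper name `hosts_file_lines` and the injectable `resolver` are hypothetical and not part of sshkit or Capistrano; the sketch only uses `Addrinfo.getaddrinfo` from Ruby's standard library and assumes IPv4 addresses are wanted.

```ruby
require 'socket'

# Default resolver: returns the unique IPv4 addresses for a hostname.
DEFAULT_RESOLVER = lambda do |host|
  Addrinfo.getaddrinfo(host, nil, Socket::AF_INET, Socket::SOCK_STREAM)
          .map(&:ip_address)
          .uniq
end

# Hypothetical helper: resolve each hostname once and format the first
# address as an /etc/hosts line. Hosts that fail to resolve are reported
# on stderr and skipped.
def hosts_file_lines(hostnames, resolver: DEFAULT_RESOLVER)
  hostnames.flat_map do |host|
    begin
      resolver.call(host).first(1).map { |ip| "#{ip}\t#{host}" }
    rescue SocketError => e
      warn "could not resolve #{host}: #{e.message}"
      []
    end
  end
end

# Example (performs real DNS lookups; hostnames follow the pattern in this thread):
#   puts hosts_file_lines((1..100).map { |i| "www#{i}.oursite.com" })
```

The output still has to be appended to /etc/hosts manually, and it goes stale whenever the servers change IPs, so this is a diagnostic workaround rather than a fix.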
I have also had this bash script running since yesterday without issues.
while true
do
  date
  seq 1 100 | parallel --tag ping -c 1 www{}.oursite.com | grep 'Unknown'
  sleep 15
done
I did run into the same issue again just now when trying to deploy. The script above was running at the time, so the resolved IPs should still have been cached by the OS. This time two lookups failed at the same time.
#<Thread:0x00007fa45e2d4f60@/Users/flo/.rvm/gems/ruby-2.6.3/gems/sshkit-1.20.0/lib/sshkit/runners/parallel.rb:10 run> terminated with exception (report_on_exception is true):
Traceback (most recent call last):
17: from /Users/flo/.rvm/gems/ruby-2.6.3/gems/sshkit-1.20.0/lib/sshkit/runners/parallel.rb:12:in `block (2 levels) in execute'
16: from /Users/flo/.rvm/gems/ruby-2.6.3/gems/sshkit-1.20.0/lib/sshkit/backends/abstract.rb:31:in `run'
15: from /Users/flo/.rvm/gems/ruby-2.6.3/gems/sshkit-1.20.0/lib/sshkit/backends/abstract.rb:31:in `instance_exec'
14: from /Users/flo/.rvm/gems/ruby-2.6.3/gems/capistrano-3.11.0/lib/capistrano/scm/tasks/git.rake:8:in `block (3 levels) in eval_rakefile'
13: from /Users/flo/.rvm/gems/ruby-2.6.3/gems/sshkit-1.20.0/lib/sshkit/backends/abstract.rb:80:in `execute'
12: from /Users/flo/.rvm/gems/ruby-2.6.3/gems/sshkit-1.20.0/lib/sshkit/backends/abstract.rb:148:in `create_command_and_execute'
11: from /Users/flo/.rvm/gems/ruby-2.6.3/gems/sshkit-1.20.0/lib/sshkit/backends/abstract.rb:148:in `tap'
10: from /Users/flo/.rvm/gems/ruby-2.6.3/gems/sshkit-1.20.0/lib/sshkit/backends/abstract.rb:148:in `block in create_command_and_execute'
9: from /Users/flo/.rvm/gems/ruby-2.6.3/gems/sshkit-1.20.0/lib/sshkit/backends/netssh.rb:130:in `execute_command'
8: from /Users/flo/.rvm/gems/ruby-2.6.3/gems/sshkit-1.20.0/lib/sshkit/backends/netssh.rb:177:in `with_ssh'
7: from /Users/flo/.rvm/gems/ruby-2.6.3/gems/sshkit-1.20.0/lib/sshkit/backends/connection_pool.rb:63:in `with'
6: from /Users/flo/.rvm/gems/ruby-2.6.3/gems/sshkit-1.20.0/lib/sshkit/backends/connection_pool.rb:63:in `call'
5: from /Users/flo/.rvm/gems/ruby-2.6.3/gems/net-ssh-5.2.0/lib/net/ssh.rb:246:in `start'
4: from /Users/flo/.rvm/gems/ruby-2.6.3/gems/net-ssh-5.2.0/lib/net/ssh.rb:246:in `new'
3: from /Users/flo/.rvm/gems/ruby-2.6.3/gems/net-ssh-5.2.0/lib/net/ssh/transport/session.rb:73:in `initialize'
2: from /Users/flo/.rvm/rubies/ruby-2.6.3/lib/ruby/2.6.0/socket.rb:631:in `tcp'
1: from /Users/flo/.rvm/rubies/ruby-2.6.3/lib/ruby/2.6.0/socket.rb:227:in `foreach'
/Users/flo/.rvm/rubies/ruby-2.6.3/lib/ruby/2.6.0/socket.rb:227:in `getaddrinfo': getaddrinfo: nodename nor servname provided, or not known (SocketError)
1: from /Users/flo/.rvm/gems/ruby-2.6.3/gems/sshkit-1.20.0/lib/sshkit/runners/parallel.rb:11:in `block (2 levels) in execute'
/Users/flo/.rvm/gems/ruby-2.6.3/gems/sshkit-1.20.0/lib/sshkit/runners/parallel.rb:15:in `rescue in block (2 levels) in execute': Exception while executing as [email protected]: getaddrinfo: nodename nor servname provided, or not known (SSHKit::Runner::ExecuteError)
#<Thread:0x00007fa45e426fd0@/Users/flo/.rvm/gems/ruby-2.6.3/gems/sshkit-1.20.0/lib/sshkit/runners/parallel.rb:10 run> terminated with exception (report_on_exception is true):
Traceback (most recent call last):
17: from /Users/flo/.rvm/gems/ruby-2.6.3/gems/sshkit-1.20.0/lib/sshkit/runners/parallel.rb:12:in `block (2 levels) in execute'
16: from /Users/flo/.rvm/gems/ruby-2.6.3/gems/sshkit-1.20.0/lib/sshkit/backends/abstract.rb:31:in `run'
15: from /Users/flo/.rvm/gems/ruby-2.6.3/gems/sshkit-1.20.0/lib/sshkit/backends/abstract.rb:31:in `instance_exec'
14: from /Users/flo/.rvm/gems/ruby-2.6.3/gems/capistrano-3.11.0/lib/capistrano/scm/tasks/git.rake:8:in `block (3 levels) in eval_rakefile'
13: from /Users/flo/.rvm/gems/ruby-2.6.3/gems/sshkit-1.20.0/lib/sshkit/backends/abstract.rb:80:in `execute'
12: from /Users/flo/.rvm/gems/ruby-2.6.3/gems/sshkit-1.20.0/lib/sshkit/backends/abstract.rb:148:in `create_command_and_execute'
11: from /Users/flo/.rvm/gems/ruby-2.6.3/gems/sshkit-1.20.0/lib/sshkit/backends/abstract.rb:148:in `tap'
10: from /Users/flo/.rvm/gems/ruby-2.6.3/gems/sshkit-1.20.0/lib/sshkit/backends/abstract.rb:148:in `block in create_command_and_execute'
9: from /Users/flo/.rvm/gems/ruby-2.6.3/gems/sshkit-1.20.0/lib/sshkit/backends/netssh.rb:130:in `execute_command'
8: from /Users/flo/.rvm/gems/ruby-2.6.3/gems/sshkit-1.20.0/lib/sshkit/backends/netssh.rb:177:in `with_ssh'
7: from /Users/flo/.rvm/gems/ruby-2.6.3/gems/sshkit-1.20.0/lib/sshkit/backends/connection_pool.rb:63:in `with'
6: from /Users/flo/.rvm/gems/ruby-2.6.3/gems/sshkit-1.20.0/lib/sshkit/backends/connection_pool.rb:63:in `call'
5: from /Users/flo/.rvm/gems/ruby-2.6.3/gems/net-ssh-5.2.0/lib/net/ssh.rb:246:in `start'
4: from /Users/flo/.rvm/gems/ruby-2.6.3/gems/net-ssh-5.2.0/lib/net/ssh.rb:246:in `new'
3: from /Users/flo/.rvm/gems/ruby-2.6.3/gems/net-ssh-5.2.0/lib/net/ssh/transport/session.rb:73:in `initialize'
2: from /Users/flo/.rvm/rubies/ruby-2.6.3/lib/ruby/2.6.0/socket.rb:631:in `tcp'
1: from /Users/flo/.rvm/rubies/ruby-2.6.3/lib/ruby/2.6.0/socket.rb:227:in `foreach'
/Users/flo/.rvm/rubies/ruby-2.6.3/lib/ruby/2.6.0/socket.rb:227:in `getaddrinfo': getaddrinfo: nodename nor servname provided, or not known (SocketError)
1: from /Users/flo/.rvm/gems/ruby-2.6.3/gems/sshkit-1.20.0/lib/sshkit/runners/parallel.rb:11:in `block (2 levels) in execute'
/Users/flo/.rvm/gems/ruby-2.6.3/gems/sshkit-1.20.0/lib/sshkit/runners/parallel.rb:15:in `rescue in block (2 levels) in execute': Exception while executing as [email protected]: getaddrinfo: nodename nor servname provided, or not known (SSHKit::Runner::ExecuteError)
I think I will try to reproduce this issue in Ruby (without sshkit) next.
I have tried to reproduce this in other ways (including the script below), but I have only been able to reproduce the issue when using Capistrano to deploy (which uses sshkit).
I have also tried switching DNS servers. Hardcoding the hosts in /etc/hosts seems to fix the issue for me.
# frozen_string_literal: true

require 'socket'

# Every 20 seconds, resolve 100 hostnames in parallel threads
# and print any lookup that raises.
loop do
  puts Time.now
  threads = (1..100).map do |i|
    Thread.new do
      addr = "www#{i}.oursite.com"
      begin
        Socket.getaddrinfo(addr, 'https', nil, Socket::SOCK_STREAM)
      rescue StandardError => e
        # SocketError (a StandardError) is what a failed lookup raises
        puts "#{addr} #{Time.now}", e, ''
      end
    end
  end
  threads.each(&:join)
  sleep 20
end
I have the same problem, but with a single server. I tried some tests:
Success
require 'socket'
Socket.getaddrinfo('subdomain.domain.com', 80, nil, Socket::SOCK_STREAM)
Success
require 'net/ssh'
Net::SSH.start('subdomain.domain.com', 'sshuser') do |ssh|
  ssh.exec 'touch ~/test.txt'
end
Fails with the same stack trace as Cervenka's
require 'sshkit'
require 'sshkit/dsl'
include SSHKit::DSL

SSHKit::Backend::Netssh.configure do |ssh|
  ssh.connection_timeout = 5
  ssh.ssh_options = {
    user: 'sshuser',
    keys: %w[~/.ssh/id_rsa],
    auth_methods: %w[ publickey ]
  }
end

nodes = %w[ 'subdomain.domain.com' ]
on nodes do |node|
  output = capture :ls, '-l'
  puts output
end
Also:
- Flushed the DNS cache and checked DNS in a loop: everything is OK, nothing suspicious.
- Tried with the IP instead of the hostname: fails with the same exception.
- Hardcoding the hosts in /etc/hosts does not fix it.
Do you have any ideas where to dig deeper?
ruby 3.1.1p18 (2022-02-18 revision 53f5fc4236) [x86_64-darwin21] sshkit (1.21.3)
Solved. I need more sleep.
nodes = %w[ 'subdomain.domain.com' ]
=> nodes = %w[ subdomain.domain.com ]
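For later readers: `%w[...]` splits on whitespace only and does not strip quotes, so the quote characters become part of the hostname that is handed to getaddrinfo. That would also explain why switching to an IP or editing /etc/hosts made no difference. A short sketch showing the difference, plus a hypothetical sanity check (`suspicious_hosts` and `HOSTNAME_PATTERN` are illustrative names, not sshkit API; the pattern is a rough approximation and would also flag valid `user@host` forms):

```ruby
# %w splits on whitespace only; quote characters are kept as literal text.
quoted   = %w[ 'subdomain.domain.com' ]
unquoted = %w[ subdomain.domain.com ]

p quoted    # the single quotes are part of the string
p unquoted

# Rough approximation of a bare hostname or IPv4 literal (an assumption,
# not a full RFC-compliant check).
HOSTNAME_PATTERN = /\A[A-Za-z0-9]([A-Za-z0-9\-.]*[A-Za-z0-9])?\z/

# Hypothetical pre-flight check: return entries that do not look like
# plain hostnames, so stray quotes or spaces show up before deploying.
def suspicious_hosts(hosts)
  hosts.reject { |h| h.match?(HOSTNAME_PATTERN) }
end

p suspicious_hosts(quoted + unquoted)
```

Printing the host list with `p` (which calls `inspect`) rather than `puts` is usually enough to spot this kind of problem, since the stray quotes are then visible inside the string.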