sshkit
Issue: Parallel executor performance is mostly linear
We use capistrano to deploy on several hundred servers at once, and we noticed capistrano's performance was heavily tied to the number of servers.
To prove it I ran the following benchmark:
require 'benchmark'
require 'csv'

task :bench do
  servers = roles(:borg).to_a
  timings = {}

  if ENV['PREWARM']
    on servers do
      execute :echo, 'pong > /dev/null'
    end
  end

  [1, 10, 50, 100].each do |count|
    timings[count] = Benchmark.measure do
      on servers.pop(count) do
        execute :echo, 'pong > /dev/null'
      end
    end
  end

  res = [%w(count user system real)]
  timings.each do |count, timing|
    res << [count, timing.utime, timing.stime, timing.real]
  end
  puts res.map(&:to_csv).join
end
Here are the results: Spreadsheet
The first graph is with "cold" connections, meaning it measures connection establishment plus command execution. In the second graph, all the connections were pre-established before the benchmark.
I'm still investigating to figure out where exactly the bottleneck (or bottlenecks) is. I know the GIL plays a part, but since capistrano / SSHKit is IO-heavy, I think there are other reasons.
cc @kirs as well as @csfrancis
Oh and I forgot, here's a profile of a similar benchmark with about 180 servers:
Cold:
==================================
Mode: wall(1000)
Samples: 3592 (6.04% miss rate)
GC: 390 (10.86%)
==================================
TOTAL (pct) SAMPLES (pct) FRAME
626 (17.4%) 620 (17.3%) SSHKit::Runner::Parallel#execute
511 (14.2%) 511 (14.2%) block (3 levels) in Net::SSH::KnownHosts#keys_for
884 (24.6%) 306 (8.5%) block (2 levels) in Net::SSH::KnownHosts#keys_for
158 (4.4%) 111 (3.1%) Net::SSH::Transport::Kex::DiffieHellmanGroup1SHA1#send_kexinit
203 (5.7%) 109 (3.0%) OpenSSL::PKey::DH#valid?
94 (2.6%) 94 (2.6%) block in OpenSSL::PKey::DH#valid?
89 (2.5%) 89 (2.5%) Net::SSH::Compat.io_select
60 (1.7%) 60 (1.7%) block (2 levels) in Net::SSH::Transport::ServerVersion#negotiate!
55 (1.5%) 55 (1.5%) block (2 levels) in Net::SSH::Connection::Channel#forward_local_env
52 (1.4%) 52 (1.4%) Net::SSH::KnownHosts#known_host_hash?
956 (26.6%) 51 (1.4%) Net::SSH::KnownHosts#keys_for
93 (2.6%) 49 (1.4%) block in Net::SSH::Config.load
59 (1.6%) 46 (1.3%) Net::SSH::Buffer#read
127 (3.5%) 34 (0.9%) Net::SSH::Config.load
33 (0.9%) 33 (0.9%) Net::SSH::Buffer#write_long
181 (5.0%) 33 (0.9%) Net::SSH::Transport::PacketStream#poll_next_packet
64 (1.8%) 32 (0.9%) Net::SSH::BufferedIo#fill
52 (1.4%) 31 (0.9%) Net::SSH::BufferedIo#send_pending
30 (0.8%) 30 (0.8%) SSHKit::Backend::ConnectionPool::Entry#expired?
29 (0.8%) 29 (0.8%) Logger#debug?
33 (0.9%) 28 (0.8%) Net::SSH::Transport::State#update_next_iv
27 (0.8%) 27 (0.8%) Net::SSH::Buffer#length
27 (0.8%) 27 (0.8%) Net::SSH::Config.pattern2regex
26 (0.7%) 26 (0.7%) Net::SSH::Transport::HMAC::Abstract.mac_length
85 (2.4%) 23 (0.6%) Net::SSH::Buffer.from
22 (0.6%) 22 (0.6%) OpenSSL::PKey::RSA#ssh_do_verify
1374 (38.3%) 21 (0.6%) block in Net::SSH::Transport::Session#poll_message
905 (25.2%) 21 (0.6%) block in Net::SSH::KnownHosts#keys_for
20 (0.6%) 20 (0.6%) Net::SSH::Buffer#initialize
17 (0.5%) 17 (0.5%) Net::SSH::KnownHosts#initialize
Warmed:
==================================
Mode: wall(1000)
Samples: 450 (6.83% miss rate)
GC: 98 (21.78%)
==================================
TOTAL (pct) SAMPLES (pct) FRAME
54 (12.0%) 54 (12.0%) SSHKit::Backend::ConnectionPool::Entry#expired?
50 (11.1%) 50 (11.1%) Net::SSH::Compat.io_select
28 (6.2%) 28 (6.2%) block (2 levels) in Net::SSH::Connection::Channel#forward_local_env
11 (2.4%) 11 (2.4%) Net::SSH::Transport::HMAC::Abstract.mac_length
11 (2.4%) 10 (2.2%) Net::SSH::Buffer#read
22 (4.9%) 10 (2.2%) Net::SSH::BufferedIo#fill
9 (2.0%) 9 (2.0%) Net::SSH::Transport::State#update_next_iv
12 (2.7%) 9 (2.0%) Net::SSH::BufferedIo#send_pending
63 (14.0%) 8 (1.8%) block (2 levels) in SSHKit::Backend::ConnectionPool#prune_expired
8 (1.8%) 8 (1.8%) Net::SSH::Buffer#initialize
7 (1.6%) 7 (1.6%) Net::SSH::Buffer#append
7 (1.6%) 7 (1.6%) Thread.create_with_logging_context
6 (1.3%) 6 (1.3%) Net::SSH::Buffer#write_long
5 (1.1%) 5 (1.1%) Logger#debug?
5 (1.1%) 5 (1.1%) block in Net::SSH::Transport::PacketStream#enqueue_packet
5 (1.1%) 5 (1.1%) Net::SSH::Transport::HMAC::Abstract.digest_class
5 (1.1%) 5 (1.1%) #<Module:0x007fafe25598f0>.reporter
9 (2.0%) 4 (0.9%) Net::SSH::Transport::State#update_cipher
9 (2.0%) 4 (0.9%) SSHKit::Color#colorize
4 (0.9%) 4 (0.9%) Net::SSH::BufferedIo#input
4 (0.9%) 4 (0.9%) block (3 levels) in <class:Digest>
271 (60.2%) 4 (0.9%) Net::SSH::Connection::Session#dispatch_incoming_packets
4 (0.9%) 4 (0.9%) SSHKit::Color#colorize?
10 (2.2%) 4 (0.9%) Net::SSH::Buffer#read_long
8 (1.8%) 4 (0.9%) block in Net::SSH::Packet#instantiate!
3 (0.7%) 3 (0.7%) Net::SSH::Buffer#length
14 (3.1%) 3 (0.7%) Capistrano::DSL#on
13 (2.9%) 3 (0.7%) Net::SSH::Buffer.from
3 (0.7%) 3 (0.7%) SSHKit::Backend::Abstract#initialize
3 (0.7%) 3 (0.7%) Net::SSH::Buffer#to_s
Interesting! I am also surprised that the results are linear. Please keep us updated on your findings. I wonder if JRuby results are any different.
Well, while posting the profile I figured that:
54 (12.0%) 54 (12.0%) SSHKit::Backend::ConnectionPool::Entry#expired?
was very sketchy. It turns out that, according to our monitoring, running migrations very often takes more than 30 seconds (we use docker, so between docker pull and rails boot it's easy to cross that mark).
Which means all the connections established before deploy:migrate have to be re-established afterwards.
So I'll check tomorrow, but I think SSHKit.config.backend.pool.idle_timeout = 120 should help quite a bit in our real-world situation.
But it still doesn't explain the benchmark results so I'll keep digging.
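For reference, here is roughly where that setting would go in a Capistrano deploy.rb (a minimal sketch, assuming the default Net::SSH backend; the 120-second value is just an example):

  # config/deploy.rb -- sketch, assuming the default Net::SSH backend.
  # Raise the connection pool idle timeout so connections opened before a
  # long-running step (e.g. a slow deploy:migrate) are still considered
  # fresh afterwards and don't have to be re-established.
  SSHKit.config.backend.pool.idle_timeout = 120 # seconds (default is 30)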
Based on @csfrancis's input and also on https://github.com/net-ssh/net-ssh/pull/303, it seems likely that a good part of the linear performance comes from IO.select, which gets slower the more file descriptors it has to watch.
If https://github.com/net-ssh/net-ssh/pull/303 goes through, it would be very interesting to experiment with an evented backend.
However, the profile clearly shows that a huge amount of time is spent in SSHKit::Backend::ConnectionPool::Entry#expired?. After investigation it's clear that the ConnectionPool is far too zealous about pruning expired connections.
The prune_expired method that iterates over all the connections is called both on checkin and on checkout, so it's executed twice per host and per command.
e.g. if you execute 5 commands on 10 servers, it will be executed 2 * 5 * 10 = 100 times! Also note that it synchronizes a mutex, so it very probably limits concurrency.
I'll start working on a PR to see if those prunes could be batched somehow, e.g. triggered after every X checkins/checkouts.
I'd also like to be able to disable the idle_timeout check without losing the connection pooling entirely.
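To make the batching idea concrete, here is a rough sketch of the counter approach (hypothetical code, not SSHKit's actual ConnectionPool; the class and method names are made up):

  # Hypothetical sketch: count checkins/checkouts and only run the
  # expensive, mutex-synchronized sweep every PRUNE_EVERY operations
  # instead of on every single one.
  class BatchedPruning
    PRUNE_EVERY = 100

    def initialize(pool)
      @pool = pool
      @operations = 0
    end

    def after_checkin_or_checkout
      @operations += 1
      @pool.prune_expired if (@operations % PRUNE_EVERY).zero?
    end
  end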
~~Quick update: it seems like idle_timeout disabling connection pooling entirely is a regression from https://github.com/capistrano/sshkit/pull/298.~~ Sorry, I was wrong.
Based on @csfrancis's input and also on net-ssh/net-ssh#303, it seems likely that a good part of the linear performance comes from IO.select, which gets slower the more file descriptors it has to watch.
Maybe strace -C would tell, since it would measure the time spent on just that syscall.
https://github.com/capistrano/sshkit/pull/327 allows disabling pruning entirely, for those interested. I'm still unsure how to reduce the number of prune_expired calls. Maybe with a counter?
By inspecting the profiles, I also discovered that net/ssh wastes a lot of CPU parsing the same known_hosts files on every connection establishment.
I'll try to see if they could be cached, ref: https://github.com/net-ssh/net-ssh/issues/307
Just to add to this, and since you've mentioned my pull request still being evaluated: the known hosts parsing is a huge bottleneck in net-ssh. As you have correctly mentioned, the parsing will occur for every host you connect to. The performance of this operation depends on the file system you're running it from (if you're on a networked FS, as I usually am, this kills your performance).
Opening a file descriptor will also kill any chance you have of toying around with an evented framework, as the ruby implementations I know (nio4r, eventmachine) famously only listen to network sockets, which means that every file I/O will block the event loop.
IO.select calls do get expensive, but only because they're spread across the library for read/write checks. If the pull request gets accepted, the number of direct IO.select calls (io/wait still falls back to select() when no poll() is implemented) will be reduced to 1 per session traffic processing. For this there is currently no optimal solution in sight.
@TiagoCardoso1983 I tried net-ssh/net-ssh#303 but only got very marginal performance improvements. Is it because the main select in #process is still not patched?
@byroot which OS are you testing it on? Unless you're on a poll()-enabled one, it's the same thing. Even if you're in a poll()-enabled environment, the bulk of the work is done by the #process call, which is unpatchable in its current state and uses IO.select. Just so you understand what I mean: net-ssh has a loop but is far from being truly non-blocking. What it does is check whether there is something to read/write, then proceed to read/write with the blocking Socket APIs (limiting the number of bytes transmitted, yes). Truly non-blocking would be: (1) try to read/write non-blocking, (2a) succeed, or (2b) wait for read/write and go back to (1). Improving this bit would mean rewriting the library; I don't know if that's feasible.
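To illustrate the difference described above (illustration only, not net-ssh's actual code; the endpoint and timeout are placeholders):

  require 'socket'
  require 'io/wait'

  socket = TCPSocket.new('example.com', 22) # placeholder endpoint
  timeout = 5

  # Check-then-read, roughly what the current loop does: ask IO.select
  # whether the socket is readable, then read with a blocking API
  # (bounded in size).
  if IO.select([socket], nil, nil, timeout)
    data = socket.readpartial(64 * 1024)
  end

  # Truly non-blocking: attempt the read first, and wait only when the
  # kernel reports the socket isn't ready yet.
  begin
    data = socket.read_nonblock(64 * 1024)
  rescue IO::WaitReadable
    socket.wait_readable
    retry
  end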
OS are you testing it on?
OSX, so it's poll-enabled.
Improving this bit would mean rewriting the library,
Yeah it's what I figured :disappointed:
OSX's poll implementation is broken. I would have to look at the code, but I think that ruby falls back to select on OSX. You can check it with strace, though. As for the rest, it's mitigated with a proper fallback to a proper event loop. I've been working on something on the side. It will not help your keys_for issue though. :disappointed:
Hi guys, excellent work digging into this. I'm going on the road for a week on Friday, and I don't have time to get too deep into this with you in the next 24h, but you are in good hands with @mattbrictson, and I'll be reading my emails during CW6 if either of us can do anything to support you. We can always escalate things up to net/ssh if we need to; just let us know and we'll help prepare PRs or issues there.
@byroot It looks like with #327 and https://github.com/net-ssh/net-ssh/issues/308 both merged, this issue is now largely resolved.
I did some benchmarking of my own, and while the current SSHKit ConnectionPool does have some unnecessary overhead, in the big picture it is pretty negligible. I think we have reached the point of diminishing returns. Agreed?
net-ssh/net-ssh#308 in itself doesn't fix anything. I'd like to propose one last feature in SSHKit: replacing Net::SSH's KnownHosts with an alternative implementation that caches the lookups in memory.
Would you agree with such a feature?
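For context, a rough sketch of the kind of in-memory caching I have in mind (hypothetical; the class name is made up, and the search_for / known_hosts: hook is an assumption about how Net::SSH's host key verifiers look up keys):

  require 'net/ssh'

  # Hypothetical wrapper that memoizes known_hosts lookups per host so the
  # files are parsed once per process instead of once per connection.
  class CachingKnownHosts
    def initialize
      @cache = {}
      @mutex = Mutex.new
    end

    def search_for(host, options = {})
      @mutex.synchronize do
        @cache[host] ||= Net::SSH::KnownHosts.search_for(host, options)
      end
    end
  end

  # Assumed usage: handed to Net::SSH via the :known_hosts connection option.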
I'm ok with closing this issue. I got sidetracked, but I'll try to see if there are other performance improvements I could bring; I don't need this issue open to do so.
@byroot Yes, I'm open to having SSHKit ship with a better/faster KnownHost implementation. Let's keep this issue open then until that is complete.
Also: during the past couple days I worked on a rewrite of the SSHKit ConnectionPool that uses much less mutex locking and moves stale connection eviction to a background thread. If I open a PR would you be willing to test my branch against your 180 servers? :grin:
My hunch is that the mutex overhead is not really a big deal, but I'm curious to see the results.
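Roughly, the background-eviction idea looks like this (a hypothetical sketch, not the actual branch):

  # Hypothetical: a single sweeper thread evicts stale connections
  # periodically, so checkins/checkouts never pay for a full prune.
  class PoolSweeper
    def initialize(pool, interval: 30)
      @thread = Thread.new do
        loop do
          sleep interval
          pool.prune_expired
        end
      end
    end

    def stop
      @thread.kill
    end
  end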
If I open a PR would you be willing to test my branch against your 180 servers?
Of course, I'd be happy to.
Just wanted to update on this. Today I updated our capistrano recipe to use all the improvements we made so far (including the yet-unreleased #330), and as far as I'm concerned this issue is solved.
The reason I opened this issue initially is that when I attempted to deploy both our datacenters in the same recipe, it ended up being ~50% slower than doing it in 2 concurrent processes like we used to.
But on the second attempt there is no noticeable difference anymore.
There is likely still room for improvement, especially around epoll and friends, but at this point capistrano's performance is satisfactory at our scale (300-400 servers).
@mattbrictson let me know if you need me to do anything else to get #330 merged, other than that I'm good for closing this issue.
Thanks for the feedback @byroot, and for your time and contributions. I'll merge #330 in a day or two unless @mattbrictson chimes in in the meantime. I didn't see anything blocking, but Matt is taking a short break from FOSS, and I still want to give him the chance, without pressure, to have a say.