sshkit
Issue: Parallel executor performance is mostly linear
We use capistrano to deploy on several hundred servers at once, and we noticed capistrano's performance was heavily tied to the number of servers.
To prove it I ran the following benchmark:
require 'benchmark'
require 'csv'

task :bench do
  servers = roles(:borg).to_a
  timings = {}

  if ENV['PREWARM']
    on servers do
      execute :echo, 'pong > /dev/null'
    end
  end

  [1, 10, 50, 100].each do |count|
    timings[count] = Benchmark.measure do
      on servers.pop(count) do
        execute :echo, 'pong > /dev/null'
      end
    end
  end

  res = [%w(count user system real)]
  timings.each do |count, timing|
    res << [count, timing.utime, timing.stime, timing.real]
  end
  puts res.map(&:to_csv).join
end
Here are the results: Spreadsheet
The first graph is with "cold" connections, meaning it measures connection establishment plus command execution. In the second graph, all the connections were pre-established before the benchmark.
I'm still investigating to figure out where exactly the bottleneck (or bottlenecks) is. I know the GIL plays a part, but since capistrano / SSHKit is IO-heavy, I think there are other reasons.
cc @kirs as well as @csfrancis
Oh and I forgot, here's a profile of a similar benchmark with about 180 servers:
Cold:
==================================
Mode: wall(1000)
Samples: 3592 (6.04% miss rate)
GC: 390 (10.86%)
==================================
TOTAL (pct) SAMPLES (pct) FRAME
626 (17.4%) 620 (17.3%) SSHKit::Runner::Parallel#execute
511 (14.2%) 511 (14.2%) block (3 levels) in Net::SSH::KnownHosts#keys_for
884 (24.6%) 306 (8.5%) block (2 levels) in Net::SSH::KnownHosts#keys_for
158 (4.4%) 111 (3.1%) Net::SSH::Transport::Kex::DiffieHellmanGroup1SHA1#send_kexinit
203 (5.7%) 109 (3.0%) OpenSSL::PKey::DH#valid?
94 (2.6%) 94 (2.6%) block in OpenSSL::PKey::DH#valid?
89 (2.5%) 89 (2.5%) Net::SSH::Compat.io_select
60 (1.7%) 60 (1.7%) block (2 levels) in Net::SSH::Transport::ServerVersion#negotiate!
55 (1.5%) 55 (1.5%) block (2 levels) in Net::SSH::Connection::Channel#forward_local_env
52 (1.4%) 52 (1.4%) Net::SSH::KnownHosts#known_host_hash?
956 (26.6%) 51 (1.4%) Net::SSH::KnownHosts#keys_for
93 (2.6%) 49 (1.4%) block in Net::SSH::Config.load
59 (1.6%) 46 (1.3%) Net::SSH::Buffer#read
127 (3.5%) 34 (0.9%) Net::SSH::Config.load
33 (0.9%) 33 (0.9%) Net::SSH::Buffer#write_long
181 (5.0%) 33 (0.9%) Net::SSH::Transport::PacketStream#poll_next_packet
64 (1.8%) 32 (0.9%) Net::SSH::BufferedIo#fill
52 (1.4%) 31 (0.9%) Net::SSH::BufferedIo#send_pending
30 (0.8%) 30 (0.8%) SSHKit::Backend::ConnectionPool::Entry#expired?
29 (0.8%) 29 (0.8%) Logger#debug?
33 (0.9%) 28 (0.8%) Net::SSH::Transport::State#update_next_iv
27 (0.8%) 27 (0.8%) Net::SSH::Buffer#length
27 (0.8%) 27 (0.8%) Net::SSH::Config.pattern2regex
26 (0.7%) 26 (0.7%) Net::SSH::Transport::HMAC::Abstract.mac_length
85 (2.4%) 23 (0.6%) Net::SSH::Buffer.from
22 (0.6%) 22 (0.6%) OpenSSL::PKey::RSA#ssh_do_verify
1374 (38.3%) 21 (0.6%) block in Net::SSH::Transport::Session#poll_message
905 (25.2%) 21 (0.6%) block in Net::SSH::KnownHosts#keys_for
20 (0.6%) 20 (0.6%) Net::SSH::Buffer#initialize
17 (0.5%) 17 (0.5%) Net::SSH::KnownHosts#initialize
Warmed:
==================================
Mode: wall(1000)
Samples: 450 (6.83% miss rate)
GC: 98 (21.78%)
==================================
TOTAL (pct) SAMPLES (pct) FRAME
54 (12.0%) 54 (12.0%) SSHKit::Backend::ConnectionPool::Entry#expired?
50 (11.1%) 50 (11.1%) Net::SSH::Compat.io_select
28 (6.2%) 28 (6.2%) block (2 levels) in Net::SSH::Connection::Channel#forward_local_env
11 (2.4%) 11 (2.4%) Net::SSH::Transport::HMAC::Abstract.mac_length
11 (2.4%) 10 (2.2%) Net::SSH::Buffer#read
22 (4.9%) 10 (2.2%) Net::SSH::BufferedIo#fill
9 (2.0%) 9 (2.0%) Net::SSH::Transport::State#update_next_iv
12 (2.7%) 9 (2.0%) Net::SSH::BufferedIo#send_pending
63 (14.0%) 8 (1.8%) block (2 levels) in SSHKit::Backend::ConnectionPool#prune_expired
8 (1.8%) 8 (1.8%) Net::SSH::Buffer#initialize
7 (1.6%) 7 (1.6%) Net::SSH::Buffer#append
7 (1.6%) 7 (1.6%) Thread.create_with_logging_context
6 (1.3%) 6 (1.3%) Net::SSH::Buffer#write_long
5 (1.1%) 5 (1.1%) Logger#debug?
5 (1.1%) 5 (1.1%) block in Net::SSH::Transport::PacketStream#enqueue_packet
5 (1.1%) 5 (1.1%) Net::SSH::Transport::HMAC::Abstract.digest_class
5 (1.1%) 5 (1.1%) #<Module:0x007fafe25598f0>.reporter
9 (2.0%) 4 (0.9%) Net::SSH::Transport::State#update_cipher
9 (2.0%) 4 (0.9%) SSHKit::Color#colorize
4 (0.9%) 4 (0.9%) Net::SSH::BufferedIo#input
4 (0.9%) 4 (0.9%) block (3 levels) in <class:Digest>
271 (60.2%) 4 (0.9%) Net::SSH::Connection::Session#dispatch_incoming_packets
4 (0.9%) 4 (0.9%) SSHKit::Color#colorize?
10 (2.2%) 4 (0.9%) Net::SSH::Buffer#read_long
8 (1.8%) 4 (0.9%) block in Net::SSH::Packet#instantiate!
3 (0.7%) 3 (0.7%) Net::SSH::Buffer#length
14 (3.1%) 3 (0.7%) Capistrano::DSL#on
13 (2.9%) 3 (0.7%) Net::SSH::Buffer.from
3 (0.7%) 3 (0.7%) SSHKit::Backend::Abstract#initialize
3 (0.7%) 3 (0.7%) Net::SSH::Buffer#to_s
Interesting! I am also surprised that the results are linear. Please keep us updated on your findings. I wonder if JRuby results are any different.
Well, while posting the profile I figured that:
54 (12.0%) 54 (12.0%) SSHKit::Backend::ConnectionPool::Entry#expired?
was very sketchy. It turns out that, according to our monitoring, running migrations very often takes more than 30 seconds (we use docker, so between docker pull and rails boot it's easy to cross that mark).
Which means all the connections established before deploy:migrate have to be re-established afterwards.
So I'll check tomorrow, but I think SSHKit.config.backend.pool.idle_timeout = 120 should help quite a bit in our real-world situation.
But it still doesn't explain the benchmark results so I'll keep digging.
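For reference, here is roughly where that setting would go in a Capistrano deploy.rb (a minimal sketch, assuming the default Net::SSH backend; the 120-second value is just an example):

  # config/deploy.rb -- sketch, assuming the default Net::SSH backend.
  # Raise the connection pool idle timeout so connections opened before a
  # long-running step (e.g. a slow deploy:migrate) are still considered
  # fresh afterwards and don't have to be re-established.
  SSHKit.config.backend.pool.idle_timeout = 120 # seconds (default is 30)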
Based on @csfrancis's input and also on https://github.com/net-ssh/net-ssh/pull/303, it seems likely that a good part of the linear performance comes from IO.select, which gets slower the more file descriptors it has to watch.
If https://github.com/net-ssh/net-ssh/pull/303 goes through, it would be very interesting to experiment with an evented backend.
However, the profile clearly shows that a huge amount of time is spent in SSHKit::Backend::ConnectionPool::Entry#expired?. After investigation it's clear that the ConnectionPool is far too zealous about pruning expired connections.
The prune_expired method that iterates over all the connections is called both on checkin and on checkout, so it's executed twice per host and per command.
e.g. if you execute 5 commands on 10 servers, it will be executed 2 * 5 * 10 = 100 times! Also note that it synchronizes a mutex, so it very probably limits concurrency.
I'll start working on a PR to see if those prunes could be batched somehow, e.g. triggered after every X checkins/checkouts.
I'd also like to be able to disable the idle_timeout check without losing the connection pooling entirely.
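To make the batching idea concrete, here is a rough sketch of the counter approach (hypothetical code, not SSHKit's actual ConnectionPool; the class and method names are made up):

  # Hypothetical sketch: count checkins/checkouts and only run the
  # expensive, mutex-synchronized sweep every PRUNE_EVERY operations
  # instead of on every single one.
  class BatchedPruning
    PRUNE_EVERY = 100

    def initialize(pool)
      @pool = pool
      @operations = 0
    end

    def after_checkin_or_checkout
      @operations += 1
      @pool.prune_expired if (@operations % PRUNE_EVERY).zero?
    end
  end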
~~Quick update: it seems like idle_timeout disabling connection pooling entirely is a regression from https://github.com/capistrano/sshkit/pull/298.~~ Sorry, I was wrong.
Based on @csfrancis's input and also on net-ssh/net-ssh#303, it seems likely that a good part of the linear performance comes from IO.select, which gets slower the more file descriptors it has to watch.
Maybe strace -C would tell, since it would measure the time spent on just that syscall.
https://github.com/capistrano/sshkit/pull/327 allows disabling pruning entirely, for those interested. I'm still unsure how to reduce the number of prune_expired calls. Maybe with a counter?
By inspecting the profiles, I also discovered that net/ssh wastes a lot of CPU parsing the same known_hosts files on every connection establishment.
I'll try to see if they could be cached, ref: https://github.com/net-ssh/net-ssh/issues/307
Just to add to this, and since you've mentioned my pull request still being evaluated: the known hosts parsing is a huge bottleneck in net-ssh. As you have correctly mentioned, the parsing will occur for every host you connect to. The performance of this operation depends on the file system you're running it from (if you're on a networked FS, as I usually am, this kills your performance).
Opening a file descriptor will also kill any chance you have of toying around with an evented framework, as the ruby implementations I know (nio4r, eventmachine) famously only listen to network sockets, which means that every file I/O will block the event loop.
IO.select calls do get expensive, but only because they're spread across the library for read/write checks. If the pull request gets accepted, the number of direct IO.select calls (io/wait still falls back to select() when no poll() is implemented) will be reduced to 1 per session traffic processing. For this there is currently no optimal solution in sight.
@TiagoCardoso1983 I tried net-ssh/net-ssh#303 but only got very marginal performance improvements. Is it because the main select in #process is still not patched?
@byroot which OS are you testing it on? Unless you're on a poll()-enabled one, it's the same thing. Even if you're in a poll()-enabled environment, the bulk of the work is done by the #process call, which is unpatchable in its current state and uses IO.select. Just so you understand what I mean: net-ssh has a loop but is far from being truly non-blocking. What it does is check whether there is something to read/write, then proceed to read/write with the blocking Socket APIs (limiting the number of bytes transmitted, yes). Truly non-blocking would be: (1) try to read/write non-blocking, (2a) succeed, or (2b) wait for read/write and go back to (1). Improving this bit would mean rewriting the library; I don't know if that's feasible.
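To illustrate the difference described above (illustration only, not net-ssh's actual code; the endpoint and timeout are placeholders):

  require 'socket'
  require 'io/wait'

  socket = TCPSocket.new('example.com', 22) # placeholder endpoint
  timeout = 5

  # Check-then-read, roughly what the current loop does: ask IO.select
  # whether the socket is readable, then read with a blocking API
  # (bounded in size).
  if IO.select([socket], nil, nil, timeout)
    data = socket.readpartial(64 * 1024)
  end

  # Truly non-blocking: attempt the read first, and wait only when the
  # kernel reports the socket isn't ready yet.
  begin
    data = socket.read_nonblock(64 * 1024)
  rescue IO::WaitReadable
    socket.wait_readable
    retry
  end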
OS are you testing it on?
OSX, so it's poll-enabled.
Improving this bit would mean rewriting the library,
Yeah it's what I figured :disappointed:
OSX's poll implementation is broken. I would have to look at the code, but I think that ruby falls back to select on OSX. You can check it with strace, though. As for the rest, it's mitigated with a proper fallback to a proper event loop. I've been working on something on the side. It will not help your keys_for issue though. :disappointed:
Hi guys, excellent work digging into this. I'm going on the road for a week on Friday, and I don't have time to get too deep into this with you in the next 24h, but you are in good hands with @mattbrictson, and I'll be reading my emails during CW6 if either of us can do anything to support you. We can always escalate things up to net/ssh if we need to; just let us know and we'll help prepare PRs or issues there.
@byroot It looks like with #327 and https://github.com/net-ssh/net-ssh/issues/308 both merged, this issue is now largely resolved.
I did some benchmarking of my own, and while the current SSHKit ConnectionPool does have some unnecessary overhead, in the big picture it is pretty negligible. I think we have reached the point of diminishing returns. Agreed?
net-ssh/net-ssh#308 in itself doesn't fix anything. I'd like to propose one last feature in SSHKit: replacing Net::SSH's KnownHosts with an alternative implementation that caches the lookups in memory.
Would you agree with such a feature?
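For context, a rough sketch of the kind of in-memory caching I have in mind (hypothetical; the class name is made up, and the search_for / known_hosts: hook is an assumption about how Net::SSH's host key verifiers look up keys):

  require 'net/ssh'

  # Hypothetical wrapper that memoizes known_hosts lookups per host so the
  # files are parsed once per process instead of once per connection.
  class CachingKnownHosts
    def initialize
      @cache = {}
      @mutex = Mutex.new
    end

    def search_for(host, options = {})
      @mutex.synchronize do
        @cache[host] ||= Net::SSH::KnownHosts.search_for(host, options)
      end
    end
  end

  # Assumed usage: handed to Net::SSH via the :known_hosts connection option.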
I'm ok with closing this issue. I got sidetracked, but I'll try to see if there are other performance improvements I could bring; I don't need this issue open to do so.
@byroot Yes, I'm open to having SSHKit ship with a better/faster KnownHost implementation. Let's keep this issue open then until that is complete.
Also: during the past couple days I worked on a rewrite of the SSHKit ConnectionPool that uses much less mutex locking and moves stale connection eviction to a background thread. If I open a PR would you be willing to test my branch against your 180 servers? :grin:
My hunch is that the mutex overhead is not really a big deal, but I'm curious to see the results.
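Roughly, the background-eviction idea looks like this (a hypothetical sketch, not the actual branch):

  # Hypothetical: a single sweeper thread evicts stale connections
  # periodically, so checkins/checkouts never pay for a full prune.
  class PoolSweeper
    def initialize(pool, interval: 30)
      @thread = Thread.new do
        loop do
          sleep interval
          pool.prune_expired
        end
      end
    end

    def stop
      @thread.kill
    end
  end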
If I open a PR would you be willing to test my branch against your 180 servers?
Of course, I'd be happy to.
Just wanted to update on this. Today I updated our capistrano recipe to use all the improvements we made so far (including the yet-unreleased #330), and as far as I'm concerned this issue is solved.
The reason I opened this issue initially is that when I attempted to deploy both our datacenters in the same recipe, it ended up being ~50% slower than doing it in 2 concurrent processes like we used to.
But on the second attempt there is no noticeable difference anymore.
There is likely still room for improvement, especially around epoll and friends, but at this point capistrano's performance is satisfactory at our scale (300-400 servers).
@mattbrictson let me know if you need me to do anything else to get #330 merged, other than that I'm good for closing this issue.
Thanks for the feedback @byroot, and for your time and contributions. I'll merge #330 in a day or two unless @mattbrictson chimes in in the meantime. I didn't see anything blocking, but Matt is taking a short break from FOSS, and I still want to give him the chance, without pressure, to have a say.