
Hangs forever in refresh after 5 minutes when 2ary unreachable

Open telent opened this issue 10 years ago • 9 comments

tl;dr setsockopt(SO_RCVTIMEO) doesn't work for us in Ruby 2.[01] MRI, infinite hangs result

We're having trouble in our Mongo cluster when a secondary becomes unreachable (due to e.g. a crashed instance or a network partition): the symptom is that after five minutes all attempts to contact the primary seem to hang indefinitely (or until Unicorn kills our worker process). Note that our mongoid.yml has options: {read: primary}

Attempting to replicate the problem in a test environment gets us a call stack something like this (some frames omitted for brevity):

     # /home./.bundler/ruby/2.1.0/moped-4c2a0a29122e/lib/moped/connection/socket/connectable.rb:47:in `read'
     # /home./.bundler/ruby/2.1.0/moped-4c2a0a29122e/lib/moped/connection/socket/connectable.rb:47:in `block in read'
     # /home./.bundler/ruby/2.1.0/moped-4c2a0a29122e/lib/moped/connection/socket/connectable.rb:119:in `handle_socket_errors'
     # /home./.bundler/ruby/2.1.0/moped-4c2a0a29122e/lib/moped/connection/socket/connectable.rb:47:in `read'
     # /home./.bundler/ruby/2.1.0/moped-4c2a0a29122e/lib/moped/connection.rb:221:in `with_connection'
     # /home./.bundler/ruby/2.1.0/moped-4c2a0a29122e/lib/moped/connection.rb:114:in `read'
     # /home./.bundler/ruby/2.1.0/moped-4c2a0a29122e/lib/moped/protocol/query.rb:213:in `receive_replies'
     # /home./.bundler/ruby/2.1.0/moped-4c2a0a29122e/lib/moped/node.rb:180:in `block in ensure_connected'
     # /home./.bundler/ruby/2.1.0/moped-4c2a0a29122e/lib/moped/node.rb:432:in `refresh'
     # /home./.bundler/ruby/2.1.0/moped-4c2a0a29122e/lib/moped/cluster.rb:182:in `block in refresh'
     # /home./.bundler/ruby/2.1.0/moped-4c2a0a29122e/lib/moped/cluster.rb:151:in `nodes'
     # /home./.bundler/ruby/2.1.0/moped-4c2a0a29122e/lib/moped/cluster.rb:240:in `with_primary'
     # /home./.bundler/ruby/2.1.0/moped-4c2a0a29122e/lib/moped/read_preference/primary.rb:55:in `block in with_node'

and after some digging the problem seems to be that implementing timeouts via socket options doesn't work in Ruby MRI. Looking at it with strace, I see a call to read that returns after five seconds with EAGAIN, followed by a call to ppoll; the EAGAIN never makes it as far as interpreted code.

getsockname(7, {sa_family=AF_INET, sin_port=htons(44829), sin_addr=inet_addr("192.168.195.6")}, [16]) = 0
setsockopt(7, SOL_SOCKET, SO_SNDTIMEO, "\5\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 16) = 0
futex(0x7f24117f6654, FUTEX_WAIT_PRIVATE, 1, NULL) = 0
futex(0x7f24117f66d0, FUTEX_WAKE_PRIVATE, 1) = 0
select(8, [7], NULL, [7], {0, 0})       = 0 (Timeout)
read(7, 0x7f2413130f00, 100)            = -1 EAGAIN (Resource temporarily unavailable)
ppoll([{fd=7, events=POLLIN}], 1, NULL, NULL, 8  <hang indefinitely>

See also http://stackoverflow.com/questions/9853516/set-socket-timeout-in-ruby-via-so-rcvtimeo-socket-option
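
For anyone who wants to see the underlying behaviour outside Moped, something like the following reproduces it (a standalone sketch, not Moped code; the local listener and the 5 second timeout are just placeholders):

require 'socket'

# A listener that accepts connections (via the listen backlog) but never writes.
server = TCPServer.new('127.0.0.1', 0)
client = TCPSocket.new('127.0.0.1', server.addr[1])

# struct timeval { tv_sec = 5, tv_usec = 0 } on 64-bit Linux
client.setsockopt(Socket::SOL_SOCKET, Socket::SO_RCVTIMEO, [5, 0].pack('l_2'))

# Under strace the read(2) syscall returns EAGAIN after ~5 seconds, but MRI
# then waits on ppoll with no timeout, so this Ruby call never returns.
client.read(100)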

The attached patch has a spec that demonstrates the problem (try running it under strace) and a fix which implements read timeouts "by hand" using select() instead of relying on socket options. A fuller solution would probably also address the write() case and be tested on SSL connections as well as plaintext ones, but I thought I'd push this early diagnosis up for comment first to see if I'm on the right track.
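
For context, the general shape of a "by hand" read timeout is roughly the following (illustrative only, not the patch itself; read_with_timeout and the 5 second default are placeholder names/values, and the error class is Moped's existing ConnectionFailure):

def read_with_timeout(sock, length, timeout = 5)
  data = String.new
  while data.bytesize < length
    # Wait for readability explicitly instead of trusting SO_RCVTIMEO.
    unless IO.select([sock], nil, [sock], timeout)
      raise Moped::Errors::ConnectionFailure, "no data from socket after #{timeout}s"
    end
    data << sock.readpartial(length - data.bytesize)
  end
  data
end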

/cc @dawid-sklodowski @cpoo22 just fyi

telent avatar Nov 03 '14 17:11 telent

Coverage Status

Coverage decreased (-0.06%) when pulling c56498f06d745c32fc0bcd8f32250afc75b57213 on simplybusiness:timeout_in_read into 7e5ac3eca7ce508b90d9ee866537367f87f254ee on mongoid:master.

coveralls avatar Nov 03 '14 17:11 coveralls

cc @ngsmrk @dcoxall

telent avatar Dec 19 '14 15:12 telent

We experience this issue as well. Would love to see it merged.

jonhyman avatar Apr 08 '15 15:04 jonhyman

@telent, have you been running this in production? I tested this out on production and had some really strange errors (moped cursor returning empty documents) that were sporadic and hard to reproduce. I reverted the change for now but will do some more investigation later.

jonhyman avatar Apr 24 '15 04:04 jonhyman

Though to be clear, I had done it by copying the code into a fork of Moped 1.5. I didn't try with Moped 2 yet.

jonhyman avatar Apr 24 '15 04:04 jonhyman

I tried testing this again by issuing a scatter-gather query on a massive collection with hundreds of millions of documents. This wreaked havoc on the database: the query appears to have been retried continually without the cursor ever being closed, so before I CTRL+C'd it we had 19 identical queries running, which caused massive queued reads and blocking. I'm not sure this is safe. That said, I had retrofitted it to 1.5, so it may work in 2.0, but I am bearish on this right now given what happened to us.

jonhyman avatar Apr 27 '15 17:04 jonhyman

@durran have you thought about this problem at all in the new ruby driver? As @telent points out, SO_RCVTIMEO simply does not work in MRI anymore. We experience this issue when we have a bad networking event or when a physical host running multiple mongods freezes or crashes; our processes hang completely and must be kill -9'd, with a stack trace identical to the one @telent describes in this issue. Reviewing the code for the new driver, I think we'll have the same issue as long as IO.select is not used before the socket read. This pull as it stands is problematic for us (my guess is that if a query takes longer than @timeout to return, the ConnectionFailure causes it to be re-issued; as I mentioned above, we wreaked havoc on a database with this pull and a 10 second timeout), but the hang itself is a real operational issue that causes very bad problems for us.

jonhyman avatar Aug 23 '15 18:08 jonhyman

@jonhyman I have a branch up on the new driver that actually does an IO.select before selecting a server to use in an operation... (https://github.com/mongodb/mongo-ruby-driver/pull/670/files#diff-ea2082a78a5d9b36b80ce79b7705943dR86) The server.connectable? call is eventually calling a select before returning it, and the operation would execute immediately after.

durran avatar Aug 23 '15 20:08 durran

I may be wrong, but I don't think that's going to help. Going back to Moped and this issue for a second, the Moped code already has a check_if_alive! immediately before the read, but one issue is that alive? is defined like this (in Moped; it's roughly identical in the new driver):

def alive?
  if Kernel::select([ self ], nil, [ self ], 0)
    !eof? rescue false
  else
    true
  end
rescue IOError
  false
end

so if select returns nil, alive? still returns true. The alive? check uses a 0-second timeout, so all you know is that the socket is not dead; if select returns nil, the socket simply has no data available right now and is not ready for reading. Because the current Moped code does this:

check_if_alive!
handle_socket_errors { super }

you don't know if the socket is ready for reading when you go to do a blocking read, and the super call hangs indefinitely as @telent's stack trace shows (and we have that exact stack trace when it happens to us).
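
You can see the gap with a toy example (not driver code, just an idle local socket):

require 'socket'

server = TCPServer.new('127.0.0.1', 0)
sock   = TCPSocket.new('127.0.0.1', server.addr[1])

# Nothing has been written, so a zero-timeout select returns nil...
IO.select([sock], nil, [sock], 0)   # => nil
# ...which means alive? as defined above returns true, and the caller then
# proceeds straight into a blocking read that can hang forever.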

In your pull, connectable? simply does an alive? check, so if select returns nil (signifying that the socket is not ready for reading), connectable? will still return true. Then you'll end up calling @socket.read in https://github.com/mongodb/mongo-ruby-driver/blob/master/lib/mongo/socket.rb#L172 and I think this code will be vulnerable to hanging again (I have not yet tested; your pull is the last one I'm waiting on before we perform failover testing scenarios with the new driver :grin:).

From all my testing locally, I can't find a way to solve this hanging case in Ruby that does not involve using select to set a "maximum read time", e.g. by making read_from_socket look something like this:

def read_from_socket(length)
  begin
    @socket.read_nonblock(length) || String.new
  rescue IO::WaitReadable, OpenSSL::SSL::SSLErrorWaitReadable
    ready = IO.select([@socket], nil, [@socket], 300)
    if ready
      retry
    else
      raise StandardError.new("Maximum query time exceeded, no data from socket returned in 300 seconds.")
    end
  end
end

Note that here I'm not raising a ConnectionFailure to trigger a retry, I'm failing outright: if the query legitimately takes longer than the max timeout, a ConnectionFailure would cause it to be re-issued, and you can get into the loop we hit 4 months ago, when I tested this and ended up with a dozen and a half huge, slow scatter-gather queries running against a very large cluster, causing high locking and queued reads.

jonhyman avatar Aug 23 '15 20:08 jonhyman