ruby-mqtt icon indicating copy to clipboard operation
ruby-mqtt copied to clipboard

Do nonblocking socket read before select for packet receive

Open camlow325 opened this issue 6 years ago • 1 comments

I've periodically run into some cases where the receive_packet method hangs indefinitely on an IO.select even though the underlying network socket has actually received some bytes from the network for incoming MQTT packets. I've only been able to consistently reproduce this when running with JRuby 9k for some reason but I believe the underlying condition affects MRI Ruby as well.

Specifically, I believe the condition that I'm running into is the same as the buffering issue documented for OpenSSL::SSL::SSLSocket and IO.select calls at http://ruby-doc.org/core-2.2.0/IO.html#method-c-select.

Especially, the combination of nonblocking methods and IO.select is preferred for IO like objects such as OpenSSL::SSL::SSLSocket. It has to_io method to return underlying IO object. IO.select calls to_io to obtain the file descriptor to wait. This means that readability notified by IO.select doesn’t mean readability from OpenSSL::SSL::SSLSocket object. Most possible situation is OpenSSL::SSL::SSLSocket buffers some data. IO.select doesn’t see the buffer. So IO.select can block when OpenSSL::SSL::SSLSocket#readpartial doesn’t block.

I experimented with changing the logic in receive_packet to use a combination of a read_nonblock call (reading only a single byte at most) followed by an IO.select call in the event that no data is available to be read and an IO.WaitReadable-derived exception is thrown. This PR shows the approach that I used. It does make the logic a bit more messy in that for cases that the first byte in the packet is read before MQTT:Packet.read() is called, it passes along the byte so that the Packet.read() method can use it.

With the approach on this PR, I'm no longer able to reproduce the indefinite IO.select() hangs, on MRI Ruby or JRuby. The read throughput also seems like it might be a bit better with a heavy amount of publish packets coming into the socket since available data seems to be received much more quickly.

Thanks for your consideration for this PR and thanks much for all of the great work that you've put into this library!

camlow325 avatar Nov 14 '17 18:11 camlow325

The original commit that I had on the PR, b6cda35, failed the Travis run for 1.8.7. I hadn't remembered that the non-blocking APIs that the commit was using did not exist until 1.9. In a new commit, d23a669, I added some logic which allows for the non-blocking APIs to only be used if available. The tests appear to be passing across Ruby versions and continue to no longer trigger the indefinite hang on select that I have been seeing for JRuby 9k.

camlow325 avatar Nov 15 '17 00:11 camlow325