ruby-mqtt
ruby-mqtt copied to clipboard
Prevent deadlock due to system call error
The MQTT client runs in the main thread which consumes incoming MQTT messages from a Thread::Queue called @read_queue. This queue is fed by the @read_thread, a child thread which reads from a socket in an infinite loop.
We noticed that our application stopped processing incoming MQTT messages, but did not seem to exit or throw an exception either. Upon inspecting the logs, we saw that the @read_thread had crashed due to an unhandled Errno::ECONNRESET while reading from the socket. This had the consuming thread then sleep forever while waiting for new meassages on the @read_queue.
It turned out that the MQTT::Client#receive_packet method had appropriate error handling in place
def receive_packet
# ...
rescue Exception
# ...
end
but this did not rescue Errno::ECONNRESET even though Exception is at the top of the hierarchy of
Ruby's built-in exception classes.
The root cause was that the library also defines a class MQTT::Exception and in the context of #receive_packet the constant Exception refers only to MQTT::Exception (which does not cover Errno::ECONNRESET) where it actually should rescue any subclass of ::Exception.
@phlegx FYI
I'm running into a threaded processing lockup that seems to be a similar issue
Any follow up on getting this merged? This was a real problem for us a few months ago.
We found setting Thread.current.abort_on_exception = true would at least kill the listening process for us (and allow our system to reboot and rebuild the connection), but this fix is much cleaner and less janky.
Thanks for this. The exception code has been wrong for a very long time and it is long overdue getting sorted out.
According to Semver, do you think this change is worthy of a Major, Minor or Patch increment? 🤔
Hey @njh, good to see this progressing!
Major version zero (0.y.z) is for initial development. Anything MAY change at any time. The public API SHOULD NOT be considered stable. https://semver.org/#spec-item-4
Since you're just merging fixes (I assume) I'd argue that is only a PATCH increment. If you're also taking #112, that one mandates a cautionary note in CHANGELOG.md.