rosbridge_suite RosbridgeProtocol instance clean-up hangs when client disconnects under specific conditions.

Description

Under specific conditions that I have not yet been able to pin down a client disconnection will not be gracefully handled, leading to the server attempting to forward messages to that (closed) client's websocket and thus spamming errors. This also leads to a leakage in resources and eventual lock up of the rosbridge process.

The problem seems to come from the last part of the Protocol.incoming function. After adding logging, it appears that this code blocks the IncomingQueue.run loop, so protocol.finish is never triggered for the affected client.

As mentioned I have yet to find a minimal way to reproduce the problem, but we have frequently encountered this when there are rapid connections/disconnections happening and the rosbridge instance is under load.

I was able to "fix" the problem by changing how the remaining message kept in self.buffer is handled, but I would like some input on why this logic exists and how we can fix it properly.
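To make the question concrete, here is a minimal sketch of one bounded way to handle a leftover partial message. It is not the rosbridge implementation and not the change I made; `drain_buffer` is a hypothetical helper, and it assumes incoming data is a stream of concatenated JSON objects. The point is only that every pass through the loop either consumes input or stops, so it cannot spin on the same remainder.

```python
import json

_decoder = json.JSONDecoder()


def drain_buffer(buffer: str):
    """Split `buffer` into complete JSON values and an unparsed remainder.

    Hypothetical helper for illustration. Each loop iteration either
    advances `pos` or breaks, so the loop always terminates.
    """
    messages = []
    pos = 0
    while pos < len(buffer):
        # raw_decode does not accept leading whitespace, so skip it here.
        while pos < len(buffer) and buffer[pos].isspace():
            pos += 1
        if pos >= len(buffer):
            break
        try:
            obj, end = _decoder.raw_decode(buffer, pos)
        except json.JSONDecodeError:
            # Incomplete or invalid data: stop and keep it for the next call
            # instead of retrying on the same bytes.
            break
        messages.append(obj)
        pos = end
    return messages, buffer[pos:]
```

The caller would append newly received data to the returned remainder and call the helper again. A real fix would probably also need to cap the remainder size, since invalid data would otherwise stay in the buffer indefinitely.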

Thanks in advance, and I believe this could explain some of the other issues that went stale in the past.

Library Version: latest ros2 branch
ROS Version: Humble
Platform / OS: Ubuntu 22.04 (docker)

Steps To Reproduce

I have yet to find a reliable way of reproducing the problem, but from my experience the following conditions seem to trigger the problem:

10 clients that rapidly connect, subscribe to a few topics, call a "long" running service, and then disconnect. From the debugging I have done, the service call does not seem to be strictly required, but it may use up resources in a way that helps trigger the buffering of the websocket.
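A rough sketch of such a stress client, in case it helps others try to reproduce this. It assumes the third-party `websockets` package and a rosbridge server on the default port 9090; the topic names, the service name, and the `build_session`, `run_client` and `stress` helpers are all hypothetical placeholders, and I cannot promise this triggers the hang.

```python
import asyncio
import json


def build_session(client_id, topics, service):
    """Return the rosbridge JSON frames one client sends before disconnecting."""
    frames = [
        json.dumps({"op": "subscribe", "id": f"sub:{client_id}:{i}", "topic": topic})
        for i, topic in enumerate(topics)
    ]
    frames.append(
        json.dumps({"op": "call_service", "id": f"srv:{client_id}", "service": service})
    )
    return frames


async def run_client(uri, frames, hold_s=0.1):
    """Connect, send all frames, wait briefly, then disconnect."""
    import websockets  # third-party dependency, imported lazily (assumption)

    async with websockets.connect(uri) as ws:
        for frame in frames:
            await ws.send(frame)
        await asyncio.sleep(hold_s)
    # Leaving the `async with` block closes the websocket.


async def stress(uri="ws://localhost:9090", clients=10, rounds=20):
    """Run `clients` concurrent sessions, repeated `rounds` times."""
    topics = ["/tf", "/scan"]  # placeholder topics
    service = "/hypothetical_slow_service"  # placeholder service name
    for _ in range(rounds):
        await asyncio.gather(
            *(run_client(uri, build_session(i, topics, service)) for i in range(clients))
        )
```

It can be started with `asyncio.run(stress())` while the rosbridge instance is under load from other publishers.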

Expected Behavior

A client disconnecting should always result in the respective RosbridgeProtocol instance (and respective Capabilities) being cleaned up.

Actual Behavior

Sometimes the clean-up (.finish) seems to hang, and resources remain in use against a closed websocket.

Dec 05 '23 16:12 ramlab-jose

I encountered the same issue and might have found the cause. This loop can keep sending the first element of the queued messages.

https://github.com/RobotWebTools/rosbridge_suite/blob/7d78af16d30d0ffe232abcc65d0928ce90bd61f7/rosbridge_library/src/rosbridge_library/internal/subscription_modifiers.py#L164-L168
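To illustrate what I mean, here is a minimal sketch, not the rosbridge code itself. It assumes the failure mode is a loop that handles the first queued element without ever removing it; `drain` and `handle` are hypothetical names. Removing the element before handling it guarantees the loop makes progress even when the handler raises (for example because the websocket is already closed).

```python
from collections import deque


def drain(queue, handle):
    """Handle every queued message exactly once, even if `handle` raises.

    Returns the number of messages whose handler raised an exception.
    """
    failures = 0
    while queue:
        # Pop first, so a failing handler cannot leave the same
        # element at the head of the queue forever.
        msg = queue.popleft()
        try:
            handle(msg)
        except Exception:
            failures += 1
    return failures
```

With a loop of the form "while the queue is non-empty, handle `queue[0]`" and no removal, the condition never becomes false and the thread would keep resending the same message.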

Jul 02 '24 02:07 daisukes

Hi @daisukes, at the time I managed to reduce the occurrence of this problem, although I never found a way to consistently reproduce it. See this commit for my approach (admittedly not the cleanest).

Jul 02 '24 13:07 ramlab-jose

Hi @ramlab-jose

Thanks, I will try it as well!

I had very similar issues with my application, which shows a map, laser scans, and the robot's position using TF. The symptoms match what you described in the original issue:

Under specific conditions that I have not yet been able to pin down a client disconnection will not be gracefully handled, leading to the server attempting to forward messages to that (closed) client's websocket and thus spamming errors. This also leads to a leakage in resources and eventual lock up of the rosbridge process.

I could reproduce it by quickly refreshing my app, but I did not always get the exact symptoms.

lock up (cannot connect to the server) and keep trying to send messages to a closed socket
can connect, but keep trying to send messages to a closed socket
just lock up
...

I found that when it happens, Python threads get stuck in infinite loops in two places so far.

Jul 02 '24 15:07 daisukes
