geometry2 icon indicating copy to clipboard operation
geometry2 copied to clipboard

Fix race condition in MessageFilter

Open rhaschke opened this issue 3 years ago • 3 comments

A bug reported in https://github.com/ros-visualization/rviz/issues/1753 pointed to race condition(s) in tf2_ros::MessageFilter. I implemented a stress test to analyze the problem in detail. There were two issues:

  1. A race condition routing back to #91, #101, #144 As pointed out in #144 the deadlock arises due to an inversion of locking order:
    • MessageFilter::add(), when removing the oldest message, locks:
      • MessageFilter::messages_mutex_
      • BufferCore::transformable_requests_mutex_ (via cancelTransformableRequest())
    • BufferCore::testTransformableRequests() locks:
      • transformable_requests_mutex_
      • MessageFilter::message_mutex_ (via MessageFilter::transformable() callback)
  2. A race condition with CBQueueCallback When MessageFilter::signalMessage() is called asynchronously via a ROS callback queue, this call might still be in progress (holding Signal1::mutex_), while another thread might attempt to remove the MessageFilter. This results in a failed assertion as the MessageFilter's destructor attempts to destroy a locked mutex.

PR #144 attempted to resolve the first issue by postponing callback calls. However, this has the drawback of using references to TransformableRequests outside the protection by transformable_requests_mutex_. Thus, these requests might get deleted by another thread, resulting in a segfault as revealed by the stress test. Here the deadlock is resolved in MessageFilter::add, canceling the requests outside the messages_mutex_ protection.

To resolve the second race condition, another mutex is introduced in MessageFilter, to ensure that CBQueueCallback::call() is finished before destroying the MessageFilter.

These two changes make the stress test pass. @tfoote, please note that this is a PR against Melodic. I will prepare a port for Noetic as well.

rhaschke avatar Jul 01 '22 13:07 rhaschke

Looks like there is still a deadlock :disappointed:

rhaschke avatar Jul 01 '22 14:07 rhaschke

Looks like there is still a deadlock

Resolved as well. But I still see occassionally a race condition when using a ROS callback queue to signal messages, causing the assertion failure of issue (2).

rhaschke avatar Jul 01 '22 15:07 rhaschke

I'm afraid the remaining race condition cannot be fixed. Rather, the ROS callback queue mechanism should be avoided.

While MessageFilter::clear() removes pending messages from the callback queue, there might be still callbacks active when attempting to remove the MessageFilter. Depending on what goes first, either the already destroyed mutex is tried to lock or a locked mutex is tried to destroy, both resulting in an assertion failure.

rhaschke avatar Jul 01 '22 19:07 rhaschke

This PR targets melodic which is EOL, closing this PR. There is already another PR targeting noetic https://github.com/ros/geometry2/pull/539.

ahcorde avatar Feb 02 '24 14:02 ahcorde

I'm fine that you closed this PR, @ahcorde. But it is a pity that this PR was lying around for 19 months w/o any feedback and now being closed due to EOL reasons. I hope the Noetic port (#539) will be considered at some point :wink:

rhaschke avatar Feb 02 '24 16:02 rhaschke