rclcpp
rclcpp copied to clipboard
Fix data race, declare rclcpp callbacks before the rcl entities
The PR https://github.com/ros2/rclcpp/pull/2002 has introduced a data race on entities destruction (which this PR fixes). For example on destruction of a service, since the service is declared first, and the callback is declared last:
ServiceBase::~ServiceBase() {
// Implicitly, the callback is destroyed before the rmw service, due order of declaration
~on_new_request_callback_
// Before the service is destroyed, the rmw service callback points to a destroyed
// function, which if called causes a segmentation fault.
~service_handle_
}
So this PR changes the order of declaration, ensuring callbacks are destroyed last on entities destruction, avoiding the gap in time in which rmw entities hold a reference to a destroyed function. So all works fine!
I could just close the PR description here since issue is gone, but would be nice to discuss:
What if we don't want to rely on this particular ordering of destruction? Because someone could unawarely change the ordering of declaration in the future.
One solution could be reverting https://github.com/ros2/rclcpp/pull/2002, i.e. clearing callbacks before destruction, which works for subscription/clients, but introduces again an issue on services, which should be fixed somehow:
ServiceBase::~ServiceBase() {
clear_on_new_request_callback(); // Now rmw callback is set to null
// The function above internally tries to access the underlying service_handle_,
// which could have already been fini'ed, so seg-faults.
}
The issue is that there is a particular Service constructor, which allows rcl
to own the service, so even before the (implicit) call of ~ServiceBase()
, the underlying rmw_service
has already been destroyed.
I've been looking on how to check if the rmw_service
has already been destroyed, but there is no currently way to do it.
There is a (currently unused) owns_rcl_handle_ bool, which could be leveraged to verify that. So if done, we could have something like:
ServiceBase::~ServiceBase() {
if (owns_rcl_handle_) {
clear_on_new_request_callback();
} else {
// callback_cleared_ could be a bool set to true when user assigns callback, and
// set to false when the user clears the callback
if (!callback_cleared_) {
throw("Owner of ServiceBase should have cleared the callback");
}
}
}
Ok, if we do care about ordering of construction and destruction, this PR should be enough then.
Regarding your previous comments, questions:
Is this reproducible for user application?
Yes it was. Before https://github.com/ros2/rclcpp/pull/2002, the following example would have had a segmentation fault:
auto n = std::make_shared<rclcpp_lifecycle::LifecycleNode>("node");
n->for_each_callback_group([](rclcpp::CallbackGroup::SharedPtr group)
{
group->find_service_ptrs_if(
[](const rclcpp::ServiceBase::SharedPtr & service) {
if (service) {
service->set_on_new_request_callback([](size_t var){});
}
return false;
});
});
n.reset(); // This line would cause a segmentation-fault
Now we don't have the issue anymore, but we introduced the data race this PR is fixing (reordering declarations).
https://github.com/ros2/rclcpp/pull/2002#discussion_r955251763
std::shared_ptr<rcl_service_t> service_handle_ will be deleted after clear_on_new_request_callback(), right? All member variables are alive in the body of the destructor. So this is something like optimization?
You were right that service_handle_
was deleted after clear_on_new_request_callback()
, the problem was that the inner implem of service_handle_
was already freed not due optimization, but since lifecycle node's owns the rcl
server and frees it on it's destructor (here), before the (implicit) call of ~ServiceBase()
.
clear_on_new_request_callback()
was trying to access that already freed implem.
@mauropasse thanks for the explanation, and @alsora shared the overview about this problem. this ownership could be problem and keeping the ordering would be maintenance cost. probably we can get more feedback on this.
after all, i think this PR itself makes sense to me.
about this problem. this ownership could be problem and keeping the ordering would be maintenance cost. probably we can get more feedback on this.
Just for context for any reader, if we don't want to rely on ordering of destruction of these callbacks, we could revert https://github.com/ros2/rclcpp/pull/2002, thus having
ServiceBase::~ServiceBase() {
clear_on_new_request_callback(); // (*)
// 1. ~on_new_request_callback_ (implicit destructor)
// Even if the callback is destroyed, the rmw listener of service does not point
// anymore to it, since was set to nullptr in clear_on_new_request_callback()
// 2. ~service_handle_ (implicit destructor)
}
(*) This can cause a seg-fault (use-after-free), trying to access the underlying `rcl_service`.
What blocks us from doing this, is this Service constructor which allows to construct a service but without having ownership of the rcl_service
. That constructor is used currently by lifecycle nodes to create it's inner services.
It is dangerous, since it can free the rcl_service
's at any time, and there is no way to check if this has happened from the Service
class, which might want to use it (thus use-after-free
segfault).
The Service
class seems to be the only entity which allows to be created without owning the underlying rcl
service, which seems to defeat the purpose of having a Service
class as a wrapper of the rcl_service
, since it does not own it.
Ideally that constructor should not exist, and LifecycleNode should find a way to not own those entities, and let the rclcpp Service
take care of rcl_service
init and fini.
This pull request has been mentioned on ROS Discourse. There might be relevant details there:
https://discourse.ros.org/t/ros-2-tsc-meeting-minutes-2022-10-13/28213/1
@iuhilnehc-ynos requesting another review.
@fujitatomoya I think we need to revert the https://github.com/ros2/rclcpp/pull/2002 and deal with the exception of calling clear_on_new_response_callback
in the destructor based on your comment. I don't know what kind of abnormal case is intended initially to fix by https://github.com/ros2/rclcpp/pull/2002.
@iuhilnehc-ynos basically https://github.com/ros2/rclcpp/pull/2002 fixes what I describe here: https://github.com/ros2/rclcpp/pull/2024#issuecomment-1276227416 Based on discussions on this PR, to not depend on particular order of declarations, I'd partially revert https://github.com/ros2/rclcpp/pull/2002 i.e. clear callbacks on destruction, but not for services since it seg-faults for lifecyle nodes destruction. But the problem is as @fujitatomoya said, it might generate the exception in dtor. In conclusion, this PR as it is, is enough to make all work fine (no exceptions on destructors, no segfaults on lifecycle nodes destruction).
@iuhilnehc-ynos do you have any comment to https://github.com/ros2/rclcpp/pull/2024#issuecomment-1333637948?
Hi, any update on this?
https://github.com/ros2/rclcpp/pull/2024#issuecomment-1382125762
@iuhilnehc-ynos friendly ping.
I just expressed my concern that adjusting the order can't promise the destructor order of these two members. If the situation can't happen, please go ahead with this PR.
CI with repos https://gist.github.com/alsora/2cb20dc394c9722625b409f4a8566c5f