Fast-DDS Service Reliability sometimes hangs lifecycle manager
## Bug report
- Operating System: Ubuntu 20.04
- ROS2 Version: rolling (focal)
- Version or commit hash: main (3ecb4d0c76d7986f1b666b0e96ace739d466169c)
- DDS implementation: FastDDS
## Steps to reproduce issue
Give the lifecycle manager a node to manage that is erroring out and, for some reason, doesn't create its lifecycle services properly.
## Expected behavior
Failure to connect to lifecycle state services should eventually time out, or indicate that the lifecycle node it is trying to manage is not working.
## Actual behavior
The lifecycle manager blocks indefinitely.
## Additional information
I have a hard time reproducing this reliably. I think the lifecycle services failing to come up has something to do with FastDDS shared memory. This is happening with just the nav2 controller_server for me.
This is resulting in behavior observed in #3027, since the Controller server is unable to configure to provide the FollowPath action.
Ok, actually I think this is partially a duplicate of #2917. However, I still think the infinite timeout is problematic.
> I think the lifecycle services failing to come up has something to do with FastDDS shared mem
Do you see this with Cyclone? If not, then you should file this with Fast-DDS as a problem to resolve.
> This is happening with just nav2 controller_server for me.
Try reordering the bringup sequence for lifecycle. My guess is it's not actually an issue with the controller server, it's just first in line.
How is it a duplicate of #2917? That was on nested timer issues.
Do you have a specific suggested change? From what I read, it seems like the service clients aren't being made properly, and without more specific information / lines / issue descriptions, I can only guess at what you mean by that. I think you mean that they're just straight up not working, which is not a Nav2 issue; it's a DDS/RMW issue. But maybe you mean something we can change here, and I'm not sure what that is without more detail.
No, this never happens with CycloneDDS.
Please file a ticket with Fast-DDS and link back here so there's a thread connection. It sounds like that's really the problem -- unless you found something we can do in Nav2 to make it work? But it sounds like this is a communication issue that's out of this project's control. If there's something actionable for us here, I'd be happy to discuss it.
CC @MiguelCompany @EduPonz Another DDS related issue has cropped up
My suggestion is to not use wait_for_service with an essentially infinite timeout. Even if it is an RMW failure, using an infinite timeout masks it and just results in a hanging process. I would propose using a large, but finite timeout. It could default to be something reasonable (5 minutes?) but maybe it would be a good parameter.
The fact that this is an issue in Fast-DDS should be reported to them, though. We can add a timeout so that in case of failure we don't hang forever, but it is problematic that we're failing in the first place when moving to Fast-DDS.
Well, I don't think they are the same thing, since they are in different places. Unless you want the idea to be "throughout Nav2, don't have infinite timeouts waiting for action server / service servers".
Ah you're right, backed that out
> My suggestion is to not use wait_for_service with an essentially infinite timeout. Even if it is an RMW failure, using an infinite timeout masks it and just results in a hanging process. I would propose using a large, but finite timeout. It could default to be something reasonable (5 minutes?) but maybe it would be a good parameter.
Feel free to propose a PR. 5 minutes seems a little nuts; 10 seconds seems more reasonable, even if respawned from a remote server. But yeah, I think a parameter would be good.
Here is the controller server backtrace in this state, while the lifecycle manager is waiting for it:

```
#1  0x00007f01b3e21e30 in std::condition_variable::wait(std::unique_lock<std::mutex>&) ()
   from /lib/x86_64-linux-gnu/libstdc++.so.6
#2  0x00007f01b320997f in rmw_fastrtps_shared_cpp::__rmw_wait(char const*, rmw_subscriptions_s*, rmw_guard_conditions_s*, rmw_services_s*, rmw_clients_s*, rmw_events_s*, rmw_wait_set_s*, rmw_time_s const*) ()
   from /opt/ros/rolling/lib/librmw_fastrtps_shared_cpp.so
#3  0x00007f01b325810a in rmw_wait () from /opt/ros/rolling/lib/librmw_fastrtps_cpp.so
#4  0x00007f01b3b02577 in rcl_wait () from /opt/ros/rolling/lib/librcl.so
#5  0x00007f01b428031e in rclcpp::Executor::wait_for_work(std::chrono::duration<long, std::ratio<1l, 1000000000l> >)
    () from /opt/ros/rolling/lib/librclcpp.so
#6  0x00007f01b42807d3 in rclcpp::Executor::get_next_executable(rclcpp::AnyExecutable&, std::chrono::duration<long, std::ratio<1l, 1000000000l> >) () from /opt/ros/rolling/lib/librclcpp.so
#7  0x00007f01b4284ead in rclcpp::executors::SingleThreadedExecutor::spin() () from /opt/ros/rolling/lib/librclcpp.so
#8  0x00007f01b4283f28 in rclcpp::spin(std::shared_ptr<rclcpp::node_interfaces::NodeBaseInterface>) ()
   from /opt/ros/rolling/lib/librclcpp.so
#9  0x00005557bc0643b8 in main (argc=<optimized out>, argv=0x7ffef583d478)
    at /usr/include/c++/9/bits/shared_ptr_base.h:1020
```
Meanwhile, the lifecycle manager backtrace is here:

```
#0  0x00007fa09d893376 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
#1  0x00007fa09dd91e30 in std::condition_variable::wait(std::unique_lock<std::mutex>&) ()
   from /lib/x86_64-linux-gnu/libstdc++.so.6
#2  0x00007fa09d3e597f in rmw_fastrtps_shared_cpp::__rmw_wait(char const*, rmw_subscriptions_s*, rmw_guard_conditions_s*, rmw_services_s*, rmw_clients_s*, rmw_events_s*, rmw_wait_set_s*, rmw_time_s const*) ()
   from /opt/ros/rolling/lib/librmw_fastrtps_shared_cpp.so
#3  0x00007fa09d43410a in rmw_wait () from /opt/ros/rolling/lib/librmw_fastrtps_cpp.so
#4  0x00007fa09d9aa577 in rcl_wait () from /opt/ros/rolling/lib/librcl.so
#5  0x00007fa09e00431e in rclcpp::Executor::wait_for_work(std::chrono::duration<long, std::ratio<1l, 1000000000l> >)
    () from /opt/ros/rolling/lib/librclcpp.so
#6  0x00007fa09e0047d3 in rclcpp::Executor::get_next_executable(rclcpp::AnyExecutable&, std::chrono::duration<long, std::ratio<1l, 1000000000l> >) () from /opt/ros/rolling/lib/librclcpp.so
#7  0x00007fa09e008ead in rclcpp::executors::SingleThreadedExecutor::spin() () from /opt/ros/rolling/lib/librclcpp.so
#8  0x00007fa09e007f28 in rclcpp::spin(std::shared_ptr<rclcpp::node_interfaces::NodeBaseInterface>) ()
   from /opt/ros/rolling/lib/librclcpp.so
#9  0x00007fa09e0081d0 in rclcpp::spin(std::shared_ptr<rclcpp::Node>) () from /opt/ros/rolling/lib/librclcpp.so
#10 0x000055c20de0d492 in main (argc=<optimized out>, argv=<optimized out>) at /usr/include/c++/9/ext/atomicity.h:96
```
I can run `ros2 service list` and see

```
/controller_server/change_state
/controller_server/get_state
```
and if I try calling get_state, it appears to finish waiting, but it never gets a response from the controller server:

```
$ ros2 service call /controller_server/get_state lifecycle_msgs/srv/GetState "{}"
waiting for service to become available...
requester: making request: lifecycle_msgs.srv.GetState_Request()
```
And I can't query it via `ros2 lifecycle`:

```
$ ros2 lifecycle get /controller_server
Node not found
```
It is important to note that this error condition does not happen every time. It is inconsistent.
Here is a first effort at a reproduction. It reproduces the problem only some of the time though. https://gist.github.com/Aposhian/043359e09a203900e8db55407a8b5e38
@Aposhian One thing I don't understand: in the issue description you say this happens on rolling, but the gist you just shared uses ros:galactic as the base Docker image.
So my question is: Does the issue only reproduce in galactic? Does it also reproduce on humble/rolling?
Note that the Rolling binaries on 20.04 are outdated, since Rolling has moved to 22.04, so there may be updates missing if you're not using 22.04 as the base or building from source.
Yes, I am using old rolling on focal, so closer to galactic than to humble.
Yes, I cannot reproduce this issue on Humble images.
Keep in mind that Galactic is EOL in 5 months.
Since this is a Galactic-specific problem, what do you want to do about it? Would it be possible to backport the patch, @MiguelCompany? If not, we could try to patch something in just the galactic branch for you @Aposhian in the interim. If it's just for Galactic, to deal with an issue in Fast-DDS, I'm OK being a bit sloppier about it than I normally would, since its lifecycle is very limited and the patch would not be pushed into future distributions -- the maintenance cost and technical reach are limited.
@MiguelCompany suggested I try building FastDDS from source on the 2.3.x branch, which has unreleased fixes for galactic. I don't think any Nav2 action is required. I will update if that is a viable workaround for galactic.
OK - if that fixes things feel free to close when ready!
Ok, I tried building from 2.3.x FastDDS, but it doesn't resolve the issue. https://gist.github.com/Aposhian/043359e09a203900e8db55407a8b5e38
| RMW | Daemon | Try 1 | Try 2 | Try 3 |
| --- | --- | --- | --- | --- |
| rmw_cyclonedds_cpp | ✅ | ✅ | ✅ | ✅ |
| rmw_cyclonedds_cpp | ❌ | ✅ | ✅ | ✅ |
| rmw_fastrtps_cpp | ✅ | ❌ | ✅ | ✅ |
| rmw_fastrtps_cpp | ❌ | ✅ | ✅ | ✅ |
@MiguelCompany while not specifically for Nav2, lifecycle services failing is still happening on humble intermittently. Can we migrate/copy this issue to Fast-DDS?
Agreed, some private conversations I've had in recent weeks have also expressed issues with services in Humble
@MiguelCompany what's the next step / status on this issue?
Any updates? This is a pretty critical issue and trying to find if we need to take action to resolve it :smile:
In the same theme as #3032, I want to get this off the queue, so I'm trying to find out the actionable things we can do to resolve this. I know the main issue is Fast-DDS not handling lots of service calls well, but I suppose we can improve the way these failures surface. The lifecycle manager and the BT Action Node in BT Navigator are the only 2 places where we intentionally have waits without timeouts, since these are pretty critical operations.
For this (lifecycle manager), what should happen if a service for transitioning doesn't come up? Fail everything? I suppose we could also remove the checks for services being up in the constructors (https://github.com/ros-planning/navigation2/blob/main/nav2_util/src/lifecycle_service_client.cpp#L39), knowing that we check if they're up before we submit requests in the change/get methods that were recently added to support respawning. I'll say, though, that the current state of the lifecycle services and the manager is a stable but thoughtfully set-up balance due to supporting Bond, Respawn, and Lifecycle all at the same time. Removing that could cause some real headaches elsewhere, but I won't be sure until I try whether that seems to be the best answer.
Like in the other ticket though, I think the priority should be on working with Fast-DDS to be able to resolve this issue outright.
Yeah, I'm torn as to whether this is something that Nav2 should consider fallible as a part of normal operation and gracefully handle. But once you start to question the underlying RMW, lots of other assumptions go out the window. Even if we don't add a configurable timeout, we could probably make the log messages more explicit about what it is waiting on, so there is no ambiguity -- like "Waiting for lifecycle service..." or something like that.
Hi @Aposhian @SteveMacenski
Since you also said this is happening on Humble, we have moved our investigations there, with no luck so far. Using your reproducer on Humble yields no failures; we are able to run it hundreds of times without error.
Do you have any other clue on how to have this issue reproduced on Humble?
We also have this PR open that aims to bring Fast DDS's Waitsets to the RMW. It's still under review but you could try using it and see if the situation improves.
When I get time, I can try with that rmw_fastrtps branch. I don't have a more comprehensive reproduction for humble issues at this time.
I've seen this on rare occasions on my side (maybe only 1-2 times since moving to Humble/Rolling on 22.04) when doing R&D. I can't say whether or not it was happening before then; I was using the default Galactic RMW of Cyclone, since I like to dogfood exactly what users will experience out of the box.
> Even if we don't add in a configurable timeout, we could probably make the log messages more explicit as to what it is waiting on, so there is no ambiguity like "Waiting for lifecycle service..." or something like that.
I suppose we could change

```cpp
get_state_.wait_for_service();
```

to

```cpp
while (!get_state_.wait_for_service(std::chrono::seconds(2))) {
  std::this_thread::sleep_for(std::chrono::milliseconds(100));
  // print a "waiting for lifecycle service" message here
}
```
Would that help? But like I said, we could also just remove the wait for service outright in the constructor of the lifecycle service client. If you think it's better to not block at all initially, I could make that change in a PR, make sure that (1) bond, (2) respawn, and (3) lifecycle continue to work, and throw it to you to see if it masks the issue well enough. When errors happen, though, they'd be more likely during steady-state runtime rather than at startup. Personally, I'd rather know about issues immediately and block startup than get going and find out interfaces to other ROS servers just crash and burn.
I think that sounds reasonable. However, we won't be much use testing this at the moment, since we have decided to use CycloneDDS with Humble for our main operation at the moment.