rclcpp icon indicating copy to clipboard operation
rclcpp copied to clipboard

Possible regression in rcl preshutdown callbacks - context invalid?

Open SteveMacenski opened this issue 1 year ago • 11 comments

Bug report

Required Info:

  • Operating System:
    • 24.04
  • Installation type:
    • ROS 2 Jazzy, Rolling
  • Version or commit hash:
    • 28.2.0-1noble.20240430.174609 (from docker)
  • DDS implementation:
    • N/A
  • Client library (if applicable):
    • rclcpp

Description

I've noticed in the Nav2 transition in both unit tests and now at the system level that there appears to be a regression in the handling of the context and the pre-shutdown callbacks starting with the transition to 24.04 and seen in both Jazzy and current rolling. There has been some extended discussion wrt the unit tests in the comments:

  • https://github.com/ros-navigation/navigation2/pull/4298#issuecomment-2121319831
  • https://github.com/ros-navigation/navigation2/pull/4298#issuecomment-2121351566
  • https://github.com/ros-navigation/navigation2/pull/4298#issuecomment-2125809383

But I'm also now seeing it on shutdown of Nav2 at the system level now that we have Gazebo Harmonic working with Nav2. We run our rcl preshutdown callbacks to transition the nodes that may not have been properly transitioned to shutdown before Control+C was hit. This is executing as expected with logs:

[component_container_isolated-8] [INFO] [1717089793.781486696] [controller_server]: Running Nav2 LifecycleNode rcl preshutdown (controller_server)
[component_container_isolated-8] [INFO] [1717089793.781516996] [controller_server]: Destroying bond (controller_server) to lifecycle manager.
[component_container_isolated-8] [INFO] [1717089793.781521867] [smoother_server]: Running Nav2 LifecycleNode rcl preshutdown (smoother_server)
[component_container_isolated-8] [INFO] [1717089793.781526590] [smoother_server]: Destroying bond (smoother_server) to lifecycle manager.
...

But then also contain traces from rcl which seem related to publishers failing to do their thing because the context is invalid at the stage of pre-shutdown. We see only publisher failures when we have the system properly cycled down, but if we have the Nav2 lifecycle nodes still active so services are called to transition them, then we see the service eq as well.

[component_container_isolated-8] [INFO] [1717089793.874960887] [controller_server]: Destroying
[component_container_isolated-8] [ERROR] [1717089793.875015316] [controller_server]: Unable to start transition 5 from current state shuttingdown: Could not publish transition: publisher's context is invalid, at ./src/rcl/publisher.c:423, at ./src/rcl_lifecycle.c:368
[component_container_isolated-8] [WARN] [1717089793.875026497] [rclcpp_lifecycle]: Shutdown error in destruction of LifecycleNode: final state(unconfigured)
...

I clipped the logs to show one server in particular, but this repeats for all instances.

This can be reproduced in Nav2 using the 24.04/Rolling + the simulator in https://github.com/ros-navigation/navigation2/pull/3634. Drive around for a moment, stop, and hit control+C. We previously had a clean shutdown without crash. It can also be more minimally reproduced in the PR https://github.com/ros-navigation/navigation2/pull/4298/files TEST(WPTest, test_dynamic_parameters) test changes -- where we manually shut things down & reset the object internally to the test to avoid the need for the rcl pre-shutdown to do work, thereby bypassing it

SteveMacenski avatar May 30 '24 17:05 SteveMacenski

I don't know if its related, but also point out this issue as well discovered which points to issues in spin() exiting cleanly when Control+C is hit so that rclcpp::shutdown or others can be called once the node is kicked out. It seems related, but I can file a separate ticket for it. Similarly to that @tonynajjar has another ticket on spin_some not working the first time its called in some cases.

SteveMacenski avatar May 30 '24 17:05 SteveMacenski

@SteveMacenski

https://github.com/ros2/rclcpp/pull/2545 can fix that graceful shutdown case, and generated by https://github.com/ros2/rclcpp/pull/2528. (sorry...)

fujitatomoya avatar May 30 '24 17:05 fujitatomoya

No worries, shit happens. Another good reason Nav2's around using all these bells and whistles as a canary in the coalmine. Only used up like a day on this so not bad at all in the scheme of things. Lovely that there's already a solution! Thanks @fujitatomoya

SteveMacenski avatar May 30 '24 18:05 SteveMacenski

https://github.com/ros2/rclcpp/pull/2545 fixes LifecycleNode to avoid leaving in unknow state, if the object is destroyed. (originally this is issued https://github.com/ros2/rclcpp/issues/997).

But if the context is gracefully shutdown (e.g deferred signal handler via SIGINT, which is the exactly case here), the context is already shutdown, and that makes the lifecycle node dtor fails to publish the transition (lifecyclenode should be shutdown to avoid unknow state for sensors and devices but fails since context is not valid anymore). https://github.com/ros2/rclcpp/pull/2545 bypass this situation at this moment, shutdown is called only when the context is valid. this is tagged with TODO until we figure out how to address this condition.

https://github.com/ros2/rclcpp/pull/2545 CI is running now, once i have green light, i will merge it to rolling.

fujitatomoya avatar May 30 '24 18:05 fujitatomoya

BTW: this worked fine those pre-24.04 Rolling (22.04 Rolling didn't exhibit this) so there was something that changed that caused this, it was fully functioning for ~a year prior. I think my understanding of your response is that there's a bit of a catch-22, so mentioning this as a "I think there should be a solution, no?"

SteveMacenski avatar May 30 '24 18:05 SteveMacenski

@SteveMacenski i think there are 2 problems here,

[component_container_isolated-8] [ERROR] [1717089793.875015316] [controller_server]: Unable to start transition 5 from current state shuttingdown: Could not publish transition: publisher's context is invalid, at ./src/rcl/publisher.c:423, at ./src/rcl_lifecycle.c:368 [component_container_isolated-8] [WARN] [1717089793.875026497] [rclcpp_lifecycle]: Shutdown error in destruction of LifecycleNode: final state(unconfigured)

I am sure these messages are generated by https://github.com/ros2/rclcpp/pull/2528 as explained https://github.com/ros2/rclcpp/issues/2547#issuecomment-2140512199, and https://github.com/ros2/rclcpp/pull/2545 can fix it.

note: https://github.com/ros2/rclcpp/pull/2528 IS NOT merged to jazzy. (https://github.com/ros2/rclcpp/pull/2542 staging but merged), so we should not meet these message with jazzy.

24.04 and seen in both Jazzy and current rolling. There has been some extended discussion wrt the unit tests in the comments:

  • https://github.com/ros-navigation/navigation2/pull/4298#issuecomment-2121319831
  • https://github.com/ros-navigation/navigation2/pull/4298#issuecomment-2121351566
  • https://github.com/ros-navigation/navigation2/pull/4298#issuecomment-2125809383

these 3 issues are with rmw_fastrtps, right? this looks the same problem with https://github.com/ros2/rmw_fastrtps/issues/761

fujitatomoya avatar May 30 '24 18:05 fujitatomoya

The second comment is definitely the same as this.

The first is perhaps? Its also another 'things failing on shutdown' example and you see similar logs with the pre-shutdown (the third is a summarized version of the same). I didn't assume it was a Fast-DDS specific problem, but :shrug:

    [INFO] [1716243365.746525522] [costmap]: Running Nav2 LifecycleNode rcl preshutdown (costmap)
    [INFO] [1716243365.746611543] [costmap]: Destroying bond (costmap) to lifecycle manager.
    Failed to delete datawriter, at ./src/publisher.cpp:45 during '__function__'
    cannot publish data, at ./src/rmw_publish.cpp:66 during '__function__'
    Fail in delete datareader, at ./src/rmw_service.cpp:89 during '__function__'

SteveMacenski avatar May 30 '24 20:05 SteveMacenski

@SteveMacenski

https://github.com/ros2/rclcpp/pull/2545 has been merged to rolling, the following waning and error should not be printed even with graceful shutdown.

[component_container_isolated-8] [ERROR] [1717089793.875015316] [controller_server]: Unable to start transition 5 from current state shuttingdown: Could not publish transition: publisher's context is invalid, at ./src/rcl/publisher.c:423, at ./src/rcl_lifecycle.c:368 [component_container_isolated-8] [WARN] [1717089793.875026497] [rclcpp_lifecycle]: Shutdown error in destruction of LifecycleNode: final state(unconfigured)

fujitatomoya avatar May 30 '24 22:05 fujitatomoya

Great! Currently in the middle of other firefighting with getting Nav2 out for Jazzy, but I'll test this out once the nightly docker containers update and I'm between burning buildings. If you're confident that it addresses the issue, feel free to close

SteveMacenski avatar May 30 '24 23:05 SteveMacenski

@SteveMacenski after https://github.com/ros2/rclcpp/pull/2545 is merged in rolling, are you guys still having the other issue?

    [INFO] [1716243365.746525522] [costmap]: Running Nav2 LifecycleNode rcl preshutdown (costmap)
    [INFO] [1716243365.746611543] [costmap]: Destroying bond (costmap) to lifecycle manager.
    Failed to delete datawriter, at ./src/publisher.cpp:45 during '__function__'
    cannot publish data, at ./src/rmw_publish.cpp:66 during '__function__'
    Fail in delete datareader, at ./src/rmw_service.cpp:89 during '__function__'

fujitatomoya avatar Jun 06 '24 17:06 fujitatomoya

@fujitatomoya sorry for the delay, the 24.04 & GZ migrations took longer than we expected in Nav2-ville. Is there a docker base container available that contains these updates? If so, I can test this tomorrow and let you know!

SteveMacenski avatar Jul 10 '24 00:07 SteveMacenski

I believe this is outdated, sorry for not closing before!

SteveMacenski avatar Jan 30 '25 17:01 SteveMacenski