Ubuntu 22.04 BUG: Failed to delete datawriter in nav2_util tests
Bug report
Required Info:
- Operating System:
- Ubuntu 22.04
- Installation type:
- Binaries
- Version or commit hash:
- ros-rolling-rmw 7.2.1-1jammy.20231004.150642
- DDS implementation:
- default
- Client library (if applicable):
- rclcpp
Steps to reproduce issue
Follow the NAV2 building from source instructions, with the repo on latest rolling hash 113564965f54009686d521902ff3fcc9d101c5b5.
colcon build --mixin debug --packages-up-to nav2_util
. install/setup.bash
colcon test --packages-select nav2_util --event-handlers=console_cohesion+ --ctest-args " -R" " test_actions"
If you want to build everything from source, you need to check out ros rolling sources, nav2 sources, and clone this in the workspace:
- https://github.com/ros-geographic-info/geographic_info/tree/ros2
- [email protected]:ros/bond_core.git on
ros2branch
Expected behavior
Tests pass
Actual behavior
RMW has an internal error failing to delete the datawriter in test_actions. I already talked to the NAV2 maintainers (Steve) and he said to file the issue here. He sees it in CI, but he says since I can reproduce it locally, to file it ASAP.
Start 12: test_actions
12: Test command: /usr/bin/python3.10 "-u" "/opt/ros/rolling/share/ament_cmake_test/cmake/run_test.py" "/home/ryan/Dev/ros2_ws/src/navigation2/build/nav2_util/test_results/nav2_util/test_actions.gtest.xml" "--package-name" "nav2_util" "--output-file" "/home/ryan/Dev/ros2_ws/src/navigation2/build/nav2_util/ament_cmake_gtest/test_actions.txt" "--command" "/home/ryan/Dev/ros2_ws/src/navigation2/build/nav2_util/test/test_actions" "--gtest_output=xml:/home/ryan/Dev/ros2_ws/src/navigation2/build/nav2_util/test_results/nav2_util/test_actions.gtest.xml"
12: Test timeout computed to be: 60
12: -- run_test.py: invoking following command in '/home/ryan/Dev/ros2_ws/src/navigation2/build/nav2_util/test':
12: - /home/ryan/Dev/ros2_ws/src/navigation2/build/nav2_util/test/test_actions --gtest_output=xml:/home/ryan/Dev/ros2_ws/src/navigation2/build/nav2_util/test_results/nav2_util/test_actions.gtest.xml
12: [==========] Running 7 tests from 1 test suite.
12: [----------] Global test environment set-up.
12: [----------] 7 tests from ActionTest
12: [ RUN ] ActionTest.test_simple_action
12: [INFO] [1702060361.370104450] [fibonacci_server_node]: Activating
12: Teardown
12: Teardown...
12: Teardown complete
12: [ OK ] ActionTest.test_simple_action (1119 ms)
12: [ RUN ] ActionTest.test_simple_action_with_feedback
12: Teardown
12: Teardown...
12: Teardown complete
12: [ OK ] ActionTest.test_simple_action_with_feedback (906 ms)
12: [ RUN ] ActionTest.test_simple_action_activation_cycling
12: [INFO] [1702060363.383970802] [fibonacci_server_node]: Deactivating
12: [WARN] [1702060363.383992725] [fibonacci_server_node]: [fibonacci] [ActionServer] Requested to deactivate server but goal is still executing. Should check if action server is running before deactivating.
12: [INFO] [1702060363.484062276] [fibonacci_server_node]: [fibonacci] [ActionServer] Waiting for async process to finish.
12: [WARN] [1702060363.484068138] [fibonacci_server_node]: [fibonacci] [ActionServer] Stopping the thread per request.
12: [WARN] [1702060363.484149679] [fibonacci_server_node]: [fibonacci] [ActionServer] Aborting handle.
12: Sent goal, spinning til complete...
12: [INFO] [1702060363.492394679] [fibonacci_server_node]: Activating
12: Getting result, spinning til complete...
12: Teardown
12: Teardown...
12: Teardown complete
12: [ OK ] ActionTest.test_simple_action_activation_cycling (1215 ms)
12: [ RUN ] ActionTest.test_simple_action_preemption
12: Sent goal, spinning til complete...
12: Sent goal, spinning til complete...
12: Getting result, spinning til complete...
12: Teardown
12: Teardown...
12: Teardown complete
12: [ OK ] ActionTest.test_simple_action_preemption (106 ms)
12: [ RUN ] ActionTest.test_simple_action_preemption_after_succeeded
12: [INFO] [1702060364.705742069] [fibonacci_server_node]: Ignoring preemptions
12: Teardown
12: Teardown...
12: Teardown complete
12: [ OK ] ActionTest.test_simple_action_preemption_after_succeeded (1906 ms)
12: [ RUN ] ActionTest.test_handle_goal_deactivated
12: [INFO] [1702060366.612432927] [fibonacci_server_node]: Deactivating
12: Teardown
12: [INFO] [1702060366.612516663] [fibonacci_server_node]: Activating
12: Teardown...
12: Teardown complete
12: [ OK ] ActionTest.test_handle_goal_deactivated (6 ms)
12: [ RUN ] ActionTest.test_handle_cancel
12: Teardown
12: Teardown...
12: Teardown complete
12: [ OK ] ActionTest.test_handle_cancel (6 ms)
12: [----------] 7 tests from ActionTest (5266 ms total)
12:
12: [----------] Global test environment tear-down
12: [==========] 7 tests from 1 test suite ran. (5266 ms total)
12: [ PASSED ] 7 tests.
12: [WARN] [1702060366.621645841] [fibonacci_server_node]: [fibonacci] [ActionServer] Current goal was not completed successfully.
12: [WARN] [1702060366.621658776] [fibonacci_server_node]: [fibonacci] [ActionServer] Client requested to cancel the goal. Cancelling.
12:
12: >>> [rcutils|error_handling.c:108] rcutils_set_error_state()
12: This error state is being overwritten:
12:
12: 'feedback publisher is invalid, at ./src/rcl_action/action_server.c:936'
12:
12: with this new error message:
12:
12: 'Fail in delete datawriter, at ./src/rmw_service.cpp:115'
12:
12: rcutils_reset_error() should be called after error handling to avoid this.
12: <<<
12:
12: >>> [rcutils|error_handling.c:108] rcutils_set_error_state()
12: This error state is being overwritten:
12:
12: 'Fail in delete datawriter, at ./src/rmw_service.cpp:115, at ./src/rcl/service.c:262'
12:
12: with this new error message:
12:
12: 'cannot publish data, at ./src/rmw_publish.cpp:62'
12:
12: rcutils_reset_error() should be called after error handling to avoid this.
12: <<<
12: cannot publish data, at ./src/rmw_publish.cpp:62 during '__function__'
12: Fail in delete datareader, at ./src/rmw_service.cpp:100 during '__function__'
12:
12: >>> [rcutils|error_handling.c:108] rcutils_set_error_state()
12: This error state is being overwritten:
12:
12: 'Fail in delete datawriter, at ./src/rmw_service.cpp:115, at ./src/rcl/service.c:262'
12:
12: with this new error message:
12:
12: 'cannot publish data, at ./src/rmw_publish.cpp:62'
12:
12: rcutils_reset_error() should be called after error handling to avoid this.
12: <<<
12: cannot publish data, at ./src/rmw_publish.cpp:62 during '__function__'
12: Fail in delete datareader, at ./src/rmw_service.cpp:100 during '__function__'
12:
12: >>> [rcutils|error_handling.c:108] rcutils_set_error_state()
12: This error state is being overwritten:
12:
12: 'Fail in delete datawriter, at ./src/rmw_service.cpp:115, at ./src/rcl/service.c:262'
12:
12: with this new error message:
12:
12: 'cannot publish data, at ./src/rmw_publish.cpp:62'
12:
12: rcutils_reset_error() should be called after error handling to avoid this.
12: <<<
12: Failed to delete datawriter, at ./src/publisher.cpp:45 during '__function__'
12: Failed to delete datawriter, at ./src/publisher.cpp:45 during '__function__'
12: cannot publish data, at ./src/rmw_publish.cpp:62 during '__function__'
12: [ERROR] [1702060366.622855928] [fibonacci_server_node.rclcpp]: Error in destruction of rcl subscription handle: Failed to delete datareader, at ./src/subscription.cpp:52, at ./src/rcl/subscription.c:199
12: cannot publish data, at ./src/rmw_publish.cpp:62 during '__function__'
12: [ERROR] [1702060366.622872742] [fibonacci_server_node.rclcpp]: Error in destruction of rcl subscription handle: Failed to delete datareader, at ./src/subscription.cpp:52, at ./src/rcl/subscription.c:199
12: cannot publish data, at ./src/rmw_publish.cpp:62 during '__function__'
12: [ERROR] [1702060366.622885286] [fibonacci_server_node.rclcpp]: Error in destruction of rcl subscription handle: Failed to delete datareader, at ./src/subscription.cpp:52, at ./src/rcl/subscription.c:199
12: cannot publish data, at ./src/rmw_publish.cpp:62 during '__function__'
12: [ERROR] [1702060366.622898322] [fibonacci_server_node.rclcpp]: Error in destruction of rcl subscription handle: Failed to delete datareader, at ./src/subscription.cpp:52, at ./src/rcl/subscription.c:199
12: cannot publish data, at ./src/rmw_publish.cpp:62 during '__function__'
12: Fail in delete datareader, at ./src/rmw_service.cpp:100 during '__function__'
12: [ERROR] [1702060366.622925316] [fibonacci_server_node.rclcpp]: Error in destruction of rcl service handle: Fail in delete datawriter, at ./src/rmw_service.cpp:115, at ./src/rcl/service.c:262
12: cannot publish data, at ./src/rmw_publish.cpp:62 during '__function__'
12: Fail in delete datareader, at ./src/rmw_service.cpp:100 during '__function__'
12: [ERROR] [1702060366.622944834] [fibonacci_server_node.rclcpp]: Error in destruction of rcl service handle: Fail in delete datawriter, at ./src/rmw_service.cpp:115, at ./src/rcl/service.c:262
12: cannot publish data, at ./src/rmw_publish.cpp:62 during '__function__'
12: Fail in delete datareader, at ./src/rmw_service.cpp:100 during '__function__'
12: [ERROR] [1702060366.622964233] [fibonacci_server_node.rclcpp]: Error in destruction of rcl service handle: Fail in delete datawriter, at ./src/rmw_service.cpp:115, at ./src/rcl/service.c:262
12: cannot publish data, at ./src/rmw_publish.cpp:62 during '__function__'
12: Fail in delete datareader, at ./src/rmw_service.cpp:100 during '__function__'
12: [ERROR] [1702060366.622982479] [fibonacci_server_node.rclcpp]: Error in destruction of rcl service handle: Fail in delete datawriter, at ./src/rmw_service.cpp:115, at ./src/rcl/service.c:262
12: cannot publish data, at ./src/rmw_publish.cpp:62 during '__function__'
12: Fail in delete datareader, at ./src/rmw_service.cpp:100 during '__function__'
12: [ERROR] [1702060366.622998631] [fibonacci_server_node.rclcpp]: Error in destruction of rcl service handle: Fail in delete datawriter, at ./src/rmw_service.cpp:115, at ./src/rcl/service.c:262
12: cannot publish data, at ./src/rmw_publish.cpp:62 during '__function__'
12: Fail in delete datareader, at ./src/rmw_service.cpp:100 during '__function__'
12: [ERROR] [1702060366.623017288] [fibonacci_server_node.rclcpp]: Error in destruction of rcl service handle: Fail in delete datawriter, at ./src/rmw_service.cpp:115, at ./src/rcl/service.c:262
12: Failed to delete datawriter, at ./src/publisher.cpp:45 during '__function__'
12: [ERROR] [1702060366.623032588] [fibonacci_server_node.rclcpp]: Error in destruction of rcl publisher handle: cannot publish data, at ./src/rmw_publish.cpp:62, at ./src/rcl/publisher.c:201
12: cannot publish data, at ./src/rmw_publish.cpp:62 during '__function__'
12: Fail in delete datareader, at ./src/rmw_service.cpp:100 during '__function__'
12: [ERROR] [1702060366.623059261] [fibonacci_server_node]: Error in shutdown of get_type_description service: Fail in delete datawriter, at ./src/rmw_service.cpp:115, at ./src/rcl/service.c:262
12:
12: >>> [rcutils|error_handling.c:108] rcutils_set_error_state()
12: This error state is being overwritten:
12:
12: 'Fail in delete datawriter, at ./src/rmw_service.cpp:115, at ./src/rcl/service.c:262'
12:
12: with this new error message:
12:
12: 'cannot publish data, at ./src/rmw_publish.cpp:62'
12:
12: rcutils_reset_error() should be called after error handling to avoid this.
12: <<<
12:
12: >>> [rcutils|error_handling.c:108] rcutils_set_error_state()
12: This error state is being overwritten:
12:
12: 'cannot publish data, at ./src/rmw_publish.cpp:62'
12:
12: with this new error message:
12:
12: 'Failed to delete datareader, at ./src/subscription.cpp:52'
12:
12: rcutils_reset_error() should be called after error handling to avoid this.
12: <<<
12: __function__:80: 'destroy_subscription' failed
12: test_actions: pthread_mutex_lock.c:94: ___pthread_mutex_lock: Assertion `mutex->__data.__owner == 0' failed.
12: -- run_test.py: return code -6
12: -- run_test.py: inject classname prefix into gtest result file '/home/ryan/Dev/ros2_ws/src/navigation2/build/nav2_util/test_results/nav2_util/test_actions.gtest.xml'
12: -- run_test.py: verify result file '/home/ryan/Dev/ros2_ws/src/navigation2/build/nav2_util/test_results/nav2_util/test_actions.gtest.xml'
12/19 Test #12: test_actions .....................***Failed 5.39 sec
test 13
Additional information
CPU: AMD® Ryzen 9 7950x 16-core processor × 32
GPU: NVIDIA GeForce RTX 3070/PCIe/SSE2 / NVIDIA Corporation GA104 [GeForce RTX 3070]
$ uname -a
Linux B650-970 6.2.0-35-generic #35~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Fri Oct 6 10:23:26 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
We see this in Nav2's CI periodically as well, as reported on the Slack (@clalancette ) It is pretty concerning to us that we again have new RMW errors appearing for really basic tests. This is a new thing in the last few months
@SteveMacenski @Ryanf55
This issue has been there for a long time? Or just stated failing recently? if you happen know which commit causes this error behavior on Nav2, that would be really helpful.
[ERROR] [1702060366.623059261] [fibonacci_server_node]: Error in shutdown of get_type_description service: Fail in delete datawriter
could be related to Type Description Support... but not sure.
CC: @iuhilnehc-ynos @Barry-Xu-2018
It’s not deterministic in our CI, so I can’t pinpoint the exact commit. Its been the last few months though.
@SteveMacenski @Ryanf55
This issue has been there for a long time? Or just stated failing recently? if you happen know which commit causes this error behavior on Nav2, that would be really helpful.
[ERROR] [1702060366.623059261] [fibonacci_server_node]: Error in shutdown of get_type_description service: Fail in delete datawriter
could be related to
Type Description Support... but not sure.CC: @iuhilnehc-ynos @Barry-Xu-2018
I'll try to set up a source build of rmw to run a bisect on when I get home in a few days.
Sorry, I haven't had time to do this yet. It might be easier for me to just debug this with gdb because a bisect involves recompiling ros2, then nav2, which takes a while each iteration (158 packages using --packages-up-to).
Here's the bisect report:
- [ ] 6.0.0 - doesn't compile in rmw_dds_common
- [ ] 7.0.0 - doesn't compile in rmw_dds_common
/home/ryan/Dev/ros2_rolling/src/ros2/rmw_dds_common/rmw_dds_common/include/rmw_dds_common/qos.hpp:242:3: error: ‘rosidl_type_hash_t’ has not been declared
- [ ] 7.2.2 - test fails for the same reason
- [ ] 7.1.0 - test fails for same reason
If I want to bisect this further, I'll need to start doing the bisect across multiple repos since they are coupled together due to the public API additions. Are there any tools to roll back state across multiple repos, say with vcs? Or, do I manually need to figure out which hashes of each repo correspond to each other?
@SteveMacenski Does NAV2 CI give you any way to look at test regressions per test to see the first date that test started failing? I know cdash has regression reporting capabilities.
From the CDash blog, it looks like you can look at test failure frequency, but not over time. https://circleci.com/blog/how-to-output-junit-tests-through-circleci-2-0-for-expanded-insights/
Unless CircleCI produces that, we dont have any unique tools
I encountered very similar error, https://github.com/mavlink/mavros/actions/runs/8129217925/job/22234646274 the same code works on Humble, but fails on Iron.