rosbag2
rosbag2 copied to clipboard
Improve the reliability of rosbag2 tests
I initially started this series trying to track down a rare failing flakey test with test_record__rmw_cyclonedds_cpp. That particular flake seems to be able to happen because sometimes discovery takes longer than we expect, and it is possible that the tests "miss" the first publication. If that's the case, then the rest of the test may fail because it is expecting a certai number of messages. Along the way, we cleanup the tests a bit:
- Remove the unnecessary
wait_until_shutdownmethod, which was almost exactly the same aswait_for_condition. - Slightly revamp how the async spinner works, so it doesn't need to call
rclcpp::shutdown. - Do a similar revamp for the custom async spinner in rosbag2_tests.
- Wait for topics to be discovered right after calling
recorder->record(). This ensures we can't get into the above situation.
With this series in place, the particular flake of test_record on rmw_cyclonedds_cpp is fixed (or, at least, I can no longer reproduce it). There are still other flakes that can happen under load, but I think fixes for those will have to come separately.
Pulls: ros2/rosbag2#1796 Gist: https://gist.githubusercontent.com/clalancette/77bebfb56ccf2c9d926ce6c8343b0773/raw/cb4f2fb2fb0fac274fbef12b77b13699d0b0e232/ros2.repos BUILD args: --packages-above-and-dependencies rosbag2_test_common rosbag2_tests rosbag2_transport TEST args: --packages-above rosbag2_test_common rosbag2_tests rosbag2_transport ROS Distro: rolling Job: ci_launcher ci_launcher ran: https://ci.ros2.org/job/ci_launcher/14483
I will take a look. Anyway, hold off from the merging this for a while. We are trying to make a relase on rolling https://github.com/ros2/rosbag2/pull/1791
- Redundant because in all places we already have either
ASSERT_TRUE(pub_manager.wait_for_matched(expecting_topic_name);orASSERT_TRUE(client_manager_1->wait_for_service_to_be_ready());. Which is making sure that recorder have already been subscribed to the topics on which we are going to publish.
The problem is that this is not always true. There are at least two situations this does not cover: for the event topic associated with services (/service_name/_events), and for the event topic associated with the recorder splitting functionality (/event/split_write).
pub_manager_wait_for_matched(expecting_topic_name) will only wait for the named topic, which is a topic that the pub_manager is managing. That does not apply to the /service_name/_events topic that services use for events.
Additionally, client_manager->wait_for_service_to_be_ready() only waits for the service topics to be ready, not the event topic. So again, it does not apply to the /service_name/_events topic that services use for events.
That's why adding in the additional waits for recorder subscriptions makes the tests more reliable, since it handles both of these situations.
If aforementioned
ASSERT_TRUE(pub_manager.wait_for_matched(expecting_topic_name);andASSERT_TRUE(client_manager_1->wait_for_service_to_be_ready());checks are not enough we have a better way to do a wait for the recorder discovery. In some unit tests we useMockRecorderwith extended utility methods to facilitate waiting functionality.
But fair enough, I'll take a look at switching to this.
All right, see 11bb864, which uses the MockRecorder methods instead.
Pulls: ros2/rosbag2#1796 Gist: https://gist.githubusercontent.com/clalancette/08cc831129805c683801a5ebf1f8b33a/raw/cb4f2fb2fb0fac274fbef12b77b13699d0b0e232/ros2.repos BUILD args: --packages-above-and-dependencies rosbag2_test_common rosbag2_tests rosbag2_transport TEST args: --packages-above rosbag2_test_common rosbag2_tests rosbag2_transport ROS Distro: rolling Job: ci_launcher ci_launcher ran: https://ci.ros2.org/job/ci_launcher/14511
Pulls: ros2/rosbag2#1796 Gist: https://gist.githubusercontent.com/clalancette/475d8bf4bfedcf19c2136589ee4451f9/raw/cb4f2fb2fb0fac274fbef12b77b13699d0b0e232/ros2.repos BUILD args: --packages-above-and-dependencies rosbag2_test_common rosbag2_tests rosbag2_transport TEST args: --packages-above rosbag2_test_common rosbag2_tests rosbag2_transport ROS Distro: rolling Job: ci_launcher ci_launcher ran: https://ci.ros2.org/job/ci_launcher/14514
I'm curious what a "surprise" failure with rmw_connextdds on CI is.
I'm curious what a "surprise" failure with rmw_connextdds on CI is.
Yeah, I don't know what is going on with that. The only change to that test is to wait for the clock topic to show up. I'll need to look deeper into it, though, since that seems to be a new failure with this PR.
All right, I think I figured it out. I think the problem is related to the switch to exec.spin_some(), and the ongoing bug in rclcpp: https://github.com/ros2/rclcpp/issues/2589
However, it turns out that we actually don't need to use spin_some here; using just spin, along with a cancel later on, is good enough. So commit eb8db9e does exactly that, and fixes the problems in my local testing. I'll run another CI here to confirm.
Pulls: ros2/rosbag2#1796 Gist: https://gist.githubusercontent.com/clalancette/72c7e3cff95fa4a9f17085364bd78781/raw/cb4f2fb2fb0fac274fbef12b77b13699d0b0e232/ros2.repos BUILD args: --packages-above-and-dependencies rosbag2_test_common rosbag2_tests rosbag2_transport TEST args: --packages-above rosbag2_test_common rosbag2_tests rosbag2_transport ROS Distro: rolling Job: ci_launcher ci_launcher ran: https://ci.ros2.org/job/ci_launcher/14517
Pulls: ros2/rosbag2#1796 Gist: https://gist.githubusercontent.com/clalancette/0e69722137e215ba0dee2c7dc96e1a6c/raw/cb4f2fb2fb0fac274fbef12b77b13699d0b0e232/ros2.repos BUILD args: --packages-above-and-dependencies rosbag2_test_common rosbag2_tests rosbag2_transport TEST args: --packages-above rosbag2_test_common rosbag2_tests rosbag2_transport ROS Distro: rolling Job: ci_launcher ci_launcher ran: https://ci.ros2.org/job/ci_launcher/14518
The only failures here are on Windows, and that has nothing to do with this PR (those will be fixed by https://github.com/ros2/common_interfaces/pull/256). So I think this is ready for final review now.
@clalancette Curious what was behind decision to move explict call for the
stop_spinning()to theTearDown()method?
So the first thing to realize is that because we are putting the spin call in a thread, we have to call stop_spinning() (which calls thread.join()), otherwise we get an exception from the std::thread destructor.
With that in mind, I could have used rcpputils::make_scope_exit to stop the spinning. However, by default that is actually no different than calling it in TearDown, since we wouldn't call that until the end of the function anyway.
I have a prototype locally where I did exactly what you said and added in new scopes, with the rcpputils::make_scope_exit, and that works fine. But it also ends up in a far larger change, because of those additional scopes and the reindentation of a lot of code. Let me know if you still want me to do that, and I can make the change.
As something of a compromise, 404c78a adds in rcpputils::make_scope_exit calls without changing the scopes. This isn't too different from putting it in TearDown (which I've left as a backup), but it would allow us to change scopes later. Let me know what you think.
Pulls: ros2/rosbag2#1796 Gist: https://gist.githubusercontent.com/clalancette/1986838f6b0c6a14c9b1af14a8340640/raw/cb4f2fb2fb0fac274fbef12b77b13699d0b0e232/ros2.repos BUILD args: --packages-above-and-dependencies rosbag2_test_common rosbag2_tests rosbag2_transport TEST args: --packages-above rosbag2_test_common rosbag2_tests rosbag2_transport ROS Distro: rolling Job: ci_launcher ci_launcher ran: https://ci.ros2.org/job/ci_launcher/14549
@ros-pull-request-builder retest this please
@clalancette I was able to reproduce freeze locally on Ubuntu by increasing discovery wait interval up to the 500mS
rosbag2_transport::RecordOptions record_options =
{true, false, false, {}, {}, {}, {}, {}, {}, "rmw_format", 500ms};
and repeatedly run only those RecordIntegrationTestFixture, test_keyboard_controls test.
I see in the debugger that freeze occurs in our modified stop_spinning() right on the spin_thread_.join();, exactly as you were suspecting.
It seems the problem is in the underlying SingleThreadedExecutor spin() and cancel() routines. It looks like cancel doen't really work as expected and we are not exiting from the spin().
Pulls: ros2/rosbag2#1796 Gist: https://gist.githubusercontent.com/clalancette/5271579611710ffed73686f516a6ad9e/raw/cb4f2fb2fb0fac274fbef12b77b13699d0b0e232/ros2.repos BUILD args: --packages-above-and-dependencies rosbag2_test_common rosbag2_tests rosbag2_transport TEST args: --packages-above rosbag2_test_common rosbag2_tests rosbag2_transport ROS Distro: rolling Job: ci_launcher ci_launcher ran: https://ci.ros2.org/job/ci_launcher/14555
Yep, this seems much happier now. Thanks again @MichaelOrlov , just need approval again then we can merge this in.
https://github.com/Mergifyio backport jazzy
backport jazzy
✅ Backports have been created
- #1806 Improve the reliability of rosbag2 tests (backport #1796) has been created for branch
jazzybut encountered conflicts