rosbag2 icon indicating copy to clipboard operation
rosbag2 copied to clipboard

Improve the reliability of rosbag2 tests

Open clalancette opened this issue 1 year ago • 15 comments

I initially started this series trying to track down a rare failing flakey test with test_record__rmw_cyclonedds_cpp. That particular flake seems to be able to happen because sometimes discovery takes longer than we expect, and it is possible that the tests "miss" the first publication. If that's the case, then the rest of the test may fail because it is expecting a certai number of messages. Along the way, we cleanup the tests a bit:

  1. Remove the unnecessary wait_until_shutdown method, which was almost exactly the same as wait_for_condition.
  2. Slightly revamp how the async spinner works, so it doesn't need to call rclcpp::shutdown.
  3. Do a similar revamp for the custom async spinner in rosbag2_tests.
  4. Wait for topics to be discovered right after calling recorder->record(). This ensures we can't get into the above situation.

With this series in place, the particular flake of test_record on rmw_cyclonedds_cpp is fixed (or, at least, I can no longer reproduce it). There are still other flakes that can happen under load, but I think fixes for those will have to come separately.

clalancette avatar Aug 28 '24 19:08 clalancette

Pulls: ros2/rosbag2#1796 Gist: https://gist.githubusercontent.com/clalancette/77bebfb56ccf2c9d926ce6c8343b0773/raw/cb4f2fb2fb0fac274fbef12b77b13699d0b0e232/ros2.repos BUILD args: --packages-above-and-dependencies rosbag2_test_common rosbag2_tests rosbag2_transport TEST args: --packages-above rosbag2_test_common rosbag2_tests rosbag2_transport ROS Distro: rolling Job: ci_launcher ci_launcher ran: https://ci.ros2.org/job/ci_launcher/14483

  • Linux Build Status
  • Linux-aarch64 Build Status
  • Linux-rhel Build Status
  • Windows Build Status

clalancette avatar Aug 28 '24 19:08 clalancette

I will take a look. Anyway, hold off from the merging this for a while. We are trying to make a relase on rolling https://github.com/ros2/rosbag2/pull/1791

MichaelOrlov avatar Sep 03 '24 19:09 MichaelOrlov

  • Redundant because in all places we already have either ASSERT_TRUE(pub_manager.wait_for_matched(expecting_topic_name); or ASSERT_TRUE(client_manager_1->wait_for_service_to_be_ready());. Which is making sure that recorder have already been subscribed to the topics on which we are going to publish.

The problem is that this is not always true. There are at least two situations this does not cover: for the event topic associated with services (/service_name/_events), and for the event topic associated with the recorder splitting functionality (/event/split_write).

pub_manager_wait_for_matched(expecting_topic_name) will only wait for the named topic, which is a topic that the pub_manager is managing. That does not apply to the /service_name/_events topic that services use for events.

Additionally, client_manager->wait_for_service_to_be_ready() only waits for the service topics to be ready, not the event topic. So again, it does not apply to the /service_name/_events topic that services use for events.

That's why adding in the additional waits for recorder subscriptions makes the tests more reliable, since it handles both of these situations.

If aforementioned ASSERT_TRUE(pub_manager.wait_for_matched(expecting_topic_name); and ASSERT_TRUE(client_manager_1->wait_for_service_to_be_ready()); checks are not enough we have a better way to do a wait for the recorder discovery. In some unit tests we use MockRecorder with extended utility methods to facilitate waiting functionality.

But fair enough, I'll take a look at switching to this.

clalancette avatar Sep 04 '24 15:09 clalancette

All right, see 11bb864, which uses the MockRecorder methods instead.

clalancette avatar Sep 04 '24 16:09 clalancette

Pulls: ros2/rosbag2#1796 Gist: https://gist.githubusercontent.com/clalancette/08cc831129805c683801a5ebf1f8b33a/raw/cb4f2fb2fb0fac274fbef12b77b13699d0b0e232/ros2.repos BUILD args: --packages-above-and-dependencies rosbag2_test_common rosbag2_tests rosbag2_transport TEST args: --packages-above rosbag2_test_common rosbag2_tests rosbag2_transport ROS Distro: rolling Job: ci_launcher ci_launcher ran: https://ci.ros2.org/job/ci_launcher/14511

  • Linux Build Status
  • Linux-aarch64 Build Status
  • Linux-rhel Build Status
  • Windows Build Status

clalancette avatar Sep 04 '24 16:09 clalancette

Pulls: ros2/rosbag2#1796 Gist: https://gist.githubusercontent.com/clalancette/475d8bf4bfedcf19c2136589ee4451f9/raw/cb4f2fb2fb0fac274fbef12b77b13699d0b0e232/ros2.repos BUILD args: --packages-above-and-dependencies rosbag2_test_common rosbag2_tests rosbag2_transport TEST args: --packages-above rosbag2_test_common rosbag2_tests rosbag2_transport ROS Distro: rolling Job: ci_launcher ci_launcher ran: https://ci.ros2.org/job/ci_launcher/14514

  • Linux Build Status
  • Linux-aarch64 Build Status
  • Linux-rhel Build Status
  • Windows Build Status

clalancette avatar Sep 04 '24 20:09 clalancette

I'm curious what a "surprise" failure with rmw_connextdds on CI is.

MichaelOrlov avatar Sep 05 '24 01:09 MichaelOrlov

I'm curious what a "surprise" failure with rmw_connextdds on CI is.

Yeah, I don't know what is going on with that. The only change to that test is to wait for the clock topic to show up. I'll need to look deeper into it, though, since that seems to be a new failure with this PR.

clalancette avatar Sep 05 '24 12:09 clalancette

All right, I think I figured it out. I think the problem is related to the switch to exec.spin_some(), and the ongoing bug in rclcpp: https://github.com/ros2/rclcpp/issues/2589

However, it turns out that we actually don't need to use spin_some here; using just spin, along with a cancel later on, is good enough. So commit eb8db9e does exactly that, and fixes the problems in my local testing. I'll run another CI here to confirm.

clalancette avatar Sep 05 '24 13:09 clalancette

Pulls: ros2/rosbag2#1796 Gist: https://gist.githubusercontent.com/clalancette/72c7e3cff95fa4a9f17085364bd78781/raw/cb4f2fb2fb0fac274fbef12b77b13699d0b0e232/ros2.repos BUILD args: --packages-above-and-dependencies rosbag2_test_common rosbag2_tests rosbag2_transport TEST args: --packages-above rosbag2_test_common rosbag2_tests rosbag2_transport ROS Distro: rolling Job: ci_launcher ci_launcher ran: https://ci.ros2.org/job/ci_launcher/14517

  • Linux Build Status
  • Linux-aarch64 Build Status
  • Linux-rhel Build Status
  • Windows Build Status

clalancette avatar Sep 05 '24 13:09 clalancette

Pulls: ros2/rosbag2#1796 Gist: https://gist.githubusercontent.com/clalancette/0e69722137e215ba0dee2c7dc96e1a6c/raw/cb4f2fb2fb0fac274fbef12b77b13699d0b0e232/ros2.repos BUILD args: --packages-above-and-dependencies rosbag2_test_common rosbag2_tests rosbag2_transport TEST args: --packages-above rosbag2_test_common rosbag2_tests rosbag2_transport ROS Distro: rolling Job: ci_launcher ci_launcher ran: https://ci.ros2.org/job/ci_launcher/14518

  • Linux Build Status
  • Linux-aarch64 Build Status
  • Linux-rhel Build Status
  • Windows Build Status

clalancette avatar Sep 05 '24 15:09 clalancette

The only failures here are on Windows, and that has nothing to do with this PR (those will be fixed by https://github.com/ros2/common_interfaces/pull/256). So I think this is ready for final review now.

clalancette avatar Sep 05 '24 18:09 clalancette

@clalancette Curious what was behind decision to move explict call for the stop_spinning() to the TearDown() method?

So the first thing to realize is that because we are putting the spin call in a thread, we have to call stop_spinning() (which calls thread.join()), otherwise we get an exception from the std::thread destructor.

With that in mind, I could have used rcpputils::make_scope_exit to stop the spinning. However, by default that is actually no different than calling it in TearDown, since we wouldn't call that until the end of the function anyway.

I have a prototype locally where I did exactly what you said and added in new scopes, with the rcpputils::make_scope_exit, and that works fine. But it also ends up in a far larger change, because of those additional scopes and the reindentation of a lot of code. Let me know if you still want me to do that, and I can make the change.

clalancette avatar Sep 06 '24 12:09 clalancette

As something of a compromise, 404c78a adds in rcpputils::make_scope_exit calls without changing the scopes. This isn't too different from putting it in TearDown (which I've left as a backup), but it would allow us to change scopes later. Let me know what you think.

clalancette avatar Sep 06 '24 17:09 clalancette

Pulls: ros2/rosbag2#1796 Gist: https://gist.githubusercontent.com/clalancette/1986838f6b0c6a14c9b1af14a8340640/raw/cb4f2fb2fb0fac274fbef12b77b13699d0b0e232/ros2.repos BUILD args: --packages-above-and-dependencies rosbag2_test_common rosbag2_tests rosbag2_transport TEST args: --packages-above rosbag2_test_common rosbag2_tests rosbag2_transport ROS Distro: rolling Job: ci_launcher ci_launcher ran: https://ci.ros2.org/job/ci_launcher/14549

  • Linux Build Status
  • Linux-aarch64 Build Status
  • Linux-rhel Build Status
  • Windows Build Status

clalancette avatar Sep 10 '24 16:09 clalancette

@ros-pull-request-builder retest this please

clalancette avatar Sep 11 '24 14:09 clalancette

@clalancette I was able to reproduce freeze locally on Ubuntu by increasing discovery wait interval up to the 500mS

rosbag2_transport::RecordOptions record_options =
  {true, false, false, {}, {}, {}, {}, {}, {}, "rmw_format", 500ms};

and repeatedly run only those RecordIntegrationTestFixture, test_keyboard_controls test.

I see in the debugger that freeze occurs in our modified stop_spinning() right on the spin_thread_.join();, exactly as you were suspecting.

It seems the problem is in the underlying SingleThreadedExecutor spin() and cancel() routines. It looks like cancel doen't really work as expected and we are not exiting from the spin().

MichaelOrlov avatar Sep 12 '24 01:09 MichaelOrlov

Pulls: ros2/rosbag2#1796 Gist: https://gist.githubusercontent.com/clalancette/5271579611710ffed73686f516a6ad9e/raw/cb4f2fb2fb0fac274fbef12b77b13699d0b0e232/ros2.repos BUILD args: --packages-above-and-dependencies rosbag2_test_common rosbag2_tests rosbag2_transport TEST args: --packages-above rosbag2_test_common rosbag2_tests rosbag2_transport ROS Distro: rolling Job: ci_launcher ci_launcher ran: https://ci.ros2.org/job/ci_launcher/14555

  • Linux Build Status
  • Linux-aarch64 Build Status
  • Linux-rhel Build Status
  • Windows Build Status

clalancette avatar Sep 12 '24 17:09 clalancette

Yep, this seems much happier now. Thanks again @MichaelOrlov , just need approval again then we can merge this in.

clalancette avatar Sep 12 '24 18:09 clalancette

https://github.com/Mergifyio backport jazzy

MichaelOrlov avatar Sep 12 '24 20:09 MichaelOrlov

backport jazzy

✅ Backports have been created

mergify[bot] avatar Sep 12 '24 20:09 mergify[bot]