Flaky Test: rcl_action TestAction
Bug report
Required Info:
- Operating System:
  - Windows 10 Release
- Installation type:
  - Nightly Windows CI
- Version or commit hash:
  - As early as: https://ci.ros2.org/job/nightly_win_rel/1721/testReport/junit/(root)/projectroot/test_graph__rmw_connext_cpp/
- DDS implementation:
  - RTI Connext
- Client library (if applicable):
  - rcl_action
Steps to reproduce issue
Expected behavior
Test passes consistently
Actual behavior
Test does not pass consistently
Additional information
[ RUN ] TestActionGraphMultiNodeFixture.action_client_init_maybe_fail
C:\ci\ws\src\ros2\rcl\rcl_action\test\rcl_action\test_graph.cpp(359): error: Expected: (attempts) <= (max_attempts), actual: 5 vs 4
Unable to attain all required nodes
[ FAILED ] TestActionGraphMultiNodeFixture.action_client_init_maybe_fail (14432 ms)
[ RUN ] TestActionGraphMultiNodeFixture.rcl_get_server_names_and_types_by_node_maybe_fail
C:\ci\ws\src\ros2\rcl\rcl_action\test\rcl_action\test_graph.cpp(359): error: Expected: (attempts) <= (max_attempts), actual: 5 vs 4
Unable to attain all required nodes
[ FAILED ] TestActionGraphMultiNodeFixture.rcl_get_server_names_and_types_by_node_maybe_fail (14464 ms)
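For context, the failing check at test_graph.cpp:359 is the attempt budget of a wait-for-graph loop: the test repeatedly queries the graph until every node it expects is visible, and fails once it has retried more than max_attempts times. The sketch below only illustrates that pattern; it is not the actual rcl_action test code, and all_required_nodes_visible() is a placeholder for the real graph queries.

```cpp
#include <chrono>
#include <thread>

#include <gtest/gtest.h>

// Placeholder for the real check, which queries the ROS graph for the nodes
// the fixture created. Always returning false here simulates the flaky case
// where the nodes never become visible in time.
static bool all_required_nodes_visible()
{
  return false;
}

TEST(GraphWaitSketch, wait_for_required_nodes)
{
  const int max_attempts = 4;
  int attempts = 0;
  while (!all_required_nodes_visible()) {
    ++attempts;
    // Produces a failure like the one in the log once the budget is exceeded:
    // "Expected: (attempts) <= (max_attempts), actual: 5 vs 4"
    ASSERT_LE(attempts, max_attempts) << "Unable to attain all required nodes";
    std::this_thread::sleep_for(std::chrono::seconds(1));
  }
}
```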
This is one of the few remaining test failures in CI right now. As @mjcarroll said, the earliest indication of this is in https://ci.ros2.org/view/nightly/job/nightly_win_rel/1721/. Looking more closely at the test failure, it always happens while testing the rmw_connext version of the tests:
Test command: C:\Python38\python.exe "-u" "C:/ci/ws/install/share/ament_cmake_test/cmake/run_test.py" "C:/ci/ws/build/rcl_action/test_results/rcl_action/test_graph__rmw_connext_cpp.gtest.xml" "--package-name" "rcl_action" "--output-file" "C:/ci/ws/build/rcl_action/ament_cmake_gtest/test_graph__rmw_connext_cpp.txt" "--env" "RCL_ASSERT_RMW_ID_MATCHES=rmw_connext_cpp" "RMW_IMPLEMENTATION=rmw_connext_cpp" "--command" "C:/ci/ws/build/rcl_action/Release/test_graph__rmw_connext_cpp.exe" "--gtest_output=xml:C:/ci/ws/build/rcl_action/test_results/rcl_action/test_graph__rmw_connext_cpp.gtest.xml"
So I went back and looked at what got merged the day before build 1721 (October 13), with this query.
Of the 9 things that were merged on October 13, the one that sticks out to me the most is https://github.com/ros2/rmw_connext/pull/460. It is the only one that touches rmw_connext, which is consistent with the failure we are seeing here. The next step will be to see what could cause it to fail intermittently the way it currently does.
All right, I reverted #460 and tried another build.
So it definitely looks like that PR has something to do with it.
All right, I also ran an experimental job where I rewrote https://github.com/ros2/rmw_connext/blob/3a0baf5e92933ebe657d4edeb57e7981da5b60f9/rmw_connext_shared_cpp/src/init.cpp#L74 to not use std::call_once, but instead to run that initialization for every new node (or context; it is not entirely clear to me which). You can see the results of that job here:
Since that one also passed, I think it has something to do with that part of #460. @ivanpauno Could you maybe give a bit of background on why you changed the init there to be call_once?
Since that one also passed, I think it has something to do with that part of #460. @ivanpauno Could you maybe give a bit of background on why you changed the init there to be call_once?
That function is modifying some settings in the DomainParticipantFactory, which is a singleton object that lives "forever". Setting those options more than once doesn't make sense, and also some of the setters we're using aren't thread safe, so calling them inside a call_once lambda makes init thread safe.
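For reference, the pattern being discussed looks roughly like the following. This is a simplified sketch rather than the actual rmw_connext_shared_cpp code, and apply_factory_settings() is a stand-in for the real (non-thread-safe) setter calls on the DomainParticipantFactory; the experimental job above effectively replaced the std::call_once with a plain call on every init().

```cpp
#include <mutex>

// Stand-in for the real work: tweaking process-wide settings on the
// DomainParticipantFactory singleton. The actual setters are not thread safe.
static void apply_factory_settings()
{
  // ... modify factory-level settings here ...
}

void init()
{
  // The settings only need to be applied once per process, and running them
  // under std::call_once also serializes concurrent callers, so init() stays
  // thread safe even though the setters themselves are not.
  static std::once_flag once;
  std::call_once(once, []() {
    apply_factory_settings();
  });
}
```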
I'm really failing to see how the use of std::call_once can make this test more flaky; it seems just like an unfortunate coincidence to me.
I'm really failing to see how the use of std::call_once can make this test more flaky; it seems just like an unfortunate coincidence to me.
I really do think this is the problem, though. In the unmodified failing build above, I ran it with --retest-until-fail 2, and that was enough to cause it to fail. In the two succeeding tests (one with #460 completely reverted, and one just getting rid of std::call_once), I ran it with --retest-until-fail 10, and they both passed every time. It could be an extreme flake, but it really does seem to me that something about only initializing once is causing the problem. Maybe this is just exposing another bug, I'm not sure.
I really do think this is the problem, though. In the unmodified failing build above, I ran it with --retest-until-fail 2, and that was enough to cause it to fail. In the two succeeding tests (one with #460 completely reverted, and one just getting rid of std::call_once), I ran it with --retest-until-fail 10, and they both passed every time. It could be an extreme flake, but it really does seem to me that something about only initializing once is causing the problem. Maybe this is just exposing another bug, I'm not sure.
I see.
I'm more comfortable with first trying to extend the timeout of the test (the number of attempts here), and if that doesn't work I don't mind if the std::call_once is removed or the test is marked as xfail.
IMO, this may be a case of spooky action at a distance; maybe some of the API that init() is calling forces all participants of the participant factory to somehow "sync/update" their graphs (as a magic side effect :sparkles:), but I don't really think there's something wrong with using std::once here.
It would be good if someone else could check whether the usage of std::once is correct here, maybe @hidmic?
if the usage of std::once is correct here
Use of std::call_once seems correct to me too.
IMO, this may be a case of spooky action at a distance
Hmm, I wouldn't be surprised but that doesn't explain why it only manifests itself on Windows. It smells like an msvc bug to me. @clalancette is std::call_once actually calling the function?
Hmm, I wouldn't be surprised but that doesn't explain why it only manifests itself on Windows. It smells like an msvc bug to me. @clalancette is std::call_once actually calling the function?
I'm honestly not sure, I didn't instrument it. Unfortunately, I don't have an easy way to reproduce this at the moment. I'll leave this on my list to take a look at in the new year.
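If anyone does get a reproduction, one crude way to answer that question would be to instrument the once-body and count how often it actually runs. A hypothetical sketch (the counter, logging, and main() driver are purely for illustration, not part of the real rmw_connext code):

```cpp
#include <atomic>
#include <cstdio>
#include <mutex>
#include <thread>

static std::atomic<int> once_body_runs{0};

void init()
{
  static std::once_flag once;
  std::call_once(once, []() {
    // If std::call_once misbehaves, this either never prints (body not
    // invoked) or prints more than once per process (body invoked again).
    const int n = ++once_body_runs;
    std::fprintf(stderr, "init(): call_once body ran (invocation #%d)\n", n);
    // ... the original factory setup would go here ...
  });
}

int main()
{
  // Hammer init() from a few threads; the message should appear exactly once.
  std::thread t1(init), t2(init), t3(init);
  t1.join(); t2.join(); t3.join();
  return once_body_runs == 1 ? 0 : 1;
}
```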
Hmm, I wouldn't be surprised but that doesn't explain why it only manifests itself on Windows.
In CI, flaky tests are frequently more prone to failing on Windows (and with rmw_connext).
It smells like an msvc bug to me.
That sounds unlikely to me ...
In CI, flaky tests are frequently more prone to failing on Windows (and with rmw_connext).
For time-sensitive tests, yes, but this looks different (considering the above evidence of this working just fine if std::call_once is dropped).
That sounds unlikely to me ...
It could be a flaw in RTI Connext, as you suggest, but I'd be surprised. It seems simpler for DomainParticipantFactory to do things right than to do them wrong.
Closing, since with Foxy now being End-of-Life, this repository is no longer used.