Flaky Test: rcl_action TestAction
Bug report
Required Info:
- Operating System:
  - Windows 10 Release
- Installation type:
  - Nightly Windows CI
- Version or commit hash:
  - As early as: https://ci.ros2.org/job/nightly_win_rel/1721/testReport/junit/(root)/projectroot/test_graph__rmw_connext_cpp/
- DDS implementation:
  - RTI Connext
- Client library (if applicable):
  - rcl_action
Steps to reproduce issue
Expected behavior
Test passes consistently
Actual behavior
Test does not pass consistently
Additional information
[ RUN ] TestActionGraphMultiNodeFixture.action_client_init_maybe_fail
C:\ci\ws\src\ros2\rcl\rcl_action\test\rcl_action\test_graph.cpp(359): error: Expected: (attempts) <= (max_attempts), actual: 5 vs 4
Unable to attain all required nodes
[ FAILED ] TestActionGraphMultiNodeFixture.action_client_init_maybe_fail (14432 ms)
[ RUN ] TestActionGraphMultiNodeFixture.rcl_get_server_names_and_types_by_node_maybe_fail
C:\ci\ws\src\ros2\rcl\rcl_action\test\rcl_action\test_graph.cpp(359): error: Expected: (attempts) <= (max_attempts), actual: 5 vs 4
Unable to attain all required nodes
[ FAILED ] TestActionGraphMultiNodeFixture.rcl_get_server_names_and_types_by_node_maybe_fail (14464 ms)
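For context, the failing check at test_graph.cpp:359 is the attempt budget of a wait-for-graph loop: the test repeatedly queries the graph until every node it expects is visible, and fails once it has retried more than max_attempts times. The sketch below only illustrates that pattern; it is not the actual rcl_action test code, and all_required_nodes_visible() is a placeholder for the real graph queries.

```cpp
#include <chrono>
#include <thread>

#include <gtest/gtest.h>

// Placeholder for the real check, which queries the ROS graph for the nodes
// the fixture created. Always returning false here simulates the flaky case
// where the nodes never become visible in time.
static bool all_required_nodes_visible()
{
  return false;
}

TEST(GraphWaitSketch, wait_for_required_nodes)
{
  const int max_attempts = 4;
  int attempts = 0;
  while (!all_required_nodes_visible()) {
    ++attempts;
    // Produces a failure like the one in the log once the budget is exceeded:
    // "Expected: (attempts) <= (max_attempts), actual: 5 vs 4"
    ASSERT_LE(attempts, max_attempts) << "Unable to attain all required nodes";
    std::this_thread::sleep_for(std::chrono::seconds(1));
  }
}
```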
This is one of the few remaining test failures in CI right now. As @mjcarroll said, the earliest indication of this is in https://ci.ros2.org/view/nightly/job/nightly_win_rel/1721/. Looking more closely at the test failure, it always happens while testing the rmw_connext version of the tests:
Test command: C:\Python38\python.exe "-u" "C:/ci/ws/install/share/ament_cmake_test/cmake/run_test.py" "C:/ci/ws/build/rcl_action/test_results/rcl_action/test_graph__rmw_connext_cpp.gtest.xml" "--package-name" "rcl_action" "--output-file" "C:/ci/ws/build/rcl_action/ament_cmake_gtest/test_graph__rmw_connext_cpp.txt" "--env" "RCL_ASSERT_RMW_ID_MATCHES=rmw_connext_cpp" "RMW_IMPLEMENTATION=rmw_connext_cpp" "--command" "C:/ci/ws/build/rcl_action/Release/test_graph__rmw_connext_cpp.exe" "--gtest_output=xml:C:/ci/ws/build/rcl_action/test_results/rcl_action/test_graph__rmw_connext_cpp.gtest.xml"
So I went back and looked at what got merged the day before build 1721 (October 13), with this query.
Of the 9 things that were merged on October 13, the one that sticks out to me the most is https://github.com/ros2/rmw_connext/pull/460. It is the only one that touches rmw_connext, which is consistent with the failure we are seeing here. The next step will be to see what could cause it to fail intermittently the way it currently does.
All right, I reverted #460 and tried another build.
So it definitely looks like that PR has something to do with it.
All right, I also ran an experimental job where I rewrote https://github.com/ros2/rmw_connext/blob/3a0baf5e92933ebe657d4edeb57e7981da5b60f9/rmw_connext_shared_cpp/src/init.cpp#L74 to not use std::call_once, but instead to run that initialization for every new node (or context; it is not entirely clear to me which). You can see the results of that job here:
Since that one also passed, I think it has something to do with that part of #460. @ivanpauno Could you maybe give a bit of background on why you changed the init there to be call_once?
Since that one also passed, I think it has something to do with that part of #460. @ivanpauno Could you maybe give a bit of background on why you changed the init there to be call_once?
That function is modifying some settings in the DomainParticipantFactory, which is a singleton object that lives "forever". Setting those options more than once doesn't make sense, and also some of the setters we're using aren't thread safe, so calling them inside a call_once lambda makes init thread safe.
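For reference, the pattern being discussed looks roughly like the following. This is a simplified sketch rather than the actual rmw_connext_shared_cpp code, and apply_factory_settings() is a stand-in for the real (non-thread-safe) setter calls on the DomainParticipantFactory; the experimental job above effectively replaced the std::call_once with a plain call on every init().

```cpp
#include <mutex>

// Stand-in for the real work: tweaking process-wide settings on the
// DomainParticipantFactory singleton. The actual setters are not thread safe.
static void apply_factory_settings()
{
  // ... modify factory-level settings here ...
}

void init()
{
  // The settings only need to be applied once per process, and running them
  // under std::call_once also serializes concurrent callers, so init() stays
  // thread safe even though the setters themselves are not.
  static std::once_flag once;
  std::call_once(once, []() {
    apply_factory_settings();
  });
}
```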
I'm really failing to see how the use of std::call_once can make this test more flaky; it seems just like an unfortunate coincidence to me.
I'm really failing to see how the use of std::call_once can make this test more flaky; it seems just like an unfortunate coincidence to me.
I really do think this is the problem, though. In the unmodified failing build above, I ran it with --retest-until-fail 2, and that was enough to cause it to fail. In the two succeeding tests (one with #460 completely reverted, and one just getting rid of std::call_once), I ran it with --retest-until-fail 10, and they both passed every time. It could be an extreme flake, but it really does seem to me that something about only initializing once is causing the problem. Maybe this is just exposing another bug, I'm not sure.
I really do think this is the problem, though. In the unmodified failing build above, I ran it with --retest-until-fail 2, and that was enough to cause it to fail. In the two succeeding tests (one with #460 completely reverted, and one just getting rid of std::call_once), I ran it with --retest-until-fail 10, and they both passed every time. It could be an extreme flake, but it really does seem to me that something about only initializing once is causing the problem. Maybe this is just exposing another bug, I'm not sure.
I see.
I'm more comfortable with first trying to extend the timeout of the test (the number of attempts here), and if that doesn't work I don't mind if the std::call_once is removed or the test is marked as xfail.
IMO, this may be a case of spooky action at a distance; maybe some of the API that init() is calling forces all participants of the participant factory to somehow "sync/update" their graphs (as a magic side effect :sparkles:), but I don't really think there's something wrong with using std::once here.
It would be good if someone else could check whether the usage of std::once is correct here, maybe @hidmic?
if the usage of std::once is correct here
Use of std::call_once seems correct to me too.
IMO, this may be a case of spooky action at a distance
Hmm, I wouldn't be surprised but that doesn't explain why it only manifests itself on Windows. It smells like an msvc bug to me. @clalancette is std::call_once actually calling the function?
Hmm, I wouldn't be surprised but that doesn't explain why it only manifests itself on Windows. It smells like an msvc bug to me. @clalancette is std::call_once actually calling the function?
I'm honestly not sure, I didn't instrument it. Unfortunately, I don't have an easy way to reproduce this at the moment. I'll leave this on my list to take a look at in the new year.
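If anyone does get a reproduction, one crude way to answer that question would be to instrument the once-body and count how often it actually runs. A hypothetical sketch (the counter, logging, and main() driver are purely for illustration, not part of the real rmw_connext code):

```cpp
#include <atomic>
#include <cstdio>
#include <mutex>
#include <thread>

static std::atomic<int> once_body_runs{0};

void init()
{
  static std::once_flag once;
  std::call_once(once, []() {
    // If std::call_once misbehaves, this either never prints (body not
    // invoked) or prints more than once per process (body invoked again).
    const int n = ++once_body_runs;
    std::fprintf(stderr, "init(): call_once body ran (invocation #%d)\n", n);
    // ... the original factory setup would go here ...
  });
}

int main()
{
  // Hammer init() from a few threads; the message should appear exactly once.
  std::thread t1(init), t2(init), t3(init);
  t1.join(); t2.join(); t3.join();
  return once_body_runs == 1 ? 0 : 1;
}
```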
Hmm, I wouldn't be surprised but that doesn't explain why it only manifests itself on Windows.
In CI, flaky tests are frequently more prone to failing on Windows (and with rmw_connext).
It smells like an msvc bug to me.
That sounds unlikely to me ...
In CI, flaky tests are frequently more prone to failing on Windows (and with rmw_connext).
For time-sensitive tests, yes, but this looks different (considering the above evidence of this working just fine if std::call_once is dropped).
That sounds unlikely to me ...
It could be a flaw in RTI Connext, as you suggest, but I'd be surprised. It seems simpler for DomainParticipantFactory to do things right than to do them wrong.
Closing, since with Foxy now being End-of-Life, this repository is no longer used.