rmw_connext
race condition in graph changes and service is available
I noticed this when debugging the flaky test in rcl called test_rcl_service_server_is_available, which is in the rcl/test/rcl/test_graph.cpp file:
https://github.com/ros2/rcl/blob/db1353008bff40e87338c95fb46bcb4b85c970d6/rcl/test/rcl/test_graph.cpp#L477
The race seems to be between the graph guard condition being triggered (and waiting wait sets being woken up):
https://github.com/ros2/rcl/blob/db1353008bff40e87338c95fb46bcb4b85c970d6/rcl/test/rcl/test_graph.cpp#L523
And the rcl_service_server_is_available function reporting that a service that was previously available is no longer available:
https://github.com/ros2/rcl/blob/db1353008bff40e87338c95fb46bcb4b85c970d6/rcl/test/rcl/test_graph.cpp#L542
Normally the test only checks availability when a change occurs in the graph, but with Connext that caused the test to fail intermittently. So I added a special case for Connext that checks on every loop iteration, regardless of whether a graph change was detected:
https://github.com/ros2/rcl/blob/db1353008bff40e87338c95fb46bcb4b85c970d6/rcl/test/rcl/test_graph.cpp#L525-L538
The rcl_service_server_is_available function normally reported the right state on the next loop iteration. This special case for Connext should be removed after this is fixed.
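To make the workaround concrete, here is a minimal, self-contained sketch of the loop logic. It is not the code from test_graph.cpp. The function wait_for_availability, its parameters, and the two callback types are hypothetical names made up for illustration; in the real test their roles are played by waiting on the graph guard condition and by calling rcl_service_server_is_available. The sketch assumes the race shows up as one stale availability result right after the graph change wakes the loop.

```cpp
#include <cassert>
#include <functional>

// Hypothetical stand-ins, not rcl API:
// GraphChangeWaiter returns true if a graph change woke the loop.
// AvailabilityQuery returns the availability currently being reported.
using GraphChangeWaiter = std::function<bool()>;
using AvailabilityQuery = std::function<bool()>;

// Loops until the reported availability equals `expected` or until
// `max_iterations` is used up. Returns true if the expected state was seen.
//
// check_every_iteration == false: query only after a graph change
//                                 (the normal path).
// check_every_iteration == true:  query on every iteration
//                                 (the workaround path).
bool wait_for_availability(
  const GraphChangeWaiter & wait_for_graph_change,
  const AvailabilityQuery & is_available,
  bool expected,
  int max_iterations,
  bool check_every_iteration)
{
  assert(max_iterations > 0);
  for (int i = 0; i < max_iterations; ++i) {
    const bool graph_changed = wait_for_graph_change();
    if (!graph_changed && !check_every_iteration) {
      continue;  // normal path: no graph change, so skip the query
    }
    if (is_available() == expected) {
      return true;
    }
  }
  return false;
}
```

If the only graph change arrives while the availability query is still stale, the normal path never queries again and gives up, while the workaround path picks up the correct state on a later iteration. The cost is an extra query per loop, which is why it is meant as a temporary special case.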
This could be caused by graph changes getting combined through some sort of event coalescing, or it could be a delay introduced by Connext; I'm not sure yet. For now I've decided to work around and document the issue rather than solve it.