demos
demos copied to clipboard
[demo_nodes_cpp] add_two_ints service is flakey
Bug report
Required Info:
- Operating System:
- OSX High Sierra
- Installation type:
- Binary
- Version or commit hash:
- Crystal (pre-release)
- DDS implementation:
- Fast-RTPS
- Client library (if applicable):
- rclcpp
Steps to reproduce issue
In one terminal start the server:
ros2 run demo_nodes_cpp add_two_ints_server
In another terminal repeatedly run the client:
ros2 run demo_nodes_cpp add_two_ints_client
The same behavior also occurs for the async client:
ros2 run demo_nodes_cpp add_two_ints_client_async
Expected behavior
Server receives clients request and client gets the response.
Actual behavior
Sometimes (~%30 of the time) the server never gets the request and the client hangs (does not return).
Additional information
Cannot reproduce with OpenSplice and Connext, which leads me to believe that this is a Fast-RTPS issue.
I was unable to reproduce this using Fast-RTPS a from-source build on Bionic
Using
for i in $(seq 1 200); do
printf "Run #$i:"
ros2 run demo_nodes_cpp add_two_ints_client
done
returned successfully all 200 times with both the regular and async client.
I ran the same test as @nuclearsandwich with Connext on Xenial. There, out of 200 runs, I got exactly 1 run where the client hung forever, and 1 run where the client had to try several times to discover the server (the server remained running the entire time).
59/200 where client hung. Fast-RTSP on High Sierra.
I was able to reproduce the issue on Windows fat binaries both with FastRTPS and Connext using this .bat file:
@echo off
FOR /L %%d IN (1,1,200) DO (
echo Run %%d
ros2 run demo_nodes_cpp add_two_ints_client
)
With FastRTPS it usually hang on the 5th - 8th iteration whereas with Connext it hang in the 15th - 17th
I haven't been able to repro this. I've tried:
- 1000 iterations on Fedora from Source using FastRTPS
- 200 iterations on 18.04 ARM64 from debs using FastRTPS or OpenSplice
- 200 iterations on 18.04 ARM64 from source using FastRTPS or OpenSplice
Also, running a Python server with a C++ client results in the same issue:
ros2 run demo_nodes_py add_two_ints_server
I was able to reproduce with Fast-RTPS on Xenial by also making the CPU busy during the test (e.g. building a workspace with colcon). The client remains hung even after the CPU usage lowers.
This sounds likely related ros2/rmw_fastrtps#239
FYI for a more reproducible stress test I recommend the stress tool. It's available as the deb stress
FYI for a more reproducible stress test I recommend the stress tool. It's available as the deb stress
Thanks, I discovered that soon after :+1:
Completely separately it's just been submitted as a rosdep key too: https://github.com/ros/rosdistro/pull/19870
Potentially related: https://github.com/ros2/rmw_fastrtps/pull/238#pullrequestreview-190880807
It appears that when the client hangs, it could be due to the number of requested subscribers being reported as zero at the rmw layer (even though there is a service server running). This doesn't explain the same issue with Connext though. Further investigation is required.
I'm not sure I follow the status of this thread. I see @nuclearsandwich said:
I was unable to reproduce this using Fast-RTPS a from-source build on Bionic
Do we have a table of which OS and rmw implementations have this problem?
The reason I ask is we see some behavior in the Navigation2 stack on Bionic that may be explained by this, it's not clear to me from this thread.
wait, now I see that https://github.com/ros2/examples/issues/228 is the equivalent for Bionic (18.04).
So what's the status? Is this bug root caused?
The reason I ask is we see some behavior in the Navigation2 stack on Bionic that may be explained by this, it's not clear to me from this thread.
Could you give some more details about your problem?
So what's the status? Is this bug root caused?
I don't think so. If you look carefully at the reports above, it looks like it mostly happens with Fast-RTPS (or, at least, it was mostly tested with Fast-RTPS). There also seems to be some reports of it with Connext.
There is some possibility that this problem will be improved by the discovery changes that went into FastRTPS 1.7.1 (specifically https://github.com/eProsima/Fast-RTPS/pull/411). But that is not a root cause analysis, and as it is we can't upgrade Crystal to FastRTPS 1.7.1 since it is ABI incompatible (though it is API compatible).
@mkhansen-intel There are a couple of things you could try:
- Try running with opensplice and see if the situation improves for you. That will at least point to whether it is in the underlying RTPS/DDS implementation or in ROS 2 proper.
- Compile FastRTPS 1.7.1 and try it with Nav2 to see if it improves things. Again, the results of this test will help to narrow down whether it is in Fast-RTPS or in ROS 2. Also, if this improves things, this will give us even more motivation to backport the FastRTPS 1.7.1 changes into Crystal in an ABI-compatible way.
This issue also appears frequently (usually within 20 iterations) with Opensplice on arm64v8 Ubuntu 18.04. Specifically using build from https://ci.ros2.org/view/packaging/job/packaging_linux-aarch64/824/.
Additionally, I see the same behavior more frequently (usually within 5 iterations) with other demos such as minimal_service/minimal_client
terminal 1:
RMW_IMPLEMENTATION=rmw_opensplice_cpp ros2 run examples_rclcpp_minimal_service service_main
terminal 2:
while :; do RMW_IMPLEMENTATION=rmw_opensplice_cpp ros2 run examples_rclcpp_minimal_client client_main; done