demos icon indicating copy to clipboard operation
demos copied to clipboard

[demo_nodes_cpp] add_two_ints service is flakey

Open jacobperron opened this issue 6 years ago • 15 comments

Bug report

Required Info:

  • Operating System:
    • OSX High Sierra
  • Installation type:
    • Binary
  • Version or commit hash:
    • Crystal (pre-release)
  • DDS implementation:
    • Fast-RTPS
  • Client library (if applicable):
    • rclcpp

Steps to reproduce issue

In one terminal start the server:

ros2 run demo_nodes_cpp add_two_ints_server

In another terminal repeatedly run the client:

ros2 run demo_nodes_cpp add_two_ints_client

The same behavior also occurs for the async client:

ros2 run demo_nodes_cpp add_two_ints_client_async

Expected behavior

Server receives clients request and client gets the response.

Actual behavior

Sometimes (~%30 of the time) the server never gets the request and the client hangs (does not return).

Additional information

Cannot reproduce with OpenSplice and Connext, which leads me to believe that this is a Fast-RTPS issue.

jacobperron avatar Dec 14 '18 19:12 jacobperron

I was unable to reproduce this using Fast-RTPS a from-source build on Bionic

Using

for i in $(seq 1 200); do
  printf "Run #$i:"
  ros2 run demo_nodes_cpp add_two_ints_client
done

returned successfully all 200 times with both the regular and async client.

nuclearsandwich avatar Dec 14 '18 19:12 nuclearsandwich

I ran the same test as @nuclearsandwich with Connext on Xenial. There, out of 200 runs, I got exactly 1 run where the client hung forever, and 1 run where the client had to try several times to discover the server (the server remained running the entire time).

clalancette avatar Dec 14 '18 19:12 clalancette

59/200 where client hung. Fast-RTSP on High Sierra.

jacobperron avatar Dec 14 '18 20:12 jacobperron

I was able to reproduce the issue on Windows fat binaries both with FastRTPS and Connext using this .bat file:

@echo off
FOR /L %%d IN (1,1,200) DO  (
    echo Run %%d
    ros2 run demo_nodes_cpp add_two_ints_client
)

With FastRTPS it usually hang on the 5th - 8th iteration whereas with Connext it hang in the 15th - 17th

apojomovsky avatar Dec 14 '18 20:12 apojomovsky

I haven't been able to repro this. I've tried:

  • 1000 iterations on Fedora from Source using FastRTPS
  • 200 iterations on 18.04 ARM64 from debs using FastRTPS or OpenSplice
  • 200 iterations on 18.04 ARM64 from source using FastRTPS or OpenSplice

cottsay avatar Dec 14 '18 21:12 cottsay

Also, running a Python server with a C++ client results in the same issue:

ros2 run demo_nodes_py add_two_ints_server

jacobperron avatar Dec 14 '18 23:12 jacobperron

I was able to reproduce with Fast-RTPS on Xenial by also making the CPU busy during the test (e.g. building a workspace with colcon). The client remains hung even after the CPU usage lowers.

This sounds likely related ros2/rmw_fastrtps#239

jacobperron avatar Jan 04 '19 01:01 jacobperron

FYI for a more reproducible stress test I recommend the stress tool. It's available as the deb stress

tfoote avatar Jan 07 '19 19:01 tfoote

FYI for a more reproducible stress test I recommend the stress tool. It's available as the deb stress

Thanks, I discovered that soon after :+1:

jacobperron avatar Jan 07 '19 19:01 jacobperron

Completely separately it's just been submitted as a rosdep key too: https://github.com/ros/rosdistro/pull/19870

tfoote avatar Jan 07 '19 20:01 tfoote

Potentially related: https://github.com/ros2/rmw_fastrtps/pull/238#pullrequestreview-190880807

It appears that when the client hangs, it could be due to the number of requested subscribers being reported as zero at the rmw layer (even though there is a service server running). This doesn't explain the same issue with Connext though. Further investigation is required.

jacobperron avatar Feb 15 '19 20:02 jacobperron

I'm not sure I follow the status of this thread. I see @nuclearsandwich said:

I was unable to reproduce this using Fast-RTPS a from-source build on Bionic

Do we have a table of which OS and rmw implementations have this problem?

The reason I ask is we see some behavior in the Navigation2 stack on Bionic that may be explained by this, it's not clear to me from this thread.

mkhansenbot avatar Mar 01 '19 02:03 mkhansenbot

wait, now I see that https://github.com/ros2/examples/issues/228 is the equivalent for Bionic (18.04).

So what's the status? Is this bug root caused?

mkhansenbot avatar Mar 01 '19 02:03 mkhansenbot

The reason I ask is we see some behavior in the Navigation2 stack on Bionic that may be explained by this, it's not clear to me from this thread.

Could you give some more details about your problem?

So what's the status? Is this bug root caused?

I don't think so. If you look carefully at the reports above, it looks like it mostly happens with Fast-RTPS (or, at least, it was mostly tested with Fast-RTPS). There also seems to be some reports of it with Connext.

There is some possibility that this problem will be improved by the discovery changes that went into FastRTPS 1.7.1 (specifically https://github.com/eProsima/Fast-RTPS/pull/411). But that is not a root cause analysis, and as it is we can't upgrade Crystal to FastRTPS 1.7.1 since it is ABI incompatible (though it is API compatible).

@mkhansen-intel There are a couple of things you could try:

  1. Try running with opensplice and see if the situation improves for you. That will at least point to whether it is in the underlying RTPS/DDS implementation or in ROS 2 proper.
  2. Compile FastRTPS 1.7.1 and try it with Nav2 to see if it improves things. Again, the results of this test will help to narrow down whether it is in Fast-RTPS or in ROS 2. Also, if this improves things, this will give us even more motivation to backport the FastRTPS 1.7.1 changes into Crystal in an ABI-compatible way.

clalancette avatar Mar 01 '19 13:03 clalancette

This issue also appears frequently (usually within 20 iterations) with Opensplice on arm64v8 Ubuntu 18.04. Specifically using build from https://ci.ros2.org/view/packaging/job/packaging_linux-aarch64/824/.

Additionally, I see the same behavior more frequently (usually within 5 iterations) with other demos such as minimal_service/minimal_client terminal 1: RMW_IMPLEMENTATION=rmw_opensplice_cpp ros2 run examples_rclcpp_minimal_service service_main terminal 2: while :; do RMW_IMPLEMENTATION=rmw_opensplice_cpp ros2 run examples_rclcpp_minimal_client client_main; done

cevans87 avatar May 30 '19 23:05 cevans87