rmw_fastrtps
Improve service discovery
Feature request
Feature description
Follow-up issue after #390. Prior to that PR, services' discovery relied on built-in request/reply topic discovery, which can and did lead to races (see https://github.com/ros2/ros2/issues/922). A short-term workaround was then introduced by #390, but not without caveats, i.e. the blocking behavior of rmw_send_response and the non-standard (ab)use of sample identity data.
The OMG DDS-RPC 1.0 spec describes both the issues associated with independent discovery of request/reply topics (see Basic Service Mapping under section 7.6.1) and an algorithm to prevent them (see Enhanced Service Mapping under section 7.6.2). The services implementation in rmw_fastrtps*_cpp could and probably should move in that direction.
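To illustrate the race for readers landing here: a client can consider the service available as soon as the request topic is matched, while the response path is not yet matched, so the reply to an early request is silently dropped. Below is a minimal rclcpp sketch of how this surfaces; the service name and type are placeholders, and whether the race actually triggers depends on timing and on the rmw implementation in use.

#include <chrono>
#include <memory>

#include "rclcpp/rclcpp.hpp"
#include "example_interfaces/srv/add_two_ints.hpp"

using namespace std::chrono_literals;

int main(int argc, char ** argv)
{
  rclcpp::init(argc, argv);
  auto node = rclcpp::Node::make_shared("race_demo_client");
  auto client = node->create_client<example_interfaces::srv::AddTwoInts>("add_two_ints");

  // wait_for_service() returning true only means the server's request reader
  // was discovered; the server's response writer may not have matched our
  // response reader yet, so an immediate request can lose its reply.
  while (!client->wait_for_service(1s)) {
    if (!rclcpp::ok()) {
      return 1;
    }
    RCLCPP_INFO(node->get_logger(), "service not available, waiting again...");
  }

  auto request = std::make_shared<example_interfaces::srv::AddTwoInts::Request>();
  request->a = 2;
  request->b = 3;
  auto future = client->async_send_request(request);

  // With the race, this can time out even though the server processed the
  // request, because the response sample never reaches our reader.
  if (rclcpp::spin_until_future_complete(node, future, 5s) ==
    rclcpp::FutureReturnCode::SUCCESS)
  {
    RCLCPP_INFO(node->get_logger(), "sum: %ld", static_cast<long>(future.get()->sum));
  } else {
    RCLCPP_WARN(node->get_logger(), "no response received (possible discovery race)");
  }

  rclcpp::shutdown();
  return 0;
}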
@richiware Friendly FYI. Does eProsima have any plans to work on this one?
We were talking about this this morning. We have in mind to deeply support the DDS-RPC standard in Fast DDS in the future, but right now the service discovery issue will be addressed by the #390 solution. There are other features/issues that need our focus.
@richiware #390 is not a solution; it is a workaround at best, and seemingly not an effective one for all cases (see https://github.com/ros2/rclcpp/issues/1152 and https://github.com/ros2/ros2/issues/931). To the extent that it is possible, please do consider addressing this issue at some point in the near future.
I agree with @hidmic that #390 is a workaround.
Though a full implementation of the DDS-RPC standard would be great, that's a lot of effort.
Support of an enhanced discovery mechanism to avoid the race condition should be enough to solve the problem.
We've added your suggestion on our roadmap.
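For context on what such an enhanced discovery mechanism implies: a client should not consider a server available, nor fire a request, until both directions of the service are matched. Below is a rough sketch of that check; the names are hypothetical and illustrative only, not the rmw_fastrtps API.

#include <cstddef>

// Hypothetical bookkeeping a client could keep per service; these counters
// would be updated from the matched-status callbacks of the underlying
// request writer and response reader (illustrative names, not an actual API).
struct ClientEndpointStatus
{
  std::size_t matched_request_readers = 0;   // servers that will see our requests
  std::size_t matched_response_writers = 0;  // servers whose replies we will see
};

// Enhanced-mapping idea: only report the server as available once both the
// request path and the response path are matched, so a request cannot be
// answered on a writer that has no route back to this client yet.
inline bool server_is_fully_matched(const ClientEndpointStatus & status)
{
  return status.matched_request_readers > 0 &&
         status.matched_response_writers > 0;
}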
To summarize the navigation-related issues from https://github.com/ros2/ros2/issues/931 reported by @daisukes:
Upgrading from 1.x.x to 2.x.x results in services failing to function (<5% successful calls), which is a safety issue with Fast DDS. Rolling back fixes this.
@richiware With the report mentioned in the previous comment, I think there is a clear regression which should be addressed ASAP.
@MiguelCompany could you help in this?
@JaimeMartin can we get an update on this? This is a safety critical issue and we may have to switch primary support and instructions for Navigation2 to Cyclone if we can't reliably call services in Foxy. It could genuinely cause complete system failure.
@SteveMacenski Looking at ros2/ros2#931 I see all the analysis was done before #390 was merged. I know it is a workaround and it has not made its way to a Foxy patch release (it is only in master), but have you checked if it improves the behavior?
Regarding the commit that seems to be responsible: it is the merge commit of eProsima/Fast-DDS#1190. Among other things, that PR fixes a wrong behavior with volatile late-joiners. Before the PR, volatile late-joiners were (incorrectly) receiving old samples in some cases, and fixing that to be compliant with the RTPS standard makes the service discovery race more likely to happen. The commits affecting volatile late-joiners are eProsima/Fast-DDS@a1c35b5bf2f991e381770f24c3131184efce4c39 and eProsima/Fast-DDS@30f8f6f4fff54c99244b01d3d304f0edd11a0aad.
Nevertheless, we have just started analyzing ros-planning/navigation2#1772 and, at the same time, designing a proper service discovery on rmw_fastrtps.
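A terminology note on the comment above: a "volatile late-joiner" is a reader with VOLATILE durability that matches a writer after samples were already published; per the standard it must not receive those earlier samples, whereas a TRANSIENT_LOCAL reader paired with a TRANSIENT_LOCAL writer would. A small rclcpp sketch of the two settings follows; topic name and type are placeholders.

#include "rclcpp/rclcpp.hpp"
#include "std_msgs/msg/string.hpp"

// A late-joining subscription with VOLATILE durability must not receive
// samples published before it matched the writer; with TRANSIENT_LOCAL
// durability (on both writer and reader) it receives the writer's cached history.
rclcpp::Subscription<std_msgs::msg::String>::SharedPtr
create_late_joiner(const rclcpp::Node::SharedPtr & node, bool transient_local)
{
  rclcpp::QoS qos(rclcpp::KeepLast(10));
  if (transient_local) {
    qos.transient_local();      // late-joiner gets previously published samples
  } else {
    qos.durability_volatile();  // late-joiner only gets samples published from now on
  }

  auto logger = node->get_logger();
  return node->create_subscription<std_msgs::msg::String>(
    "chatter", qos,
    [logger](const std_msgs::msg::String::SharedPtr msg) {
      RCLCPP_INFO(logger, "received: %s", msg->data.c_str());
    });
}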
We also had to drop some unit tests because this issue caused them to fail with Fast-RTPS only (https://github.com/ros-planning/navigation2/pull/1856), and there are reports of various elements of navigation2 not working (e.g. recovery behaviors) due to services not working.
Have you released this update so that it will be in ROS2 foxy after the sync Friday (https://discourse.ros.org/t/preparing-for-foxy-sync-and-patch-release-2020-07-10/15265)?
The workaround in #390 dropped off my radar. I've added it to the project board and will make sure it gets into the next Foxy sync.
@dirk-thomas @jacobperron As you can see here, the issue has been solved by eProsima/Fast-DDS#1295. I hope this can be released in Foxy ASAP.
@MiguelCompany With eProsima/Fast-DDS#1295, does it mean that the changes introduced in #390 are no longer required (or should #390 still be backported to Foxy)?
@jacobperron No, the changes in #390 are still necessary, as there are still chances of asymmetric discovery of endpoints. Nevertheless, we are preparing a design for a different approach to how the service discovery race is handled. Although the new mechanism will be inspired by the DDS-RPC standard, it will not strictly follow it.
We will open a PR in a few days with the proposed design.
@MiguelCompany If you can create a new release tag for the Fast-DDS 2.x.x branch, I can help bloom a new release into rosdistro for Foxy.
I would suggest waiting until eProsima/Fast-DDS#1304 has landed too (which should hopefully be soon; CI is running as we speak).
@dirk-thomas @jacobperron The warning-related PRs have been merged, and I have created tag 2.0.1-rc, which will most likely become release v2.0.1 after we proceed with release testing.
We have completed the design to address the service discovery problem. You can find it documented in PR #418.
Hi, I am facing several issues with service reliability when using ROS 2 Galactic with Fast DDS over SHM, also using the discovery server. The most common occurrences:
- Request is sent by the client and the response isn't received: this happens even with RELIABLE and KEEP_ALL QoS settings. However, sending another request after a timeout works fine (see the retry sketch below).
- Client fails to discover the service during startup when using wait_for_service, even though the service has started up fine.
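As a stopgap for the first bullet, the resend-after-timeout behavior can be wrapped in a small retry helper; this only masks the underlying discovery race, it does not fix it. A sketch, assuming an example_interfaces/AddTwoInts service (names, timeouts, and attempt count are placeholders):

#include <chrono>
#include <cstdint>
#include <memory>
#include <optional>

#include "rclcpp/rclcpp.hpp"
#include "example_interfaces/srv/add_two_ints.hpp"

using AddTwoInts = example_interfaces::srv::AddTwoInts;
using namespace std::chrono_literals;

// Resend the request if no response arrives within `timeout`; give up after
// `max_attempts`. This works around lost replies but does not fix the race.
std::optional<std::int64_t> call_with_retries(
  const rclcpp::Node::SharedPtr & node,
  const rclcpp::Client<AddTwoInts>::SharedPtr & client,
  const AddTwoInts::Request::SharedPtr & request,
  std::chrono::seconds timeout = 2s,
  int max_attempts = 3)
{
  for (int attempt = 0; attempt < max_attempts && rclcpp::ok(); ++attempt) {
    auto future = client->async_send_request(request);
    if (rclcpp::spin_until_future_complete(node, future, timeout) ==
      rclcpp::FutureReturnCode::SUCCESS)
    {
      return future.get()->sum;
    }
    RCLCPP_WARN(node->get_logger(), "no response, retrying (attempt %d)", attempt + 1);
  }
  return std::nullopt;
}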
Is there anything on the ROS 2 roadmap that will help address service reliability in the near term? We just switched to ROS 2, and this issue is hurting our ability to build a stack around it, because a service-driven infrastructure is critical to efficient use of CPU on low-compute platforms.
Hi @aditya2592, are you running into these issues while using Docker, or also on a normal installation? We could reproduce the first issue (see https://github.com/MarcoMatteoBassa/reproduce_client_bug), but only while running in a Docker container (not specifically with fastrtps, it happens with Cyclone too), so we are wondering if there's any Docker-specific problem.
Hi @MarcoMatteoBassa, yes, we are also running through Docker. That's awesome that you were able to reproduce it; I have been having trouble doing that because it is indeed affected by system load and low-compute platforms. On top of that, everything we have so far is on the same system, i.e. on localhost itself. Because of these issues on the same machine, we haven't been able to expand to ROS 2 across machines. We also switched to using the Fast DDS discovery server with the config below, as suggested in the tutorials, though I am still unsure why a discovery server would be needed to discover services on the same machine:
<?xml version="1.0" encoding="UTF-8" ?>
<dds>
    <profiles xmlns="http://www.eprosima.com/XMLSchemas/fastRTPS_Profiles">
        <participant profile_name="super_client_profile" is_default_profile="true">
            <rtps>
                <sendSocketBufferSize>1048576</sendSocketBufferSize>
                <listenSocketBufferSize>4194304</listenSocketBufferSize>
                <builtin>
                    <discovery_config>
                        <discoveryProtocol>SUPER_CLIENT</discoveryProtocol>
                        <discoveryServersList>
                            <RemoteServer prefix="44.53.00.5f.45.50.52.4f.53.49.4d.41">
                                <metatrafficUnicastLocatorList>
                                    <locator>
                                        <udpv4>
                                            <address>127.0.0.1</address>
                                            <port>11811</port>
                                        </udpv4>
                                    </locator>
                                </metatrafficUnicastLocatorList>
                            </RemoteServer>
                        </discoveryServersList>
                    </discovery_config>
                </builtin>
            </rtps>
        </participant>
    </profiles>
</dds>
@aditya2592 did you run into those issues even when running on cyclone-DDS (without the discovery server) on galactic? For Humble apparently something is moving https://discourse.ros.org/t/nav2-issues-with-humble-binaries-due-to-fast-dds-rmw-regression/26128/2
@MiguelCompany this could be related to https://github.com/ros2/rmw_fastrtps/pull/616?
We had a lot more issues with Cyclone, so we stopped experimenting with it early on. There were memory leaks and general latency was not good with Cyclone. But I have seen issues on both Galactic and Humble using Fast DDS with/without the discovery server, with all services and clients on the same machine.
@MarcoMatteoBassa Were you able to run your experiment with the fix in this PR: https://github.com/ros2/rmw_fastrtps/pull/616?
Hi @aditya2592 @fujitatomoya, I just tried using the latest version of osrf/ros2:nightly (verified that the specified fix is there) as the base for my test in https://github.com/MarcoMatteoBassa/reproduce_client_bug, but I can still reproduce the issue.
@MarcoMatteoBassa thanks for checking. I cannot reproduce with your test, but technically there is a race condition for service discovery as described in this thread.
@fujitatomoya thanks for trying. JFI, here is the corresponding issue (opened on the Docker images repo, because I can only reproduce it on Docker): https://github.com/osrf/docker_images/issues/628
The change in #616 has been merged, but not released yet. That means it is definitely not in the docker images yet. We'll have to do a new release of rmw_fastrtps into Rolling, and then rebuild the docker images before the test is valid.
In the meantime, @MarcoMatteoBassa is there any chance you can test from sources instead? You can follow the instructions at https://docs.ros.org/en/rolling/Installation/Alternatives/Ubuntu-Development-Setup.html to do that.