rmw_fastrtps icon indicating copy to clipboard operation
rmw_fastrtps copied to clipboard

Improve service discovery

Open hidmic opened this issue 4 years ago • 35 comments

Feature request

Feature description

Follow-up issue after #390. Prior to that PR, services' discovery would rely on built-in request/reply topics discovery, which can and did lead to races (see https://github.com/ros2/ros2/issues/922). A short-term workaround was then introduced by #390, but not without caveats i.e. blocking behavior of rmw_send_response, non-standard (ab)use of sample identity data.

OMG DDS-RPC 1.0 spec describes both the issues associated with independent discovery of request/reply topics (see Basic Service Mapping under section 7.6.1) and an algorithm to prevent them (see Enhanced Service Mapping under section 7.6.2). Services' implementation in rmw_fastrtps*_cpp could and probably should move in that direction.

hidmic avatar May 28 '20 13:05 hidmic

@richiware Friendly FYI. Does eProsima have any plans to work on this one?

sloretz avatar Jun 05 '20 16:06 sloretz

We were talking this morning about this. We have in mind to support deeply in the future the DDS-RPC standard on FastDDS. But right now the service discovery issue will be addressed by #390 solution. There are other features/issues that need our focus.

richiware avatar Jun 08 '20 11:06 richiware

@richiware #390 is not a solution, it is a workaround at best and seemingly not an effective one for all cases (see https://github.com/ros2/rclcpp/issues/1152 and https://github.com/ros2/ros2/issues/931). To the extent that it is possible, please, do consider addressing this issue at some point in the near future.

hidmic avatar Jun 08 '20 14:06 hidmic

I agree with @hidmic that #390 is a workaround.

Though a full implementation of DDS-RPC standard would be great, that's a lot of effort. Support of an enhanced discovery mechanism to avoid the race condition should be enough to solve the problem.

ivanpauno avatar Jun 08 '20 17:06 ivanpauno

We've added your suggestion on our roadmap.

richiware avatar Jun 12 '20 06:06 richiware

To summarize issues from https://github.com/ros2/ros2/issues/931 with respect to navigation reported by @daisukes

Upgrading to 2.x.x from 1.x.x results in services failing to function (<5% successful calls), which is a safety issue with fast-dds. Rolling back fixes this.

SteveMacenski avatar Jun 18 '20 19:06 SteveMacenski

@richiware With the report mention in the previous comment I think there is a clear regression which should be addressed asap.

dirk-thomas avatar Jun 18 '20 19:06 dirk-thomas

@MiguelCompany could you help in this?

richiware avatar Jul 02 '20 09:07 richiware

@JaimeMartin can we get an update on this? This is a safety critical issue and we may have to switch primary support and instructions for Navigation2 to Cyclone if we can't reliably call services in Foxy. It could genuinely cause complete system failure.

SteveMacenski avatar Jul 07 '20 19:07 SteveMacenski

@SteveMacenski Looking at ros2/ros2#931 I see all the analysis was done before #390 was merged. I know it is a workaround and it has not made its way to a Foxy patch release (it is only in master), but have you checked if it improves the behavior?

Regarding the commit that seems to be responsible, it is the merge commit of eProsima/Fast-DDS#1190. Among other things, that PR fixes a wrong behavior with volatile late-joiners. Before the PR, volatile late-joiners were (incorrectly) receiving old samples in some cases, and fixing that to be compliant with RTPS standard, makes the service discovery race more likely to happen. Commits affecting volatile late-joiners are eProsima/Fast-DDS@a1c35b5bf2f991e381770f24c3131184efce4c39 and eProsima/Fast-DDS@30f8f6f4fff54c99244b01d3d304f0edd11a0aad

Nevertheless, we have just started analyzing ros-planning/navigation2#1772 and, at the same time, designing a proper service discovery on rmw_fastrtps.

MiguelCompany avatar Jul 08 '20 06:07 MiguelCompany

We also had to drop some unit tests as well because of this issue causing them to fail with Fast-RTPS only https://github.com/ros-planning/navigation2/pull/1856 and reports of various elements of navigation2 not working (recovery behaviors) due to services not working.

Have you released this update so that it will be in ROS2 foxy after the sync Friday (https://discourse.ros.org/t/preparing-for-foxy-sync-and-patch-release-2020-07-10/15265)?

SteveMacenski avatar Jul 08 '20 19:07 SteveMacenski

Have you released this update so that it will be in ROS2 foxy after the sync Friday (https://discourse.ros.org/t/preparing-for-foxy-sync-and-patch-release-2020-07-10/15265)?

The workaround in #390 dropped from my radar. I've added it to the project board and will make sure it gets into the next Foxy sync.

jacobperron avatar Jul 13 '20 23:07 jacobperron

@dirk-thomas @jacobperron As you can see here the issue has been solved by eProsima/Fast-DDS#1295. I hope this can be released in foxy ASAP.

MiguelCompany avatar Jul 20 '20 06:07 MiguelCompany

@MiguelCompany With eProsima/Fast-DDS#1295, does it mean that the changes introduced in #390 are no longer required (or should #390 still be backported to Foxy)?

jacobperron avatar Jul 20 '20 21:07 jacobperron

With eProsima/Fast-DDS#1295, does it mean that the changes introduced in #390 are no longer required (or should #390 still be backported to Foxy)?

@jacobperron No, the changes in #390 are still necessary, as there are still chances of asymmetric discovery of endpoints. Nevertheless, we are preparing a design for a different approach to how the service discovery race is handled. Although the new mechanism will be inspired by the DDS-RPC standard, it will not strictly follow it.

We will open a PR in a few days with the proposed design.

MiguelCompany avatar Jul 21 '20 07:07 MiguelCompany

@MiguelCompany If you can create a new release tag for the Fast-DDS 2.x.x branch, I can help bloom a new release into rosdistro for Foxy.

jacobperron avatar Jul 21 '20 16:07 jacobperron

I would suggest to wait until eProsima/Fast-DDS#1304 has landed too (which should hopefully be soon - CI is running as we speak).

dirk-thomas avatar Jul 21 '20 17:07 dirk-thomas

@dirk-thomas @jacobperron The warnings related PRs have been merged, and I have created tag 2.0.1-rc, which will most likely become release v2.0.1 after we proceed with release testing.

MiguelCompany avatar Jul 23 '20 05:07 MiguelCompany

We have done the design to address the problem with the service discovery. You can find it documented on PR #418

IkerLuengo avatar Jul 31 '20 08:07 IkerLuengo

Hi, I am facing several issues with service reliability when using ROS2 Galactic with Fast DDS over SHM. Also using discovery server. Most common occurrences:

  1. Request is sent by client and response isn't received : This happens even with RELIABLE and KEEP_ALL QOS settings. However sending another request after a timeout works fine.
  2. Client fails to discover service during startup when using wait_for_service, even though the service has started up fine.

Is there anything on the ROS2 roadmap that will help address service reliability in the near term? We just switched to ROS2 and this issue is hurting our ability to build a stack around it, because being able to have a service driven infrastructure is critical to efficient use of CPU on low compute platforms.

aditya2592 avatar Jun 22 '22 00:06 aditya2592

HI @aditya2592 , are you guys running into these issues while using docker or also on a normal installation? We could reproduce the first issue (see https://github.com/MarcoMatteoBassa/reproduce_client_bug) but only while running on a docker container (not specifically with fastrtps, happens with cyclone too), so we are wondering if there's any docker-specific problem.

MarcoMatteoBassa avatar Jun 24 '22 07:06 MarcoMatteoBassa

Hi @MarcoMatteoBassa, yes we are also running through a docker, thats awesome that you were able to reproduce it. I have been having trouble doing that because it is indeed effected by system load and low compute platforms. On top of that, everything we have so far is on the same system ie on localhost itself. Because of these issues on the same machine, we havent been able to expand to ROS2 across machines. We also switched to using the Fast DDS discovery server with below config as suggested in the tutorials, though I am still unsure why discovery server would be needed to discover services on the same machine

<?xml version="1.0" encoding="UTF-8" ?>
<dds>
    <profiles xmlns="http://www.eprosima.com/XMLSchemas/fastRTPS_Profiles">
        <participant profile_name="super_client_profile" is_default_profile="true">
            <rtps>
                <sendSocketBufferSize>1048576</sendSocketBufferSize>
                <listenSocketBufferSize>4194304</listenSocketBufferSize>
                <builtin>
                    <discovery_config>
                        <discoveryProtocol>SUPER_CLIENT</discoveryProtocol>
                        <discoveryServersList>
                            <RemoteServer prefix="44.53.00.5f.45.50.52.4f.53.49.4d.41">
                                <metatrafficUnicastLocatorList>
                                    <locator>
                                        <udpv4>
                                            <address>127.0.0.1</address>
                                            <port>11811</port>
                                        </udpv4>
                                    </locator>
                                </metatrafficUnicastLocatorList>
                            </RemoteServer>
                        </discoveryServersList>
                    </discovery_config>
                </builtin>
            </rtps>
        </participant>
    </profiles>
</dds>

aditya2592 avatar Jun 24 '22 15:06 aditya2592

@aditya2592 did you run into those issues even when running on cyclone-DDS (without the discovery server) on galactic? For Humble apparently something is moving https://discourse.ros.org/t/nav2-issues-with-humble-binaries-due-to-fast-dds-rmw-regression/26128/2

MarcoMatteoBassa avatar Jun 24 '22 16:06 MarcoMatteoBassa

@MiguelCompany this could be related to https://github.com/ros2/rmw_fastrtps/pull/616?

fujitatomoya avatar Jun 24 '22 16:06 fujitatomoya

@aditya2592 did you run into those issues even when running on cyclone-DDS (without the discovery server) on galactic? For Humble apparently something is moving https://discourse.ros.org/t/nav2-issues-with-humble-binaries-due-to-fast-dds-rmw-regression/26128/2

We had a lot more issues with cyclone so we stopped experimenting with that early on. There were memory leaks and general latency was not good with cyclone. But I have seen issues on both galactic and humble using Fast DDS with/without discovery server, with all services and clients on the same machine

aditya2592 avatar Jun 24 '22 17:06 aditya2592

@aditya2592 did you run into those issues even when running on cyclone-DDS (without the discovery server) on galactic? For Humble apparently something is moving https://discourse.ros.org/t/nav2-issues-with-humble-binaries-due-to-fast-dds-rmw-regression/26128/2

@MarcoMatteoBassa Were you able to run your experiment with the fix in this MR - https://github.com/ros2/rmw_fastrtps/pull/616 ?

aditya2592 avatar Jul 05 '22 21:07 aditya2592

Hi @aditya2592 @fujitatomoya , I just tried to use the latest version of osrf/ros2:nightly (verified that the specified fix is there) as base for my test in https://github.com/MarcoMatteoBassa/reproduce_client_bug but I can still reproduce the issue

MarcoMatteoBassa avatar Jul 07 '22 13:07 MarcoMatteoBassa

@MarcoMatteoBassa thanks for checking, i cannot reproduce with your test but technically there is a race condition for service discovery as described on this thread.

fujitatomoya avatar Jul 07 '22 16:07 fujitatomoya

@fujitatomoya thanks for trying, JFI, here is the corresponding issue (opend on the docker images repo because I can only reproduce it on docker) https://github.com/osrf/docker_images/issues/628

MarcoMatteoBassa avatar Jul 07 '22 16:07 MarcoMatteoBassa

The change in #616 has been merged, but not released yet. That means it is definitely not in the docker images yet. We'll have to do a new release of rmw_fastrtps into Rolling, and then rebuild the docker images before the test is valid.

In the meantime, @MarcoMatteoBassa is there any chance you can test from sources instead? You can follow the instructions at https://docs.ros.org/en/rolling/Installation/Alternatives/Ubuntu-Development-Setup.html to do that.

clalancette avatar Jul 07 '22 16:07 clalancette