rmw_fastrtps
Fast-DDS service discovery redesign
This is the design for the solution to #392.
On the server, requests are held back until the client's response subscriber is matched. Once the matching occurs, all the pending requests corresponding to that client are sent for processing.
On the client, a list of fully matched servers is kept, i.e., those for which both the request subscriber and the response publisher are matched. Only if this list is not empty does rmw_service_server_is_available return true.
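As a rough illustration of the client-side bookkeeping this implies (the types and names below are invented for this sketch and are not the actual rmw_fastrtps data structures):

```cpp
// Minimal sketch of the idea, with invented names (not the actual
// rmw_fastrtps data structures). Server GUIDs are treated as opaque ids.
#include <array>
#include <cstdint>
#include <mutex>
#include <set>

using Guid = std::array<uint8_t, 16>;  // stand-in for a DDS GUID

struct ClientMatchState
{
  std::mutex mutex;
  // Servers whose request subscriber has matched our request publisher.
  std::set<Guid> request_matched_servers;
  // Servers whose response publisher has matched our response subscriber.
  std::set<Guid> response_matched_servers;
};

// A server counts as "fully matched" only when it appears in both sets;
// the rmw_service_server_is_available equivalent returns true iff at
// least one server is fully matched.
bool server_is_available(ClientMatchState & state)
{
  std::lock_guard<std::mutex> lock(state.mutex);
  for (const Guid & guid : state.request_matched_servers) {
    if (state.response_matched_servers.count(guid) != 0) {
      return true;
    }
  }
  return false;
}
```

The server side would keep an analogous per-client structure, holding back received requests until the matching response subscriber shows up.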
The correspondence between publisher and subscriber GUIDs on one endpoint (server or client) is shared with the other endpoint through the USER_DATA QoS. If the remote endpoint does not implement this solution (i.e., does not share the GUID correspondence), the legacy behavior is kept.
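For illustration, sharing that correspondence could look something like the following; the property keys and string layout here are made up for the example and are not the exact USER_DATA format used by the implementation:

```cpp
// Illustration only: the property keys and string layout are invented for
// this example, not the exact USER_DATA format used by the implementation.
#include <string>

// One endpoint advertises which response-reader GUID belongs to the same
// client as a given request-writer GUID (GUIDs passed as strings here).
std::string make_guid_correspondence(
  const std::string & request_writer_guid,
  const std::string & response_reader_guid)
{
  return "requestwriterguid=" + request_writer_guid +
         ";responsereaderguid=" + response_reader_guid + ";";
}

// The remote endpoint parses the correspondence back out of the received
// USER_DATA. If the keys are absent, the peer does not implement this
// solution and the legacy behavior is kept.
bool parse_guid_correspondence(
  const std::string & user_data,
  std::string & request_writer_guid,
  std::string & response_reader_guid)
{
  auto get_value = [&user_data](const std::string & key, std::string & out) {
      const std::string token = key + "=";
      const auto begin = user_data.find(token);
      if (begin == std::string::npos) {
        return false;
      }
      const auto value_begin = begin + token.size();
      const auto end = user_data.find(';', value_begin);
      if (end == std::string::npos) {
        return false;
      }
      out = user_data.substr(value_begin, end - value_begin);
      return true;
    };
  return get_value("requestwriterguid", request_writer_guid) &&
         get_value("responsereaderguid", response_reader_guid);
}
```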
Corrected as per suggestions from @hidmic
LGTM with @hidmic comments addressed
@IkerLuengo friendly ping.
Hi Michel,
Iker comes back tomorrow from vacation.
On Mon., Aug. 24, 2020, 18:26, Michel Hidalgo [email protected] wrote:
@IkerLuengo https://github.com/IkerLuengo friendly ping.
Modified as suggested by @hidmic.
Overall, I think the idea makes a lot of sense. Thanks for pushing on this.
I have a few concerns inline which are mostly about some implementation details, but nothing that would prevent us from going forward with this idea.
Finally, @eboasson I know this particular article is about Fast DDS, but I think a similar idea would apply to Cyclone DDS as well. You've mentioned as much in some of your previous feedback (like ros2/rmw_cyclonedds#187 (comment)). Would you mind taking a look here and leaving your thoughts? If this becomes a generic solution, then I would take the implementation-independent parts out of this document and put them somewhere more generic (maybe https://design.ros2.org). Thanks.
There are three independent problems here:

1. Associating the two endpoints with the client/service. What is proposed here is not so different from what Cyclone's RMW currently implements, but:
   - C. adds a clientid/serviceid "property" to the USER_DATA QoS for the client/service's request and response endpoints, which I think is much more elegant because any reader/writer can immediately be associated with a client/service.
   - C. follows the formatting of the USER_DATA of the participant. (Originally used for node name/type, today only used for the enclave property.) The content is the participant's GUID prefix followed by a unique number, formatted as a series of 2-digit hex numbers separated by dots, but any unique string will do. (A rough formatting sketch is shown after this list.)
2. Discovery of the service by the client being independent of discovery of the client by the service. Receipt of a request implies most of the client has been discovered, just not that the client's response reader has been discovered yet. Here, C. chooses to delay the sending of the response rather than the handling of the request, to avoid having to store and track deferred requests. In practice the wait generally ends up being a very short delay, and only the first time a new client issues a request immediately after being created. The fundamental problem here is the (current) inability to detect that the remote side has completed discovery. The effort spent on designing workarounds (which this discovery redesign is, too) would arguably be better spent on fixing DDS to guarantee this. It is something that can be dealt with within the DDS implementations; the only thing requiring updates to the specifications is providing such a guarantee across implementations.
3. The ROS2 service model doesn't really allow for network connectivity issues: once "service_is_available" returns true, the assumption is that a service is and will remain available, and that sending a request will result in receiving a response. There are failure cases where the client will not be able to detect with certainty that a response will not be received (it is a bit of an edge case, but it can happen if the service loses sight of the client temporarily, especially in combination with multiple outstanding requests). Most cases can be detected, but at a pretty significant increase in complexity in the RMW implementation.
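As a rough sketch of the id formatting described in the first item above (participant GUID prefix plus a unique number, printed as dot-separated 2-digit hex values); the function name and sizes are illustrative and not taken from either code base:

```cpp
// Illustrative only: function name and sizes are not taken from either
// code base. Formats a client/service id as the 12-byte participant GUID
// prefix plus a 4-byte unique number, as dot-separated 2-digit hex values.
#include <array>
#include <cstdint>
#include <cstdio>
#include <string>

std::string format_client_id(
  const std::array<uint8_t, 12> & guid_prefix, uint32_t unique_number)
{
  std::string id;
  char buf[4];
  for (uint8_t byte : guid_prefix) {
    std::snprintf(buf, sizeof(buf), "%02x.", static_cast<unsigned>(byte));
    id += buf;
  }
  for (int shift = 24; shift >= 0; shift -= 8) {
    std::snprintf(buf, sizeof(buf), "%02x", (unique_number >> shift) & 0xffu);
    id += buf;
    if (shift != 0) {
      id += '.';
    }
  }
  return id;  // 16 dot-separated hex bytes
}
```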
Probably the most sensible thing to do is to handle client and service operations and discovery (as well as their possible disappearance) in a common library (so handling reading/writing of requests and responses, and modelling the availability of a new request/response by triggering a guard condition). This could be the rmw_dds_common library, but I suspect that getting this bit right is an issue for more than just the DDS RMW layers and it might well make sense to do it in rcl (or some other, new library).
C. adds a clientid/serviceid "property" to the USER_DATA QoS for the client/service's request and response endpoints, which I think is much more elegant because any reader/writer can immediately be associated with a client/service.
:+1: The client/service id (here) solution sounds more elegant.
C. follows the formatting of the USER_DATA of the participant.
:+1: I would also use the same format.
The ROS2 service model doesn't really allow for network connectivity issues: once "service_is_available" returns true, the assumption is that a service is and will remain available, and that sending a request will result in receiving a response
:+1:
Probably the most sensible thing to do is to handle client and service operations and discovery (as well as their possible disappearance) in a common library (so handling reading/writing of requests and responses, and modelling the availability of a new request/response by triggering a guard condition)
I'm not sure I understand, would the client have a guard condition to indicate if the server availability changed or something like that?
This could be the rmw_dds_common library, but I suspect that getting this bit right is an issue for more than just the DDS RMW layers and it might well make sense to do it in rcl
The discovery issue on the "remote" side sounds like a DDS-specific problem (maybe it can happen in other rmw implementations, but it doesn't sound completely general). For all other changes to handle "network connectivity issues" that are not DDS-specific, I think the best thing to do is to handle them in rcl (which will likely involve extending the rcl API).
2. The fundamental problem here is the (current) inability to detect that the remote side has completed discovery. The effort spent on designing workarounds (which this discovery redesign is, too) would arguably be better spent on fixing DDS to guarantee this. It is something that can be dealt with within the DDS implementations; the only thing requiring updates to the specifications is providing such a guarantee across implementations.
I'm not sure I understand this bit. Fundamentally, there is a race that exists here because you have 2 independent topics that implement the service. Both this solution (and the slightly different one in Cyclone) resolve that race by binding the two topics together somehow. How would you resolve this differently in the specification?
The discovery issue on the "remote" side sounds like a DDS-specific problem (maybe it can happen in other rmw implementations, but it doesn't sound completely general). For all other changes to handle "network connectivity issues" that are not DDS-specific, I think the best thing to do is to handle them in rcl (which will likely involve extending the rcl API).
Yeah, agreed here. I can definitely imagine other RMWs where the service is a true RPC call (more like DDS-RPC), and so you don't have this particular issue. So I think this part makes sense to implement in rmw_dds_common.
Probably the most sensible thing to do is to handle client and service operations and discovery (as well as their possible disappearance) in a common library (so handling reading/writing of requests and responses, and modelling the availability of a new request/response by triggering a guard condition)
I'm not sure I understand, would the client have a guard condition to indicate if the server availability changed or something like that?
What I was thinking of is the following: suppose the service defers requests from clients for which it hasn't yet discovered the response reader; the service implementation within the RMW layer then suddenly has to monitor two sources of requests. The first is the requests arriving at the data reader for requests, and the second is the deferred requests for which discovery has now completed.
In the implementation, the application typically sits there waiting in the rmw_wait operation, but that one (ideally) maps to a DDS waitset (the impedance mismatch between the DDS waitset and the RMW waitset will hopefully get resolved at some point). The set of deferred requests lives outside DDS, and so you'd need some mechanism to trigger the waitset when a deferred request becomes ready. There are some complications on that front, none of them insurmountable, but they do add a fair amount of complexity.
Reworking the request handling to respond to the discovery information already exchanged by rmw_dds_common and to the receipt of requests, but driven by a separate thread inside the RMW implementation (much like the discovery thread), would make life quite straightforward again. Then, any time a request becomes ready, that common implementation could simply trigger a guard condition, and at that point, as far as the waitset is concerned, a service is simply associated with a guard condition that gets triggered whenever there is a request waiting for the service.
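A condensed sketch of that shape, with all names invented for the example (in a real RMW implementation the caller-supplied trigger would fire the service's guard condition):

```cpp
// Sketch of the shape of such a helper; all names are invented for the
// example. A worker thread watches for requests that have become ready
// (received and with the client's response reader discovered) and fires a
// caller-supplied trigger, which in a real RMW implementation would
// trigger the service's guard condition.
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <thread>

class DeferredRequestDispatcher
{
public:
  explicit DeferredRequestDispatcher(std::function<void()> trigger_guard_condition)
  : trigger_(std::move(trigger_guard_condition)),
    worker_([this] {run();})
  {}

  ~DeferredRequestDispatcher()
  {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      stop_ = true;
    }
    cv_.notify_one();
    worker_.join();
  }

  // Called when a request arrives with its client fully discovered, or when
  // discovery completes for a client that already has deferred requests.
  void notify_request_ready()
  {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      ++ready_requests_;
    }
    cv_.notify_one();
  }

private:
  void run()
  {
    std::unique_lock<std::mutex> lock(mutex_);
    while (!stop_) {
      cv_.wait(lock, [this] {return stop_ || ready_requests_ > 0;});
      while (ready_requests_ > 0) {
        --ready_requests_;
        lock.unlock();
        trigger_();  // wake any waitset the service is part of
        lock.lock();
      }
    }
  }

  std::function<void()> trigger_;
  std::mutex mutex_;
  std::condition_variable cv_;
  std::size_t ready_requests_ = 0;
  bool stop_ = false;
  std::thread worker_;  // declared last so it starts after the other members
};
```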
Such an implementation would ideally be done in rmw_dds_common, I'd say. Unless:
This could be the rmw_dds_common library, but I suspect that getting this bit right is an issue for more than just the DDS RMW layers and it might well make sense to do it in rcl
The discovery issue on the "remote" side sounds like a DDS-specific problem (maybe it can happen in other rmw implementations, but it doesn't sound completely general). For all other changes to handle "network connectivity issues" that are not DDS-specific, I think the best thing to do is to handle them in rcl (which will likely involve extending the rcl API).
I agree that the particular problem is quite DDS-specific at the moment. My view is that DDS ought to be improved to allow waiting until the remote has discovered the local entities (at which point service_is_available could just build on that, without a need to defer requests or delay sending the response). My thinking was that if DDS got this wrong for 15 years and counting, perhaps there are other middlewares that didn't get this quite right either, and that it could be useful to have a service mechanism that takes care of this problem more generally. If so, moving it into rcl or a new library could be useful. But it could equally well be done in rmw_dds_common first; it can always be generalized later.
(@clalancette, perhaps this also answers your question?)
In the implementation, the application typically sits there waiting in the rmw_wait operation, but that one (ideally) maps to a DDS waitset (the impedance mismatch between the DDS waitset and the RMW waitset will hopefully get resolved at some point). The set of deferred requests lives outside DDS, and so you’d need some mechanism to trigger the waitset when a deferred request becomes ready. There are some complications on that front, none of them insurmountable, but they do add a fair amount of complexity.
Ah yeah, that's true. There's a "manually implemented" wait set using a mutex/cond_var pair and listeners in rmw_fastrtps, so in this case it's easier to "trigger" the wait set.
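For reference, a condensed sketch of that pattern, with invented names (not the actual rmw_fastrtps internals):

```cpp
// Condensed sketch of the pattern, with invented names (not the actual
// rmw_fastrtps internals): a listener callback notifies a condition
// variable that the "manual" wait set blocks on.
#include <condition_variable>
#include <mutex>

struct ManualWaitSet
{
  std::mutex mutex;
  std::condition_variable cv;
  bool data_ready = false;
};

// Called from a DDS listener, or, with this design, whenever a deferred
// request becomes ready.
void trigger(ManualWaitSet & ws)
{
  {
    std::lock_guard<std::mutex> lock(ws.mutex);
    ws.data_ready = true;
  }
  ws.cv.notify_one();
}

// Simplified stand-in for the blocking part of rmw_wait.
void wait_for_data(ManualWaitSet & ws)
{
  std::unique_lock<std::mutex> lock(ws.mutex);
  ws.cv.wait(lock, [&ws] {return ws.data_ready;});
  ws.data_ready = false;
}
```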
Reworking the request handling to respond to the discovery information already exchanged by rmw_dds_common and to the receipt of requests, but driven by a separate thread inside the RMW implementation (much like the discovery thread), would make life quite straightforward again. Then, any time a request becomes ready, that common implementation could simply trigger a guard condition, and at that point, as far as the waitset is concerned, a service is simply associated with a guard condition that gets triggered whenever there is a request waiting for the service.
That sounds reasonable to me.
Unrelated note: the impedance mismatch between the DDS and ROS 2 wait sets is what triggered the discussions in https://github.com/ros2/design/pull/305 and in this Discourse post.
@IkerLuengo friendly ping !