
Status of today's ROS 2 message delivery mechanisms


Opening a new ticket to follow up on some of the discussions that are going on in the ROS 2 Middleware WG and in https://github.com/ros2/rclcpp/issues/1642

As of today, ROS 2 supports various mechanisms to deliver messages between nodes, depending on where they are located and on various optimization configurations.

Intra-process communication

This mechanism allows passing smart pointers between nodes in the same process. Since it is implemented in rclcpp, the RMW is no longer the only entity responsible for passing messages around. This may be considered a philosophical/design problem, but it also has very practical implications. (A short usage sketch follows the lists below.)

PROS:

  • It has the best performance in terms of CPU and latency, with small RAM overhead.
  • It's independent of the chosen RMW.

CONS:

  • Maintenance cost: pub/sub/services/clients/actions need to be re-implemented as waitable rclcpp entities, and the QoS settings need to be re-implemented as well.
  • Huge overhead when you have both inter- and intra-process readers: https://github.com/ros2/rclcpp/issues/1722
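A minimal sketch of how a node opts into this mechanism with today's standard rclcpp API (the topic name and message type are just illustrative):

```cpp
#include <memory>

#include "rclcpp/rclcpp.hpp"
#include "std_msgs/msg/string.hpp"

int main(int argc, char ** argv)
{
  rclcpp::init(argc, argv);

  // Opt the node into intra-process delivery: rclcpp, not the RMW,
  // routes messages to subscriptions living in the same process.
  auto node = rclcpp::Node::make_shared(
    "ipc_demo", rclcpp::NodeOptions().use_intra_process_comms(true));

  auto pub = node->create_publisher<std_msgs::msg::String>("chatter", 10);

  // Publishing a unique_ptr lets ownership move to the intra-process
  // subscription, avoiding a copy where possible.
  auto msg = std::make_unique<std_msgs::msg::String>();
  msg->data = "hello";
  pub->publish(std::move(msg));

  rclcpp::shutdown();
  return 0;
}
```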

Loaned message

This mechanism asks the RMW to create a message (which can be allocated from a shared-memory segment). The message can then be efficiently passed between multiple processes on the same machine. See more details about performance in https://github.com/ros2/rclcpp/issues/1642#issuecomment-900387789

PROS:

  • Good performance for big messages

CONS:

  • Uses a separate set of APIs (see the sketch after this list), so a node must know in advance whether it will need to use loaned messages.
  • Bad performance for small messages passed within a single process (equivalent to running without any optimization).
  • No support for dynamic data types.
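A sketch of the separate API mentioned in the first CON above, using rclcpp's loaned-message calls; whether the loan is actually zero-copy depends on the chosen RMW and on the message type being fixed-size. The fallback branch shows why a node must handle both paths:

```cpp
#include <utility>

#include "rclcpp/rclcpp.hpp"
#include "std_msgs/msg/float64.hpp"

void publish_maybe_loaned(rclcpp::Publisher<std_msgs::msg::Float64> & pub)
{
  if (pub.can_loan_messages()) {
    // Ask the RMW for a message; with a middleware that supports it,
    // this can come from a shared-memory segment.
    auto loaned = pub.borrow_loaned_message();
    loaned.get().data = 42.0;
    // Ownership of the loan returns to the middleware on publish.
    pub.publish(std::move(loaned));
  } else {
    // Fallback path: the RMW cannot loan, so publish a regular message.
    std_msgs::msg::Float64 msg;
    msg.data = 42.0;
    pub.publish(msg);
  }
}
```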

Standard RMW delivery

This is the standard mechanism that the RMW uses to deliver messages, regardless of the location of the entities.

PROS:

  • Common API

CONS:

  • Bad performance for single-process and single-machine communication.

The purpose of this ticket is to have a discussion on how to improve these mechanisms and have them coexist better.

FYI @wjwwood @mauropasse @ivanpauno @fujitatomoya

alsora (Aug 19 '21)

In my mind, these are the main requirements for the ROS publish-subscribe APIs:

  • A single, configurable API allows a node to publish messages. The user shouldn't write code with one of these specific mechanisms in mind. Configuration allows enabling/disabling optimizations or delivery modes (e.g. intra-process only); a sketch of the existing per-entity knobs follows this list.
  • Minimize the number of copies of the messages. I would say that multiple copies are allowed only if multiple mechanisms are being used under the hood, i.e. you have both a subscriber in the same process and a remote one.
  • Communication mechanisms should not interfere with each other: i.e. DDS delivery should always know which subscribers to ignore because they have already been served through another mechanism.
  • Single-process performance comparable to functional APIs.
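As a reference point for the first requirement, rclcpp already exposes part of this configurability per entity today; a minimal sketch (topic and type are illustrative):

```cpp
#include "rclcpp/rclcpp.hpp"
#include "std_msgs/msg/string.hpp"

void add_subscription(const rclcpp::Node::SharedPtr & node)
{
  // Per-subscription override of the node-level intra-process setting
  // (Enable, Disable, or NodeDefault).
  rclcpp::SubscriptionOptions options;
  options.use_intra_process_comm = rclcpp::IntraProcessSetting::Enable;

  auto sub = node->create_subscription<std_msgs::msg::String>(
    "chatter", 10,
    [](std_msgs::msg::String::UniquePtr msg) {
      // Taking a unique_ptr lets the last intra-process subscriber
      // receive the published message without a copy.
      (void)msg;
    },
    options);
  (void)sub;  // a real node would keep the subscription alive
}
```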

alsora (Aug 19 '21)

Quoting @wjwwood

I wonder about the hybrid approach we had in the past, where the pointers and actual data were handled in rclcpp, but a message was passed via rmw (therefore via the DDS vendor) to notify the subscription. That might achieve good results with large messages, where currently the rclcpp IPC (of today) outperforms the loaned message approach.

in which scenario are you thinking about using the "hybrid approach"?

The latency for a message of size X should be approximately equivalent to: 10 B with RCLCPP IPC OFF (for the DDS notification) + X with RCLCPP IPC ON (for handling the data in rclcpp). For example, with FastDDS and X = 250 KB I would expect a latency of 200 µs + 50 µs = 250 µs.
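Spelled out as a formula, the hybrid model above is (my paraphrase, reusing the same benchmark numbers, not a new measurement):

```latex
L_{\text{hybrid}}(X) \approx L_{\text{IPC off}}(10\,\text{B}) + L_{\text{IPC on}}(X),
\qquad \text{e.g. FastDDS: } L_{\text{hybrid}}(250\,\text{KB}) \approx 200\,\mu\text{s} + 50\,\mu\text{s} = 250\,\mu\text{s}.
```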

This looks very similar to the performance of today's loaned message, but it would only work for a single process. Numbers extracted from https://github.com/ros2/rclcpp/issues/1642#issuecomment-900387789

Moreover, this wouldn't solve what I think is today's biggest problem of intra-process communication, i.e. the double delivery when inter-process is also active (I looked back in the code, and the workarounds for this were in place before we refactored intra-process comms).

alsora (Aug 19 '21)

The latency for a message of size X should be approximately equivalent to: 10 B with RCLCPP IPC OFF (for the DDS notification) + X with RCLCPP IPC ON (for handling the data in rclcpp). For example, with FastDDS and X = 250 KB I would expect a latency of 200 µs + 50 µs = 250 µs.

I was imagining that it wouldn't simply be adding the times together, but instead there might be some saving from combining them. But even if that's not the case, taking the case you mentioned: if you look at the latency for CycloneDDS with a 600 KB message, you'd get:

  • Just sending 600 KB via loaned message: ~300 µs
  • Sending the notification (10 B) with IPC off (~100 µs) + sending 600 KB via intra-process comm (~125 µs) = ~225 µs

Did I interpret that right?

For FastDDS @ 600 KB it would be:

  • Just sending 600 KB via loaned message: ~210 µs
  • Sending the notification (10 B) with IPC off (~140 µs) + sending 600 KB via intra-process comm (~75 µs) = ~215 µs

Which is within the noise of the measurement I'd guess.

Now, these numbers are approximations, and I have no idea about the variability (how many times did you sample each scenario? is it an average?). Plus, I don't know why these times are the way they are: is it all working as intended, with the approach dictating the latency, or are there bugs that, once fixed, would change the results?

Also, if using the "normal" or "traditional" publishing method is equivalent to or better than loaned messages, then I'd imagine people would prefer not to use the loaned message API, because it's a bit more difficult to use correctly (though we have tried to make it easy to do).

Also, we don't currently have an API for loaned Services or Actions, so we'd need to add that interface, whereas it might be possible to significantly improve the performance of Services and Actions with the current API by extending the rclcpp intra-process machinery, without disrupting existing user code.

This looks very similar to the performance of today's loaned message, but it would only work for a single process.

The point about only supporting a single process is true.

Moreover, this wouldn't solve what I think is today's biggest problem of intra-process communication, i.e. the double delivery when inter-process is also active (I looked back in the code, and the workarounds for this were in place before we refactored intra-process comms).

That's also true, but not something that cannot be overcome, in my opinion.

wjwwood (Aug 19 '21)

Did I interpret that right?

Yes, that's how I expect the performance to be. We can probably say that we have improvements only for messages around 300 to 400 KB or above.

Plus, I don't know why these times are the way they are: is it all working as intended, with the approach dictating the latency, or are there bugs that, once fixed, would change the results?

This is an excellent question. I think that there are no "obvious bugs", but the approaches used could still be improved.

Having said that, I continue to think that moving back to the "old" intra-process is not the correct thing to do.

This would be a major regression in performance: even if DDS implementations have improved a lot over the last two years, the latency would be at least double the current one. Moreover, the approach would double the number of DDS entities in the system (you need publishers/subscribers to send the small notification messages), rather than relying on "rclcpp waitable entities", which are much cheaper. This would greatly increase the RAM usage of the application (just look at the difference between enabling and disabling parameters on nodes).

My proposal of considering a common RMW API to handle inter- and intra-process communication is tied to the fact that we can make this API as performant as the current intra-process implementation in the single-process scenario.

The only advantage that I see in the "old" intra-process is that we would have direct access to the DDS quality of service settings, rather than having to re-implement them. However, even there we would still have the problem that the history size wouldn't be honored.

About the double delivery problem: I agree that it can be fixed, but this is true for both IPC approaches. The question is, is it worth investing resources in this?

Could the intra-process machinery be moved to the RMW layer? For example, into an rmw_common package that every RMW implementation can access. There, before data types are serialized/converted to DDS types, we would also have access to implementation-specific functions that could handle all the corner cases while still preserving performance.

alsora (Aug 20 '21)

The old way is definitely not ideal; it just helps close the QoS gap.

My proposal of considering a common RMW API to handle inter- and intra-process communication is tied to the fact that we can make this API as performant as the current intra-process implementation in the single-process scenario.

Maybe... At least we can get close.

The question is, is it worth investing resources in this?

I think so, because being able to ignore delivery of data from "local" sources is potentially useful beyond intra-process.

Could the intra-process machinery be moved to the RMW layer? For example, into an rmw_common package that every RMW implementation can access. There, before data types are serialized/converted to DDS types, we would also have access to implementation-specific functions that could handle all the corner cases while still preserving performance.

Yes, but I think it would be a lot harder to implement, and it has all the same issues as the rclcpp one: I do not believe it addresses the double delivery issue (or at least not in any way that couldn't also work outside the rmw layer), nor does it get us the QoS features for free.

If there's a way to tell the middleware not to deliver the data twice that can just be a pub/sub option at the rmw layer.

The only benefit of this approach is that other languages could use it, but it makes everything else about the implementation harder. For example, supporting custom types would then require code generation (most likely), we would have to reinvent the wheel for ownership and reference counting (since we don't have unique/shared ptr), and we would have to enforce type safety ourselves.

wjwwood (Aug 21 '21)

If there's a way to tell the middleware not to deliver the data twice that can just be a pub/sub option at the rmw layer.

I think we need two separate pieces here:

  • when you create a subscription with intra-process comm enabled, this information needs to be propagated down to the RMW (FLAG 1);
  • when you publish a message both inter- and intra-process, you need to propagate this to the RMW as well (FLAG 2).

Then every publication that carries FLAG 2 should not be delivered to subscriptions that were created with FLAG 1 enabled. This needs to be evaluated on each publication; a hypothetical sketch of the rule follows.
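A purely hypothetical sketch of that rule; the flag names below are invented for illustration and exist in no RMW today:

```cpp
// HYPOTHETICAL: illustrates the proposed filtering rule only; these
// flags are not part of any current rmw API.
//
// FLAG 1: set when the subscription was created with intra-process
//         comm enabled.
// FLAG 2: set on each publication that is also delivered intra-process.
bool should_deliver_via_rmw(bool publication_has_flag2,
                            bool subscription_has_flag1)
{
  // If the message already reaches this subscription through the
  // intra-process path, the RMW must skip it to avoid double delivery.
  return !(publication_has_flag2 && subscription_has_flag1);
}
```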

The mechanism could be greatly simplified if the intra-process comm optimization were enabled at the process level rather than at the pub/sub level.

The added difficulty of having it at the pub/sub level is that you may have a process with 1 sub with intra-process comm enabled and 2 pubs on that topic, one with intra-process comm enabled and one with it disabled.

I know DDS has options to say "ignore local publications" in a data reader, but that wouldn't work in this case.

alsora (Aug 21 '21)

when you publish a message both inter- and intra-process, you need to propagate this to the RMW as well (FLAG 2).

Why is that? I think the publisher just needs to know not to send data to the intra-process subscriptions, and that's it. The middleware knows about this already.

The mechanism could be greatly simplified if the intra-process comm optimization were enabled at the process level rather than at the pub/sub level.

It's done at the context level, but I don't understand this point; can you elaborate? How else can it work than at a pub/sub level? Do you mean configured less granularly?

The added difficulty of having it at the pub/sub level is that you may have a process with 1 sub with intra-process comm enabled and 2 pubs on that topic, one with intra-process comm enabled and one with it disabled.

Why is that a problem?

I know DDS has options to say "ignore local publications" in a data reader, but that wouldn't work in this case.

Why not? The data reader could ignore the publications from one data writer (the one with intra-process comms enabled) and not ignore the other. We might need to improve the rmw API related to this.

wjwwood (Aug 23 '21)

Yes, you are right. The second flag (passed on every publication) is not needed.

About the DDS option to "ignore local publications": if it's set to true, it will ignore everything that comes from the same participant. I don't think it currently allows ignoring publications from only a specific data writer.

alsora (Aug 24 '21)

Could the intra-process machinery be moved to the RMW layer?

Philosophically, this is the approach I would very much like to see taken. I agree with @wjwwood that there are a lot of implementation difficulties, but they are not insurmountable (most of them come down to time-to-implement and maintenance cost) and I think it's worth dealing with them in order to give every language the same access to all available message delivery mechanisms.

However, I can see doing it in rclcpp first and then refactoring it down to the RMW layer in the future, once the kinks have all been worked out.

gbiggs (Aug 25 '21)

I would like to add a perspective to the discussion. I care about having these optimizations available from the rcl layer down, since any extra instrumentation in rclcpp has to be re-implemented in the other rcl client libraries. An example that is important to me is Ros2 For Unity, which uses the rcl interface and cares a lot about communication performance.

adamdbrw (Aug 26 '21)

It's obviously better to make the feature available to more users via rcl, but it is much more work, in my opinion. That trade-off (and whether or not anyone has the resources to actually do it properly) is what's in question in my mind, not whether or not we'd like it to exist. I don't relish the idea of re-implementing type safety, ownership via reference counting, and move semantics through the C API, whereas in rclcpp we get those for free-ish via C++.

Out of curiosity, why are you using rcl rather than rclcpp? Isn't it possible to use C++ directly from C#?

wjwwood (Aug 26 '21)

Marshaling templates (and, worse, custom types that we don't necessarily know about when compiling the core library), basically. There are some other reasons, such as memory-model incompatibility, but this is the main one.

adamdbrw (Aug 26 '21)

Let me try to recap all the possible action items mentioned so far.

  1. Fix the double delivery problem when inter- and intra-process are enabled at the same time.
  2. Extend intra-process to support services and actions.
  3. Implement the missing QoS features in intra-process.
  4. Improve the performance of the loaned message APIs.
  5. Extend loaned messages to support services and actions.
  6. Implement a new intra-process mechanism in rclcpp.
  7. Implement a new intra-process mechanism in the rcl/rmw layer.

I hope I haven't missed anything. Having said that, this is my opinion:

  • (4) and (5) should be the main priorities. We don't have any alternative here, and these items are needed in order to fully exploit multi-process applications. Depending on the performance improvements, loaned messages could even become the standard and replace intra-process, which would basically remove the need for all the other points.
  • If the performance improvements from (4) are not enough, then (1) should be addressed. From the above discussion it seems doable; moreover, the implemented solution will keep working no matter which future changes are made to the intra-process optimization.
  • The remaining items (2) (3) (6) (7) are all very intertwined. I agree on the complexity of implementing intra-process outside rclcpp; however, if there are people interested in having this optimization available in other client libraries, I would much prefer that they help move the current implementation down, rather than re-implement it somewhere else. This is the right time for taking such a decision, i.e. before we make the intra-process optimization feature complete. About changing the intra-process optimization implementation: I think it's something we can do, as long as we don't incur any performance overhead with respect to the current one. Despite improvements in the RMW libraries, an implementation similar to the old "mixed" approach would result in additional latency (almost double) and RAM usage (double the number of RMW publishers and subscribers).

alsora (Aug 27 '21)

This issue has been mentioned on ROS Discourse. There might be relevant details there:

https://discourse.ros.org/t/nav2-composition/22175/1

ros-discourse (Sep 06 '21)

What is the status of the plan? Is there an ongoing effort that I could observe or contribute to?

adamdbrw (Nov 08 '21)

@adamdbrw no one is working on it at the moment. Some people are working on extending the existing rclcpp intra-process logic to support custom types, but no one is working on the rcl/rmw layer version of intra-process, as far as I am aware.

wjwwood (Nov 29 '21)

The PR https://github.com/ros2/rclcpp/pull/1847 extends the current rclcpp intra-process mechanism to also support communication between clients and services. This addresses point 2 mentioned here: https://github.com/ros2/rclcpp/issues/1750#issuecomment-907354573. I'll soon extend the PR to also support actions.

mauropasse (Dec 17 '21)

  • No support for dynamic data types.

Hi @alsora

I am the developer of https://github.com/ZhenshengLee/ros2_shm_msgs, which provides an easy way to use zero-copy transport with rclcpp's loaned-message API.

The repo provides 8k, 512k, 1m, 2m, 4m, and 8m bounded-size message types, along with APIs (like cv_bridge and pcl_conversions) to access those types. A bridge to rviz is also provided, so usability and scalability are ensured.

The repo has received some comments, so I'd like to leave a couple of questions for you:

  1. Is there some way to avoid the copy when initializing a loaned message? From https://github.com/MatthiasKillat/ros2_shm_vision_demo/issues/12
  2. What is the current thinking on zero-copy with dynamic data types? (I've been searching with Google but ...)

Looking forward to your reply, thanks!

EDIT: this (the Type Negotiation feature) may be part of the answer for multiple data types: https://github.com/ros-infrastructure/rep/blob/master/rep-2009.rst

ZhenshengLee (Jun 21 '22)