
shared memory with iceoryx appears to use non-compatible topics

Aposhian opened this issue 2 years ago • 10 comments

Looking at the output of iox-introspection-client, it looks like iceoryx is trying to transfer topics that are not of fixed width or that have incompatible history depths, primarily /parameter_events. Here is the reproduction: https://github.com/fireflyautomatix/nav2-compose/tree/iceoryx

I see the following output:

nav2_1         | [component_container_isolated-4] 1657038794.962033 [4] component_: DDS reader with topic rt/parameter_events : iceoryx subscriber - TOO_MANY_CHUNKS_HELD_IN_PARALLEL -could not take sample
nav2_1         | [component_container_isolated-4] 1657038794.962202 [4] component_: DDS reader with topic rt/parameter_events : iceoryx subscriber - TOO_MANY_CHUNKS_HELD_IN_PARALLEL -could not take sample
nav2_1         | [component_container_isolated-4] 1657038794.962930 [4] component_: DDS reader with topic rt/parameter_events : iceoryx subscriber - TOO_MANY_CHUNKS_HELD_IN_PARALLEL -could not take sample
nav2_1         | [component_container_isolated-4] 1657038794.963038 [4] component_: DDS reader with topic rt/parameter_events : iceoryx subscriber - TOO_MANY_CHUNKS_HELD_IN_PARALLEL -could not take sample
roudi_1        | 2022-07-05 16:33:14.963 [Warning]: Chunk history request exceeds history capacity! Request is 100. Capacity is 1.
nav2_1         | [component_container_isolated-4] 1657038794.963066 [4] component_: DDS reader with topic rt/parameter_events : iceoryx subscriber - TOO_MANY_CHUNKS_HELD_IN_PARALLEL -could not take sample
nav2_1         | [component_container_isolated-4] 1657038794.964422 [4] component_: DDS reader with topic rt/parameter_events : iceoryx subscriber - TOO_MANY_CHUNKS_HELD_IN_PARALLEL -could not take sample
nav2_1         | [component_container_isolated-4] 1657038794.964529 [4] component_: DDS reader with topic rt/parameter_events : iceoryx subscriber - TOO_MANY_CHUNKS_HELD_IN_PARALLEL -could not take sample
nav2_1         | [component_container_isolated-4] 1657038795.041980 [4] component_: DDS reader with topic rt/parameter_events : iceoryx subscriber - TOO_MANY_CHUNKS_HELD_IN_PARALLEL -could not take sample
nav2_1         | [component_container_isolated-4] 1657038795.042568 [4] component_: DDS reader with topic rt/parameter_events : iceoryx subscriber - TOO_MANY_CHUNKS_HELD_IN_PARALLEL -could not take sample
nav2_1         | [component_container_isolated-4] 1657038795.042714 [4] component_: DDS reader with topic rt/parameter_events : iceoryx subscriber - TOO_MANY_CHUNKS_HELD_IN_PARALLEL -could not take sample
nav2_1         | [component_container_isolated-4] 1657038795.042756 [4] component_: DDS reader with topic rt/parameter_events : iceoryx subscriber - TOO_MANY_CHUNKS_HELD_IN_PARALLEL -could not take sample
nav2_1         | [component_container_isolated-4] 1657038795.145531 [4] component_: DDS reader with topic rt/parameter_events : iceoryx subscriber - TOO_MANY_CHUNKS_HELD_IN_PARALLEL -could not take sample
nav2_1         | [component_container_isolated-4] 1657038795.145619 [4] component_: DDS reader with topic rt/parameter_events : iceoryx subscriber - TOO_MANY_CHUNKS_HELD_IN_PARALLEL -could not take sample
nav2_1         | [component_container_isolated-4] 1657038795.180181 [4] component_: DDS reader with topic rt/parameter_events : iceoryx subscriber - TOO_MANY_CHUNKS_HELD_IN_PARALLEL -could not take sample
nav2_1         | [component_container_isolated-4] 1657038795.202494 [4] component_: Failed to attach iox subscriber to iox listener
nav2_1         | [component_container_isolated-4] [ERROR] [1657038795.206084694] []: Caught exception in callback for transition 10
nav2_1         | [component_container_isolated-4] [ERROR] [1657038795.206105879] []: Original error: could not create service: failed to create reader, at ./src/rmw_node.cpp:4638, at ./src/rcl/service.c:124
nav2_1         | [component_container_isolated-4] [WARN] [1657038795.206128491] []: Error occurred while doing error handling.
nav2_1         | [component_container_isolated-4] [FATAL] [1657038795.206134905] [bt_navigator]: Lifecycle node bt_navigator does not have error state implemented
nav2_1         | [component_container_isolated-4] [ERROR] [1657038795.206432058] [lifecycle_manager_navigation]: Failed to change state for node: bt_navigator
nav2_1         | [component_container_isolated-4] [ERROR] [1657038795.206458312] [lifecycle_manager_navigation]: Failed to bring up all requested nodes. Aborting bringup.

And then I can look at the introspection client and I see this under publishers:

 DDS_CYCLONE      | rcl_interfaces:: | rt/parameter_events   | iceoryx_rt_63_165703879 |                         | INTERNAL
                  |   msg::dds_::Par |                       |   2120795141            |                         |
                  |   ameterEvent_   |                       |                         |                         |

and this under subscribers:

 DDS_CYCLONE      | rcl_interfaces:: | rt/parameter_events   | iceoryx_rt_59_165703879 | iceoryx_rt_59_165703879 | SUBSCRIBED     | WORLDWIDE
                  |   msg::dds_::Par |                       |   2428049023            |   2428049023            |                |
                  |   ameterEvent_   |                       |                         |                         |                |
                  |                  |                       |                         |                         |                |

But shouldn't iceoryx ignore /parameter_events, which is of a non-fixed-size type?

$ ros2 interface show rcl_interfaces/msg/ParameterEvent
# This message contains a parameter event.
# Because the parameter event was an atomic update, a specific parameter name
# can only be in one of the three sets.

# The time stamp when this parameter event occurred.
builtin_interfaces/Time stamp
	int32 sec
	uint32 nanosec

# Fully qualified ROS path to node.
string node

# New parameters that have been set for this node.
Parameter[] new_parameters
	string name
	ParameterValue value
		uint8 type
		bool bool_value
		int64 integer_value
		float64 double_value
		string string_value
		byte[] byte_array_value
		bool[] bool_array_value
		int64[] integer_array_value
		float64[] double_array_value
		string[] string_array_value

# Parameters that have been changed during this event.
Parameter[] changed_parameters
	string name
	ParameterValue value
		uint8 type
		bool bool_value
		int64 integer_value
		float64 double_value
		string string_value
		byte[] byte_array_value
		bool[] bool_array_value
		int64[] integer_array_value
		float64[] double_array_value
		string[] string_array_value

# Parameters that have been deleted during this event.
Parameter[] deleted_parameters
	string name
	ParameterValue value
		uint8 type
		bool bool_value
		int64 integer_value
		float64 double_value
		string string_value
		byte[] byte_array_value
		bool[] bool_array_value
		int64[] integer_array_value
		float64[] double_array_value
		string[] string_array_value

It also says that a history depth of 100 is being requested, but it is actually 1000 for parameter events. Unless that log line is for another topic? And why does it say the capacity is 1, when the default for publishers is 16 (https://github.com/eclipse-iceoryx/iceoryx/blob/master/iceoryx_posh/cmake/IceoryxPoshDeployment.cmake#L56)?

I'm posting this here since I am using Eclipse Cyclone DDS from ROS 2, and I'm assuming that selecting candidate topics for shared memory belongs to CycloneDDS, but maybe I'm incorrect.

Aposhian avatar Jul 05 '22 16:07 Aposhian

@Aposhian What currently happens when SHM is enabled (also with rmw_cyclonedds) is that all types (including dynamic types like string/vector) are transferred using SHM, but dynamic types (those that don't have a fixed size) are serialized into the shared memory. Of course, this incurs serialization overhead, but it might still be beneficial since we bypass the network stack.

It also says that a history depth of 100 is being requested, but it is actually 1000 for parameter events. Unless that log line is for another topic? And why does it say the capacity is 1, when the default for publishers is 16 (https://github.com/eclipse-iceoryx/iceoryx/blob/master/iceoryx_posh/cmake/IceoryxPoshDeployment.cmake#L56)?

When using SHM the history capacity for the iceoryx publisher is configured as follows:

  • with volatile durability, the history capacity is set to 0
  • with transient local, the history capacity is set to what is requested for the history depth. The history depth for transient local is controlled by DurabilityService which has a default history depth of 1.

From the log output, it seems that the publisher is using transient local durability, which is why you see 1 as the history capacity.

I am not sure where the 100 is coming from; probably it is the history depth used when creating the rclcpp::Subscription for ParameterEvents?
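
To make the relation between the two numbers concrete, here is a minimal sketch using the plain iceoryx C++ API directly (not the rmw code path; the service and runtime names are made up, and RouDi must be running):

```cpp
#include <cstdint>

#include "iceoryx_posh/popo/publisher.hpp"
#include "iceoryx_posh/popo/subscriber.hpp"
#include "iceoryx_posh/runtime/posh_runtime.hpp"

struct Event
{
    uint64_t data;
};

int main()
{
    iox::runtime::PoshRuntime::initRuntime("history-mismatch-demo");

    // Publisher side: keep only 1 sample of history, which is what the
    // transient local + DurabilityService(depth = 1) case above maps to.
    iox::popo::PublisherOptions pubOptions;
    pubOptions.historyCapacity = 1U;
    iox::popo::Publisher<Event> publisher({"Demo", "History", "Event"}, pubOptions);

    // Subscriber side: ask for 100 historical samples on connect. RouDi then
    // warns "Chunk history request exceeds history capacity! Request is 100.
    // Capacity is 1." and delivers only what the publisher actually kept.
    iox::popo::SubscriberOptions subOptions;
    subOptions.queueCapacity = 100U;
    subOptions.historyRequest = 100U;
    iox::popo::Subscriber<Event> subscriber({"Demo", "History", "Event"}, subOptions);
}
```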


I think the default number of chunks an application is allowed to hold simultaneously is 256; if the application is holding more samples than that, you get the TOO_MANY_CHUNKS_HELD_IN_PARALLEL error. Is it the case that the application holds on to all the samples of parameter events?

CC @eboasson @MatthiasKillat

sumanth-nirmal avatar Jul 07 '22 07:07 sumanth-nirmal

Thank you for explaining those points. And I am excited to learn that dynamically sized types are supported. Maybe this section of the docs should be updated?

First, the data types to be exchanged need to have a fixed size. This precludes the use of strings and sequences at any level in the data type, though this does not prevent the use of arrays, as their size is fixed at compile time. If any of these types of member variables are encountered in the IDL code generating the data types, shared memory exchange is disabled.

Could you help me understand what it means to hold chunks?

Also, here is the full output of ros2 topic info --verbose /parameter_events when this example system is running. It looks like every ROS node is both publishing to and subscribing to /parameter_events: ros2_topic_info_parameter_events.txt

Aposhian avatar Jul 07 '22 15:07 Aposhian

@Aposhian I did not look at the details of your report yet. As for what it means to hold a chunk when using shared memory:

When a sample is sent, the following happens:

  1. the writer acquires a shared memory chunk
  2. the sample is written to the chunk (zero-copy initialized in-place or copied)
  3. the sample is passed to the reader history cache

Now the reader history cache owns the sample and will keep it until it is taken by the consuming client or evicted in favor of more recent samples (depending on the reader history cache size). Note that when the sample is taken by the user, depending on the API, taking will either perform a copy and immediately return the chunk, or hold on to the chunk (which is then owned by the client). Once the client is done with the sample, it has to return it with return_loan (or similar).
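
For illustration, a minimal sketch of this lifecycle with the plain iceoryx typed C++ API (all names made up, RouDi assumed to be running):

```cpp
#include <cstdint>

#include "iceoryx_posh/popo/publisher.hpp"
#include "iceoryx_posh/popo/subscriber.hpp"
#include "iceoryx_posh/runtime/posh_runtime.hpp"

struct Counter
{
    uint64_t value;
};

int main()
{
    iox::runtime::PoshRuntime::initRuntime("chunk-lifecycle-demo");

    iox::popo::Publisher<Counter> publisher({"Demo", "Chunks", "Counter"});
    iox::popo::Subscriber<Counter> subscriber({"Demo", "Chunks", "Counter"});

    // 1. + 2. acquire a shared memory chunk and construct the sample in place
    publisher.loan().and_then([](auto& sample) {
        sample->value = 42U;
        sample.publish(); // 3. hand the chunk over to the reader side
    });

    // take() returns an RAII handle: the chunk counts as "held" until the
    // handle is destroyed, which is the equivalent of return_loan
    subscriber.take().and_then([](auto& sample) {
        // ... use sample->value while the chunk is held ...
    }); // handle destroyed here -> chunk returned to the pool
}
```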

Now, for technical reasons, iceoryx has a limit on the number of shared memory chunks a client can hold from one reader of a particular topic, which is currently 256. If we try to get any more, we get an internal error in the log (unfortunately not very visible) saying that the particular subscriber cannot provide more chunks (TOO_MANY_CHUNKS_HELD_IN_PARALLEL).

This limit is somewhat artificial: it contains the damage if a consumer takes samples but never returns them, which would otherwise quickly run the system out of memory. Note that this can even happen by accident, since CycloneDDS is essentially a consumer of iceoryx shared memory and the samples in the reader history cache count against this limit. Therefore, if the reader history cache size is > 256 and samples arrive much faster than they are consumed, we may also run into this problem.

We would like to get rid of this limitation or at least mitigate it somewhat, but it is not trivial with the current design.
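
To illustrate the failure mode, here is a deliberately pathological consumer sketch (hypothetical names; the cap itself is a compile-time constant in iceoryx_posh_types.hpp, if I remember correctly):

```cpp
#include <cstdint>
#include <utility>
#include <vector>

#include "iceoryx_posh/popo/subscriber.hpp"

struct Counter
{
    uint64_t value;
};

// Pathological consumer: it keeps every taken sample alive, so each one
// remains a chunk "held in parallel" until the per-subscriber cap is hit.
void holdEverything(iox::popo::Subscriber<Counter>& subscriber,
                    std::vector<iox::popo::Sample<const Counter>>& held)
{
    while (true)
    {
        auto result = subscriber.take();
        if (result.has_error())
        {
            if (result.get_error() == iox::popo::ChunkReceiveResult::TOO_MANY_CHUNKS_HELD_IN_PARALLEL)
            {
                // The limit described above: the subscriber refuses to hand
                // out more chunks until some samples in `held` are destroyed.
            }
            break; // also leaves the loop on NO_CHUNK_AVAILABLE (queue empty)
        }
        held.emplace_back(std::move(result.value()));
    }
}
```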

Let me know whether this explains some of the problems. I think the first step is to make the iceoryx configuration more transparent, but this has downsides as well. Ideally, a middleware user should not need to be concerned with iceoryx at all when using shared memory (which is already not the case, as they need a memory configuration and RouDi).

MatthiasKillat avatar Jul 08 '22 15:07 MatthiasKillat

Hi @sumanth-nirmal

What currently happens when SHM is enabled (also with rmw_cyclonedds) is that all types (including dynamic types like string/vector) are transferred using SHM, but dynamic types (those that don't have a fixed size) are serialized into the shared memory. Of course, this incurs serialization overhead, but it might still be beneficial since we bypass the network stack.

Does this mean that non-fixed-size data can be transferred using iceoryx (zero-copy)? Is it possible for me to transfer Google protobuf (or capnproto)? Related issue: #1327

xixioba avatar Jul 12 '22 01:07 xixioba

Thanks @sumanth-nirmal and @MatthiasKillat for checking this:

Therefore, if the reader history cache size is > 256 and samples arrive much faster than they are consumed, we may also run into this problem.

We would like to get rid of this limitation or at least mitigate it somewhat, but it is not trivial with the current design.

Somehow the 256 sample limit had escaped my attention, and it would have taken me a lot more time to find that detail than it took you. What does worry me in this particular case is that I would not expect:

  • very many parameter updates (I would expect varying data to be modelled as a topic, not as a parameter); and
  • parameter updates to stay in the Cyclone DDS reader history cache for very long (because I would expect the ROS 2 core to handle them promptly, 'cos what else is the usefulness of them?)

@Aposhian, any chance you could find out whether there are a great many parameter updates and/or whether they take forever to be applied?

@xixioba

Does this mean that non-fixed size data can be transferred using iceoryx (zero copy)?

Non-fixed size data can be transferred via iceoryx, but, as @sumanth-nirmal remarked, it won't be zero-copy.

The fixed-size requirement arises from the mapping of sequences and strings to C++: std::vector and std::string rely on pointers, and using them in shared memory would require manipulating them in a way that we can't support via iceoryx. It is not fundamentally impossible, just nearly impossible within standard C++.
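
To illustrate with made-up types: the first struct below is self-contained and can be handed over as raw bytes, the second is not:

```cpp
#include <array>
#include <cstdint>
#include <string>
#include <vector>

// Self-contained: every byte of the sample lives inside the struct itself,
// so the chunk is valid in any process and can be handed over untouched.
struct FixedSizeScan
{
    uint64_t stamp;
    std::array<float, 360> ranges; // size fixed at compile time
};

// Not self-contained: vector and string hold pointers into the heap of the
// process that created them; those addresses are meaningless in another
// process, so the sample has to be serialized into the chunk instead.
struct DynamicScan
{
    uint64_t stamp;
    std::vector<float> ranges;
    std::string frame_id;
};
```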

Is it possible for me to transfer google protobuf(or capnproto)?

Cyclone is built to support alternative representations, but we currently don't have support for transparently shovelling protobuf data around. If you're willing to abuse interfaces a bit and can tolerate a memcpy, I suspect you could use the "serialized" interfaces of ROS 2, but it'd be a hack. We don't currently have the bandwidth to work on proper protobuf support, but I'd be willing to answer questions if you want to try your hand at it.
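
For the sake of illustration, the memcpy-tolerant variant could look roughly like this, sketched with a plain byte-array message rather than the serialized interfaces (the protobuf type is hypothetical; this is a sketch, not a supported path):

```cpp
#include <rclcpp/rclcpp.hpp>
#include <std_msgs/msg/u_int8_multi_array.hpp>

#include "sensor_blob.pb.h" // hypothetical protobuf-generated header

// One-copy hack: serialize the protobuf message into the byte array of an
// ordinary ROS message and let the middleware treat it as opaque bytes.
void publishProto(const rclcpp::Publisher<std_msgs::msg::UInt8MultiArray>::SharedPtr& pub,
                  const demo::SensorBlob& blob) // demo::SensorBlob is made up
{
    std_msgs::msg::UInt8MultiArray msg;
    msg.data.resize(blob.ByteSizeLong());
    blob.SerializeToArray(msg.data.data(), static_cast<int>(msg.data.size()));
    pub->publish(msg);
}
```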

eboasson avatar Jul 12 '22 07:07 eboasson

@eboasson Thanks, I have tried writing test code for protobuf over DDS. In fact, zero-copy support for non-fixed-size data may be more important.

xixioba avatar Jul 12 '22 07:07 xixioba

Looking at the rclcpp source, it looks like, since start_parameter_event_publisher is set to true by default on rclcpp::NodeOptions, a parameter event is published any time Node::declare_parameter or Node::set_parameter is called. Now, when a node loads parameters from a YAML file, I believe it calls Node::declare_parameter for each of those parameters, which will flood the topic on startup. I will create an issue on rclcpp to make sure I am understanding correctly, because that does not seem to be very desirable behavior.
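
As a side note, if the flood itself is the trigger, NodeOptions can switch the per-node parameter event publisher off entirely; a minimal sketch (assuming nothing in the system depends on /parameter_events):

```cpp
#include <memory>

#include <rclcpp/rclcpp.hpp>

int main(int argc, char** argv)
{
    rclcpp::init(argc, argv);

    // Disable publishing to /parameter_events from this node, so declaring
    // or setting parameters no longer emits events on that topic.
    auto options = rclcpp::NodeOptions().start_parameter_event_publisher(false);
    auto node = std::make_shared<rclcpp::Node>("my_node", options); // name is made up

    rclcpp::spin(node);
    rclcpp::shutdown();
}
```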

Aposhian avatar Jul 12 '22 12:07 Aposhian

~~Looking closer, this flooding behavior on startup is from Nav2 setting automatically_declare_parameters_from_overrides to true on all of its nodes.~~

EDIT: Nav2 does not set automatically_declare_parameters_from_overrides to true on all of its nodes (but I was confused because of https://github.com/ros-planning/navigation2/issues/3072). Here are the messages that were received from my example on startup. parameter_events_messages.txt

I now see that the messages received were for parameters that were declared inside the nodes but weren't present in the YAML.

Aposhian avatar Jul 12 '22 12:07 Aposhian

What does worry me in this particular case is that I would not expect:

  • very many parameter updates (I would expect varying data to be modelled as a topic, not as a parameter); and
  • parameter updates to stay in the Cyclone DDS reader history cache for very long (because I would expect the ROS 2 core to handle them promptly, 'cos what else is the usefulness of them?)

@eboasson What that means is that this example should be typical of a large ROS 2 system: large numbers of publications to /parameter_events on startup. And I don't even know whether there are any readers (subscriptions) on /parameter_events in this system. So wouldn't it be the writer cache that is getting filled up, rather than the reader cache?

Aposhian avatar Jul 12 '22 12:07 Aposhian

Related: https://github.com/ros2/rclcpp/issues/1970

Aposhian avatar Jul 12 '22 13:07 Aposhian

This thread has branched significantly from its original purpose, with quite a few details and maybe-problems that are not all in Cyclone. Therefore, you should not interpret this close as an "everything is fine" but as a "this ticket cannot be dealt with in this state". Please open separate issues for actual bugs and questions about Cyclone DDS.

thijsmie avatar Dec 07 '22 09:12 thijsmie