rmw_fastrtps

Running a subscriber on Humble with publishing from Jazzy / Rolling uses up all memory

Open urfeex opened this issue 11 months ago • 7 comments

Bug report

While I am aware that inter-distribution traffic isn't supported, I would at least expect systems not to crash when it occurs. However, we noticed that running both Humble and Jazzy nodes on the same network can cause the machines running Humble nodes to run out of memory, probably because of discovery traffic. It is sufficient to have a Humble subscriber and run ros2 topic list from Jazzy to (sometimes) trigger the issue.

Required Info:

  • Operating System: Ubuntu 22.04 / 24.04
  • Installation type: binary
  • Version or commit hash: latest
  • DDS implementation:
    • default -> fastrtps
  • Client library (if applicable):
    • Checked using rclpy

Steps to reproduce issue

The following docker-compose file illustrates the issue. Running this will make your system run out of memory!!!

---
version: '2'

networks:
  rosdocker:
    driver: bridge

services:
  talker:
    image: ros:jazzy
    container_name: lister
    hostname: lister
    networks:
      - rosdocker
    environment:
      - "ROS_DOMAIN_ID=13"
    command: "ros2 topic list"
    restart: always  # It seems not to happen every time, hence the restart

  listener:
    image: ros:humble
    container_name: listener
    hostname: listener
    networks:
      - rosdocker
    environment:
      - "ROS_DOMAIN_ID=13"
    command: "ros2 topic echo /chatter std_msgs/String"

Expected behavior

Error messages, silently ignored traffic, or communication that magically just works. Note: As said earlier, I do not expect cross-distro communication to simply work, but I would expect stable behavior.

Actual behavior

The ros2 topic echo seems to cause unlimited memory consumption and makes the system run OOM in a matter of seconds.

Additional information

  • We haven't investigated further down into the RMW code, but we do see the error output caught in https://github.com/ros2/rmw_fastrtps/pull/737 on the humble node.
  • I have tried the same thing with rmw_cyclonedds_cpp, which throws warnings on the list and errors on the echo but doesn't crash or end up in OOM. So, this seems to be an rmw_fastrtps_cpp issue.

urfeex avatar Jan 15 '25 09:01 urfeex

@urfeex thanks for creating the issue.

> Note: As said earlier, I do not expect cross-distro communication to simply work, but I would expect stable behavior.

as you already mentioned, cross-distro communication is not supported (interfaces are not guaranteed to be compatible, including breaking ABI/API changes). that said, unfortunately we do not really expect stable behavior either...

you can keep this open, but i will not investigate this issue any further.

fujitatomoya avatar Jan 15 '25 17:01 fujitatomoya

As I said earlier, this issue is not about cross-distro communication. It's about taking down a complete computer as soon as some random jazzy / rolling node shows up on the network sending discovery messages.

In my opinion it would be fine to silently ignore those, print errors all over the place, whatever. But having a PC use up all memory does not seem like something that should be considered "just not supported". It was my impression that #737 was also created with the motivation to prevent crashes in that scenario.

urfeex avatar Jan 17 '25 09:01 urfeex

> As I said earlier, this issue is not about cross-distro communication.

i think what you mean is the application data-plane.

> as soon as some random jazzy / rolling node shows up on the network sending discovery messages.

ROS 2 already communicates during discovery, so cross-distro communication is taking place in discovery to establish the endpoint connectivity.

> In my opinion it would be fine to silently ignore those, print errors all over the place, whatever. But having a PC use up all memory seems not like something that should be considered "just not supported".

good point, totally agree with this.

one thing i would like to ask you as a possible work-around. can you set the different ROS_DOMAIN_ID for jazzy and rolling? https://docs.ros.org/en/eloquent/Tutorials/Configuring-ROS2-Environment.html#the-ros-domain-id-variable

this should provide a logical partition for the discovery process, which means no discovery between jazzy and rolling at all.
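
As a rough illustration of why different domain IDs normally keep discovery traffic apart: DDS implementations derive their UDP ports from the domain ID. The sketch below uses the default port-mapping constants from the DDSI-RTPS specification (PB=7400, DG=250, PG=2, d0..d3); it assumes neither side overrides those defaults, and the function name rtps_ports is made up for this example.

```python
# Default RTPS port-mapping constants (assumed not to be overridden by a profile).
PB, DG, PG = 7400, 250, 2       # port base, domain gain, participant gain
D0, D1, D2, D3 = 0, 10, 1, 11   # offsets for the four port kinds


def rtps_ports(domain_id: int, participant_id: int = 0) -> dict:
    """Return the default RTPS UDP ports for a domain and participant index."""
    base = PB + DG * domain_id
    return {
        "discovery_multicast": base + D0,
        "discovery_unicast": base + D1 + PG * participant_id,
        "user_multicast": base + D2,
        "user_unicast": base + D3 + PG * participant_id,
    }
```

For example, rtps_ports(13) and rtps_ports(14) share no port, so participants in those two domains would not normally receive each other's discovery packets.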

fujitatomoya avatar Jan 17 '25 16:01 fujitatomoya

> ROS 2 already communicates during discovery, so cross-distro communication is taking place in discovery to establish the endpoint connectivity.

Yes, that is clear to me. What I wanted to say is: We do not try to actively do any cross-distro communication or expect any cross-distro communication to work. We just want systems not to go down because a participant with the same domain ID becomes active on the same network. But I think that has become clear by now :-)

> one thing i would like to ask you as a possible work-around. can you set the different ROS_DOMAIN_ID for jazzy and rolling?

Yes, setting the ROS_DOMAIN_ID has been identified as a workaround already, I should have mentioned that. Unfortunately, this only makes it less likely to happen.

> > In my opinion it would be fine to silently ignore those, print errors all over the place, whatever. But having a PC use up all memory seems not like something that should be considered "just not supported".
>
> good point, totally agree with this.

Does that mean you think searching for a solution for this might be the way to go? Can we support this in any way? I cannot promise any resources at the moment, though.

urfeex avatar Jan 17 '25 16:01 urfeex

> What I wanted to say is: We do not try to actively do any cross-distro communication or expect any cross-distro communication to work. We just want systems not to go down because a participant with the same domain ID becomes active on the same network.

yeah, silently causing the problem is not a good user experience. if this is not supported, refusing the connection or printing a warning would be much better for the user.

> Yes, setting the ROS_DOMAIN_ID has been identified as a workaround already, I should have mentioned that.

no worries, good to know that works.

> Does that mean you think searching for a solution for this might be the way to go?

i do not think so, as far as i know nobody is planning to work on that.

fujitatomoya avatar Jan 17 '25 17:01 fujitatomoya

We discussed this recently, and we came up with a couple of different ideas:

  1. Humble does not have type hashes as part of types, while Iron and later do have type hashes. It may be possible for Humble to detect that type hashes exist, and if that is the case, refuse to connect to a peer.
  2. Following on from above, it may be possible for Jazzy and later to detect peers without type hashes, and refuse to connect to them.
  3. Going forward, it might be a good idea to make discovery localhost-only by default. We actually have all of the configuration options available to make this happen (see https://github.com/ros2/ros2/issues/1359), but the last time we tried to enable it by default it failed some tests. If we can fix those tests, then making things localhost-only by default may be a good idea.
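
To make ideas 1 and 2 concrete, here is a minimal sketch of the matching decision. It is an illustration only, not a description of any merged change. It assumes the type hash reaches the peer as a `typehash=RIHS01_<64 hex digits>;` entry in the endpoint's USER_DATA, and the helper names peer_type_hash and should_match are hypothetical.

```python
import re

# Assumed wire format of the type-hash entry in discovery USER_DATA.
TYPE_HASH_RE = re.compile(r"typehash=(RIHS\d{2}_[0-9a-f]{64});")


def peer_type_hash(user_data: str):
    """Return the peer's type hash string, or None if it advertises none."""
    match = TYPE_HASH_RE.search(user_data)
    return match.group(1) if match else None


def should_match(local_uses_type_hashes: bool, peer_user_data: str) -> bool:
    """Refuse a peer whose type-hash support differs from the local side.

    Idea 1: a Humble-like node (no hashes) refuses peers that send a hash.
    Idea 2: a Jazzy-like node (hashes) refuses peers that send none.
    """
    peer_has_hash = peer_type_hash(peer_user_data) is not None
    return local_uses_type_hashes == peer_has_hash
```

With this rule, a Humble-like node rejects any peer that advertises a hash, and a Jazzy-like node rejects any peer that advertises none, so neither side goes on to exchange data with an incompatible distribution.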

Finally, following up on 3), it should be enough here to set ROS_AUTOMATIC_DISCOVERY_RANGE=LOCALHOST to work around the problem.
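
A small sketch of how that setting could be applied when launching a process from Python. The accepted range values (SUBNET, LOCALHOST, OFF, SYSTEM_DEFAULT) and the semicolon-separated ROS_STATIC_PEERS variable are taken from the ROS 2 discovery documentation for Iron and later; as far as I know Humble does not read these variables and has ROS_LOCALHOST_ONLY instead. The helper name discovery_env is made up for this example.

```python
import os

# Values documented for ROS_AUTOMATIC_DISCOVERY_RANGE (Iron and later).
VALID_RANGES = {"SUBNET", "LOCALHOST", "OFF", "SYSTEM_DEFAULT"}


def discovery_env(discovery_range="LOCALHOST", static_peers=(), base_env=None):
    """Return a copy of the environment with the discovery range applied."""
    if discovery_range not in VALID_RANGES:
        raise ValueError(f"unknown discovery range: {discovery_range!r}")
    env = dict(os.environ if base_env is None else base_env)
    env["ROS_AUTOMATIC_DISCOVERY_RANGE"] = discovery_range
    if static_peers:
        # Peers listed here are still contacted regardless of the range.
        env["ROS_STATIC_PEERS"] = ";".join(static_peers)
    return env
```

Typical use would be `subprocess.run(["ros2", "topic", "list"], env=discovery_env())`, which restricts that one process to localhost discovery without touching the parent shell.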

clalancette avatar Jan 30 '25 18:01 clalancette

Thank you for following up on this. Refusing connections from Jazzy or later on Humble sounds like exactly what we would need.

Regarding ROS_AUTOMATIC_DISCOVERY_RANGE=LOCALHOST: That is not really an option for our use case as we do require communication with other network participants. If we had the same discovery range options available as on Jazzy and later that might be a suitable mitigation in many cases, as we require ROS communication mostly on an internal network rather than the one facing the external world.

Nevertheless, we do offer the option to make the ROS interfaces available externally in which case the devices would be vulnerable to the problem explained above and restricting the discovery range to a certain subnet or similar will not be suitable.

Summing up: if 1) and 2) from https://github.com/ros2/rmw_fastrtps/issues/797#issuecomment-2625280580 could be implemented that would help tremendously.

urfeex avatar Jan 31 '25 05:01 urfeex

This issue has been mentioned on ROS Discourse. There might be relevant details there:

https://discourse.ros.org/t/incompatability-between-distributions/43747/8

ros-discourse avatar May 15 '25 08:05 ros-discourse

See https://github.com/ros2/rosidl_typesupport_fastrtps/pull/130 for a potential fix

MiguelCompany avatar May 21 '25 10:05 MiguelCompany

Use the same DDS profile in the Humble and Jazzy environments:

environment:
  - RMW_IMPLEMENTATION=rmw_cyclonedds_cpp
  - CYCLONEDDS_URI=file:///dds_ws/cyclone_dds_profile.xml

debanik123 avatar Aug 29 '25 06:08 debanik123

@debanik123 This issue is with rmw_fastrtps_cpp, so your comment does not apply.

@urfeex has already checked that ros2/rosidl_typesupport_fastrtps#133 would fix the issue

MiguelCompany avatar Aug 29 '25 06:08 MiguelCompany

Closing this now that https://github.com/ros2/rosidl_typesupport_fastrtps/pull/130 and all related PRs have been merged and backported to all the supported distributions.

Feel free to reopen if still an issue.

MiguelCompany avatar Sep 03 '25 05:09 MiguelCompany