ROS2 image topic publisher freezes when a subscriber is initiated at a remote PC
Bug report
My aim was to send a video stream as a ROS topic to a remote PC (Raspberry Pi 4). To do this, I published an image stream using the cam2image node of the image_tools package.
On the remote PC (which was connected over WiFi), when I started the subscriber using the showimage node of the same package, the cam2image node froze, as shown in this video.
This issue was posted here before, and from that thread I learned that this behavior doesn't occur with Cyclone DDS. Hence, I am posting it here.
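A quick way to repeat that cross-check against Cyclone DDS is to switch the RMW through an environment variable; a minimal sketch, assuming the ros-foxy-rmw-cyclonedds-cpp package is installed on both machines:
export RMW_IMPLEMENTATION=rmw_cyclonedds_cpp
ros2 run image_tools cam2image --ros-args -p burger_mode:=true -p frequency:=10. -p reliability:=best_effort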
Required Info:
- Operating System:
  - Ubuntu 20.04 (in docker image: ros:foxy-ros-base)
- DDS implementation:
  - Fast-RTPS
Steps to reproduce issue
Install image_tools on both the host and the remote PC:
sudo apt update
sudo apt install ros-foxy-image-tools
On the host PC:
ros2 run image_tools cam2image --ros-args -p burger_mode:=true -p frequency:=10. -p reliability:=best_effort
On the remote PC:
ros2 run image_tools showimage --ros-args -p show_image:=false -p reliability:=best_effort
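The actual publish/receive rate can be cross-checked on either machine with ros2 topic hz (the topic name image below is assumed from the cam2image defaults):
ros2 topic hz /image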
Expected behavior
The host PC publishes at 10 Hz, and the remote PC subscribes at the same frequency.
Actual behavior
The publisher node freezes when the subscriber is initiated.
Additional information
Also, the publisher node resumes normal operation once the subscriber node is killed.
Hi @atb033,
I have tried to reproduce your issue following your steps without success. The following configurations were used:
- Both applications running locally on the same machine (Ubuntu 20.04).
- Each application running on a different machine (Ubuntu 20.04), connected over WiFi. This configuration was also tried with one machine running the application within the ros:foxy-ros-base docker image.
- Finally, both applications running inside the docker image, using as the remote PC a Raspberry Pi 4 connected by WiFi to the host PC.
Could you please provide us some more information so we can continue to study your issue?
Hi @JLBuenoLopez-eProsima,
One point that I forgot to mention is that my host PC is actually running Ubuntu 16.04 natively. The ros:foxy-ros-base docker image (which runs on that host) is based on Ubuntu 20.04. Do you think this could have any influence on this behavior?
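A side note: a Docker container shares the host kernel (and, with --net=host, also the host network stack), so the Ubuntu 16.04 host is not fully abstracted away by the Ubuntu 20.04 image. A quick sanity check from inside the container, as a sketch:
uname -r                         # reports the host kernel version
grep VERSION_ID /etc/os-release  # reports 20.04 inside the ros:foxy-ros-base image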
@JLBuenoLopez-eProsima
I can reproduce this issue with:
- Raspi4 / Ubuntu20.04 docker with foxy or source build. (showimage as subscriber)
- x86_64 / Ubuntu20.04 docker with foxy or source build. (cam2image as publisher)
note: https://github.com/ros2/rclcpp/issues/1335#issuecomment-705579688
Thanks @fujitatomoya, @atb033
I will keep trying. Could you please provide me with the OSes that are running on both machines? Yesterday I tried with:
- Raspi4 / Ubuntu20.04 OS / Ubuntu20.04 docker / foxy binaries
- x86_64 / Ubuntu20.04 OS / Ubuntu20.04 docker / foxy binaries
I am installing a virtual machine with Ubuntu 16.04 to run the docker image (following the last comment from @atb033) and try again, but I am unsure about the Raspi4 setup that you are using.
Thanks for your help!
I use an Ubuntu 20.04 container (--net=host) on an Ubuntu 20.04 physical environment, without virtual machines, for both the remote and the host PC.
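For reference, a host-networked container of this kind can be started roughly like this (the exact flags below are an assumption, not the precise command from this setup):
docker run -it --rm --net=host ros:foxy-ros-base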
Thanks again for the information, @fujitatomoya!
I have tried that same configuration with foxy binaries and I have been unable to reproduce the issue. I am trying with sources just in case.
Could you please provide a network traffic capture of the RTPS packets at both ends so we can try to analyze what is happening? Thanks again!
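Such a capture can be collected with tcpdump on each machine, for example (the wireless interface name wlan0 below is an assumption; adjust it to your setup):
sudo tcpdump -i wlan0 -w rtps_capture.pcap udp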
> Could you please provide a network traffic capture of the RTPS packets at both ends so we can try to analyze what is happening?
Since this environment is secured, I am afraid I cannot do this, sorry. @atb033, how about you?
@JLBuenoLopez-eProsima
The following is my setup:
- Raspi4 / Ubuntu20.04 OS / Ubuntu20.04 docker / foxy binaries
- x86_64 / Ubuntu16.04 OS / Ubuntu20.04 docker / foxy binaries
I was able to replicate the bug with this setup again, and I am attaching the network traffic capture here:
Thanks, @atb033, for sending us the traffic capture!
It seems that the network is being overloaded; the evidence is the ICMP Destination unreachable (Port unreachable) packet that is received after the DataWriter is matched with the DataReader and the images start to be sent.
This network overload also seems to be causing the DataWriter's write operation to enter a deadlock, which matches the description of your issue. This could be explained as follows: by default, ROS 2 configures the DataWriter publish mode as ASYNCHRONOUS, as explained here. Consequently, the asynchronous thread waits until the write operation is finished. At the same time, the sending buffer could be completely full, and the write operation is probably waiting for space in the buffer to write the new data.
Therefore, could you first try setting the DataWriter publish mode to SYNCHRONOUS and tell us if this is enough to fix your issue? Without an asynchronous thread that could deadlock on the write operation, the DataWriter should not stop publishing even if the network is overloaded.
If this is not enough, we advise you to set the non_blocking_send flag. By default this flag is set to false and send operations will block until the network buffer has space for the datagram. Setting the flag to true makes the send operation return immediately if the buffer is full, so the application behaves as if the datagram had been sent and then lost (more information in our documentation).
Finally, you may consider decreasing the maxMessageSize of your transport, reducing the size of your packets and preventing IP fragmentation from having to handle such large packets (as explained here).
Please, let us know if this is enough to solve your issue.
@JLBuenoLopez-eProsima
Thanks for the input. I can't implement these immediately as I am caught up with some other work at this moment.
I'll get back to you soon after testing these out.
Hey @JLBuenoLopez-eProsima
I tested all the approaches that you recommended, and the problem still persists. The following are the settings that I used. Could you please go through them and tell me whether I have done it correctly?
- SYNCHRONOUS publish mode
<!-- SYNCHRONOUS mode -->
<dds xmlns="http://www.eprosima.com/XMLSchemas/fastRTPS_Profiles">
<profiles>
<publisher profile_name="publisher profile" is_default_profile="true">
<qos>
<publishMode>
<kind>SYNCHRONOUS</kind>
</publishMode>
</qos>
<historyMemoryPolicy>PREALLOCATED_WITH_REALLOC</historyMemoryPolicy>
</publisher>
<subscriber profile_name="subscriber profile" is_default_profile="true">
<historyMemoryPolicy>PREALLOCATED_WITH_REALLOC</historyMemoryPolicy>
</subscriber>
</profiles>
</dds>
- non_blocking_send set to true
<!-- non-blocking true mode -->
<dds xmlns="http://www.eprosima.com/XMLSchemas/fastRTPS_Profiles">
<profiles>
<transport_descriptors>
<transport_descriptor>
<transport_id>test</transport_id>
<type>UDPv4</type>
<non_blocking_send>true</non_blocking_send>
</transport_descriptor>
</transport_descriptors>
<publisher profile_name="publisher profile" is_default_profile="true">
<historyMemoryPolicy>PREALLOCATED_WITH_REALLOC</historyMemoryPolicy>
</publisher>
<subscriber profile_name="subscriber profile" is_default_profile="true">
<historyMemoryPolicy>PREALLOCATED_WITH_REALLOC</historyMemoryPolicy>
</subscriber>
</profiles>
</dds>
- Reducing maxMessageSize
<!-- Reduce maxMessageSize -->
<dds xmlns="http://www.eprosima.com/XMLSchemas/fastRTPS_Profiles">
<profiles>
<transport_descriptors>
<transport_descriptor>
<transport_id>test</transport_id>
<type>UDPv4</type>
<maxMessageSize>5500</maxMessageSize>
</transport_descriptor>
</transport_descriptors>
<publisher profile_name="publisher profile" is_default_profile="true">
<historyMemoryPolicy>PREALLOCATED_WITH_REALLOC</historyMemoryPolicy>
</publisher>
<subscriber profile_name="subscriber profile" is_default_profile="true">
<historyMemoryPolicy>PREALLOCATED_WITH_REALLOC</historyMemoryPolicy>
</subscriber>
</profiles>
</dds>
Thanks @atb033,
First, I would like to know if you set the environment variable RMW_FASTRTPS_USE_QOS_FROM_XML to 1 when you used the synchronous option. As explained in the README, if this variable is not set, the History memory policy and Publishing mode are not going to be read from the XML file and instead, the ones preconfigured in the rmw layer are going to be used.
Therefore, be sure that you run the following commands:
- Publisher side:
FASTRTPS_DEFAULT_PROFILES_FILE=<path_to_xml_file> RMW_FASTRTPS_USE_QOS_FROM_XML=1 RMW_IMPLEMENTATION=rmw_fastrtps_cpp ros2 run image_tools cam2image --ros-args -p burger_mode:=true -p frequency:=10. -p reliability:=best_effort
- Subscriber side:
FASTRTPS_DEFAULT_PROFILES_FILE=<path_to_xml_file> RMW_FASTRTPS_USE_QOS_FROM_XML=1 RMW_IMPLEMENTATION=rmw_fastrtps_cpp ros2 run image_tools showimage --ros-args -p show_image:=false -p reliability:=best_effort
NOTE: the FASTRTPS_DEFAULT_PROFILES_FILE environment variable does not need to be set if the XML file is in the working directory under the name DEFAULT_FASTRTPS_PROFILES.xml.
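If the variables are exported rather than set inline as above, a quick sanity check that they are visible to the shell launching the node is (a sketch):
env | grep -E 'RMW_IMPLEMENTATION|RMW_FASTRTPS_USE_QOS_FROM_XML|FASTRTPS_DEFAULT_PROFILES_FILE'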
On the other hand, even though you are setting the options for a new transport, you are not linking this custom transport to your participant. You can use the following XML file, where all three suggested options have been included:
<dds xmlns="http://www.eprosima.com/XMLSchemas/fastRTPS_Profiles">
<profiles>
<transport_descriptors>
<transport_descriptor>
<transport_id>test</transport_id>
<type>UDPv4</type>
<maxMessageSize>5500</maxMessageSize>
<non_blocking_send>true</non_blocking_send>
</transport_descriptor>
</transport_descriptors>
<participant profile_name="participant profile" is_default_profile="true">
<rtps>
<userTransports>
<transport_id>test</transport_id>
</userTransports>
<useBuiltinTransports>false</useBuiltinTransports>
</rtps>
</participant>
<publisher profile_name="publisher profile" is_default_profile="true">
<qos>
<publishMode>
<kind>SYNCHRONOUS</kind>
</publishMode>
</qos>
<historyMemoryPolicy>PREALLOCATED_WITH_REALLOC</historyMemoryPolicy>
</publisher>
<subscriber profile_name="subscriber profile" is_default_profile="true">
<historyMemoryPolicy>PREALLOCATED_WITH_REALLOC</historyMemoryPolicy>
</subscriber>
</profiles>
</dds>
I hope that this solves your issue.
Finally, if you do not mind, it would be helpful if you could try each option sequentially and tell us whether it is enough to solve the issue:
- Remove the custom transport and try only the synchronous publishing mode.
- Use the synchronous publishing mode with the non_blocking_send option.
- Use the previous configuration, adding the maxMessageSize option to the custom transport.
This will provide us more information about your issue, as we have been unable to reproduce it.
It has been 3 years, so I checked whether this is still reproducible with the current container images ros:rolling, ros:iron and ros:humble. It turned out I cannot reproduce this issue with those containers, so I will go ahead and close this issue.
Please feel free to reopen the issue if you still have the problem.
Test Platform
- Raspi4 / Host Ubuntu 20.04, docker ros:rolling, ros:iron and ros:humble (showimage as subscriber)
- x86_64 / Host Ubuntu 20.04, docker ros:rolling, ros:iron and ros:humble (cam2image as publisher)
Result
Subscription running on Raspi4 keeps receiving image data, and publisher does not freeze when the subscriber is initiated.
Console Output
root@tomoyafujita:/# ros2 run image_tools cam2image --ros-args -p burger_mode:=true -p frequency:=10. -p reliability:=best_effort
...<snip>
[INFO] [1694125748.789014474] [cam2image]: Publishing image #3647
[INFO] [1694125748.889001996] [cam2image]: Publishing image #3648
[INFO] [1694125748.989015319] [cam2image]: Publishing image #3649
[INFO] [1694125749.089007292] [cam2image]: Publishing image #3650
[INFO] [1694125749.189011144] [cam2image]: Publishing image #3651
^C[INFO] [1694125749.261280833] [rclcpp]: signal_handler(signum=2)
root@raspi4-1:/# ros2 run image_tools showimage --ros-args -p show_image:=false -p reliability:=best_effort
...<snip>
[INFO] [1694125747.596249367] [showimage]: Received image #camera_frame
Received image #camera_frame
[INFO] [1694125747.696080180] [showimage]: Received image #camera_frame
Received image #camera_frame
[INFO] [1694125747.795356557] [showimage]: Received image #camera_frame
Received image #camera_frame
[INFO] [1694125747.896058764] [showimage]: Received image #camera_frame
Received image #camera_frame
[INFO] [1694125747.996081833] [showimage]: Received image #camera_frame
Received image #camera_frame
It has been quite a while, that's right, but unfortunately the issue persists. Publishing a sensor_msgs/Image message (same behavior on ROS 2 Foxy and Iron) without any subscribers works at ~20 Hz, which is the expected value. Using the image viewer in rqt does not in fact slow down the publication rate, which is verified by ros2 topic hz /image_raw and by the video smoothness in a locally run rqt.
Just creating a subscription on another PC (MATLAB 2024a with ROS Humble) located in the same local network does in fact slow down the publication rate, which can again be verified using the same two methods: the frequency drops to about 1.5 Hz and there is a significant drop in smoothness.
Deleting the subscriber restores the smoothness and the original frequency. The QoS reliability on both the publisher and the subscriber is set to best effort; however, I also checked another option where the publisher is configured using rclcpp::SensorDataQoS, with the queue size set to 1 or 100 on both ends, without any effect.
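Since the earlier analysis pointed at the link being overloaded, it may also help to measure how much bandwidth the image topic actually consumes; a sketch, using the topic name mentioned above:
ros2 topic bw /image_raw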