ROS2 image topic publisher freezes when a subscriber is initiated at a remote PC
Bug report
My aim was to send a video stream as a ROS topic to a remote PC (Raspberry Pi 4). To do this, I published an image stream using the cam2image node of the image_tools package.
On the remote PC (which was connected over WiFi), when I started the subscriber using the showimage node of the same package, the cam2image node froze, as shown in this video.
This issue was posted here before, and from that thread I learned that this behavior doesn't occur with Cyclone DDS. Hence, I am posting it here.
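A quick way to repeat that cross-check against Cyclone DDS is to switch the RMW through an environment variable; a minimal sketch, assuming the ros-foxy-rmw-cyclonedds-cpp package is installed on both machines:
export RMW_IMPLEMENTATION=rmw_cyclonedds_cpp
ros2 run image_tools cam2image --ros-args -p burger_mode:=true -p frequency:=10. -p reliability:=best_effort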
Required Info:
- Operating System:
  - Ubuntu 20.04 (in docker image: ros:foxy-ros-base)
- DDS implementation:
  - Fast-RTPS
Steps to reproduce issue
Install image_tools on both the host and the remote PC:
sudo apt update
sudo apt install ros-foxy-image-tools
On the host PC:
ros2 run image_tools cam2image --ros-args -p burger_mode:=true -p frequency:=10. -p reliability:=best_effort
On the remote PC:
ros2 run image_tools showimage --ros-args -p show_image:=false -p reliability:=best_effort
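The actual publish/receive rate can be cross-checked on either machine with ros2 topic hz (the topic name image below is assumed from the cam2image defaults):
ros2 topic hz /image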
Expected behavior
The host PC publishes at 10 Hz, and the remote PC subscribes at the same frequency.
Actual behavior
The publisher node freezes when the subscriber is initiated.
Additional information
Also, the publisher node resumes normal operation once the subscriber node is killed.
Hi @atb033,
I have tried to reproduce your issue following your steps without success. The following configurations were used:
- Both applications running locally on the same machine (Ubuntu 20.04).
- Each application running on a different machine (Ubuntu 20.04), connected over WiFi. This configuration was also tried with one machine running the application within the ros:foxy-ros-base docker image.
- Finally, both applications running inside the docker image, using as the remote PC a Raspberry Pi 4 connected by WiFi to the host PC.
Could you please provide us some more information so we can continue to study your issue?
Hi @JLBuenoLopez-eProsima,
One point that I forgot to mention is that my host PC is actually running Ubuntu 16.04 natively. The ros:foxy-ros-base docker image (which runs on that host) is based on Ubuntu 20.04. Do you think this could have any influence on this behavior?
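A side note: a Docker container shares the host kernel (and, with --net=host, also the host network stack), so the Ubuntu 16.04 host is not fully abstracted away by the Ubuntu 20.04 image. A quick sanity check from inside the container, as a sketch:
uname -r                         # reports the host kernel version
grep VERSION_ID /etc/os-release  # reports 20.04 inside the ros:foxy-ros-base image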
@JLBuenoLopez-eProsima
I can reproduce this issue with:
- Raspi4 / Ubuntu20.04 docker with foxy or source build. (showimage as subscriber)
- x86_64 / Ubuntu20.04 docker with foxy or source build. (cam2image as publisher)
note: https://github.com/ros2/rclcpp/issues/1335#issuecomment-705579688
Thanks @fujitatomoya, @atb033
I will keep trying. Could you please provide me with the OSes that are running on both machines? Yesterday I tried with:
- Raspi4 / Ubuntu20.04 OS / Ubuntu20.04 docker / foxy binaries
- x86_64 / Ubuntu20.04 OS / Ubuntu20.04 docker / foxy binaries
I am installing a virtual machine with Ubuntu 16.04 to run the docker image (following the last comment from @atb033) and try again, but I am unsure about the Raspi4 setup that you are using.
Thanks for your help!
I use an Ubuntu 20.04 container (--net=host) on an Ubuntu 20.04 physical environment, without virtual machines, for both the remote and the host PC.
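For reference, a host-networked container of this kind can be started roughly like this (the exact flags below are an assumption, not the precise command from this setup):
docker run -it --rm --net=host ros:foxy-ros-base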
Thanks again for the information, @fujitatomoya!
I have tried that same configuration with foxy binaries and I have been unable to reproduce the issue. I am trying with sources just in case.
Could you please provide a network traffic capture of the RTPS packets at both ends so we can try to analyze what is happening? Thanks again!
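Such a capture can be collected with tcpdump on each machine, for example (the wireless interface name wlan0 below is an assumption; adjust it to your setup):
sudo tcpdump -i wlan0 -w rtps_capture.pcap udp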
> Could you please provide a network traffic capture of the RTPS packets at both ends so we can try to analyze what is happening?
Since this environment is secured, I am afraid I cannot do this, sorry. @atb033, how about you?
@JLBuenoLopez-eProsima
The following is my setup:
- Raspi4 / Ubuntu20.04 OS / Ubuntu20.04 docker / foxy binaries
- x86_64 / Ubuntu16.04 OS / Ubuntu20.04 docker / foxy binaries
I was able to replicate the bug with this setup again, and I am attaching the network traffic capture here:
Thanks, @atb033, for sending us the traffic capture!
It seems that the network is being overloaded; the evidence is the ICMP Destination unreachable (Port unreachable) packet that is received after the DataWriter is matched with the DataReader and the images start to be sent.
This network overload also seems to be causing the DataWriter's write operation to enter a deadlock, which matches the description of your issue. This could be explained as follows: by default, ROS 2 configures the DataWriter publish mode as ASYNCHRONOUS, as explained here. Consequently, the asynchronous thread waits until the write operation is finished. At the same time, the sending buffer could be completely full, and the write operation is probably waiting for space in the buffer to write the new data.
Therefore, could you first try setting the DataWriter publish mode to SYNCHRONOUS and tell us if this is enough to fix your issue? Without an asynchronous thread that could deadlock on the write operation, the DataWriter should not stop publishing even if the network is overloaded.
If this is not enough, we advise you to set the non_blocking_send flag. By default this flag is set to false and send operations will block until the network buffer has space for the datagram. Setting the flag to true makes the send operation return immediately if the buffer is full, so the application behaves as if the datagram had been sent and then lost (more information in our documentation).
Finally, you may consider decreasing the maxMessageSize of your transport, reducing the size of your packets and preventing IP fragmentation from having to handle such large packets (as explained here).
Please, let us know if this is enough to solve your issue.
@JLBuenoLopez-eProsima
Thanks for the input. I can't implement these immediately as I am caught up with some other work at this moment.
I'll get back to you soon after testing these out.
Hey @JLBuenoLopez-eProsima
I tested all the approaches that you recommended, and the problem still persists. The following are the settings that I used. Could you please go through them and tell me whether I have done it correctly?
- SYNCHRONOUS publish mode
<!-- SYNCHRONOUS mode -->
<dds xmlns="http://www.eprosima.com/XMLSchemas/fastRTPS_Profiles">
<profiles>
<publisher profile_name="publisher profile" is_default_profile="true">
<qos>
<publishMode>
<kind>SYNCHRONOUS</kind>
</publishMode>
</qos>
<historyMemoryPolicy>PREALLOCATED_WITH_REALLOC</historyMemoryPolicy>
</publisher>
<subscriber profile_name="subscriber profile" is_default_profile="true">
<historyMemoryPolicy>PREALLOCATED_WITH_REALLOC</historyMemoryPolicy>
</subscriber>
</profiles>
</dds>
- non_blocking_send set to true
<!-- non-blocking true mode -->
<dds xmlns="http://www.eprosima.com/XMLSchemas/fastRTPS_Profiles">
<profiles>
<transport_descriptors>
<transport_descriptor>
<transport_id>test</transport_id>
<type>UDPv4</type>
<non_blocking_send>true</non_blocking_send>
</transport_descriptor>
</transport_descriptors>
<publisher profile_name="publisher profile" is_default_profile="true">
<historyMemoryPolicy>PREALLOCATED_WITH_REALLOC</historyMemoryPolicy>
</publisher>
<subscriber profile_name="subscriber profile" is_default_profile="true">
<historyMemoryPolicy>PREALLOCATED_WITH_REALLOC</historyMemoryPolicy>
</subscriber>
</profiles>
</dds>
- Reducing maxMessageSize
<!-- Reduce maxMessageSize -->
<dds xmlns="http://www.eprosima.com/XMLSchemas/fastRTPS_Profiles">
<profiles>
<transport_descriptors>
<transport_descriptor>
<transport_id>test</transport_id>
<type>UDPv4</type>
<maxMessageSize>5500</maxMessageSize>
</transport_descriptor>
</transport_descriptors>
<publisher profile_name="publisher profile" is_default_profile="true">
<historyMemoryPolicy>PREALLOCATED_WITH_REALLOC</historyMemoryPolicy>
</publisher>
<subscriber profile_name="subscriber profile" is_default_profile="true">
<historyMemoryPolicy>PREALLOCATED_WITH_REALLOC</historyMemoryPolicy>
</subscriber>
</profiles>
</dds>
Thanks @atb033,
First, I would like to know if you set the environment variable RMW_FASTRTPS_USE_QOS_FROM_XML to 1 when you used the synchronous option. As explained in the README, if this variable is not set, the History memory policy and Publishing mode are not going to be read from the XML file and instead, the ones preconfigured in the rmw layer are going to be used.
Therefore, be sure that you run the following commands:
- Publisher side:
FASTRTPS_DEFAULT_PROFILES_FILE=<path_to_xml_file> RMW_FASTRTPS_USE_QOS_FROM_XML=1 RMW_IMPLEMENTATION=rmw_fastrtps_cpp ros2 run image_tools cam2image --ros-args -p burger_mode:=true -p frequency:=10. -p reliability:=best_effort
- Subscriber side:
FASTRTPS_DEFAULT_PROFILES_FILE=<path_to_xml_file> RMW_FASTRTPS_USE_QOS_FROM_XML=1 RMW_IMPLEMENTATION=rmw_fastrtps_cpp ros2 run image_tools showimage --ros-args -p show_image:=false -p reliability:=best_effort
NOTE: the FASTRTPS_DEFAULT_PROFILES_FILE environment variable does not need to be set if the XML file is in the working directory under the name DEFAULT_FASTRTPS_PROFILES.xml.
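If the variables are exported rather than set inline as above, a quick sanity check that they are visible to the shell launching the node is (a sketch):
env | grep -E 'RMW_IMPLEMENTATION|RMW_FASTRTPS_USE_QOS_FROM_XML|FASTRTPS_DEFAULT_PROFILES_FILE'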
On the other hand, even though you are setting the options for a new transport, you are not linking this custom transport to your participant. You can use the following XML file, where all three suggested options have been included:
<dds xmlns="http://www.eprosima.com/XMLSchemas/fastRTPS_Profiles">
<profiles>
<transport_descriptors>
<transport_descriptor>
<transport_id>test</transport_id>
<type>UDPv4</type>
<maxMessageSize>5500</maxMessageSize>
<non_blocking_send>true</non_blocking_send>
</transport_descriptor>
</transport_descriptors>
<participant profile_name="participant profile" is_default_profile="true">
<rtps>
<userTransports>
<transport_id>test</transport_id>
</userTransports>
<useBuiltinTransports>false</useBuiltinTransports>
</rtps>
</participant>
<publisher profile_name="publisher profile" is_default_profile="true">
<qos>
<publishMode>
<kind>SYNCHRONOUS</kind>
</publishMode>
</qos>
<historyMemoryPolicy>PREALLOCATED_WITH_REALLOC</historyMemoryPolicy>
</publisher>
<subscriber profile_name="subscriber profile" is_default_profile="true">
<historyMemoryPolicy>PREALLOCATED_WITH_REALLOC</historyMemoryPolicy>
</subscriber>
</profiles>
</dds>
I hope that this solves your issue.
Finally, if you do not mind, it would be helpful if you could try each option sequentially and tell us whether it is enough to solve the issue:
- Remove the custom transport and try only the synchronous publishing mode.
- Use the synchronous publishing mode with the non_blocking_send option.
- Use the previous configuration, adding the maxMessageSize option to the custom transport.
This will provide us more information about your issue, as we have been unable to reproduce it.
It has been 3 years, so I checked whether this is still reproducible with the current container images ros:rolling, ros:iron and ros:humble. It turned out I cannot reproduce this issue with those containers, so I will go ahead and close this issue.
Please feel free to reopen the issue if you still have the problem.
Test Platform
- Raspi4 / Host Ubuntu 20.04, docker ros:rolling, ros:iron and ros:humble (showimage as subscriber)
- x86_64 / Host Ubuntu 20.04, docker ros:rolling, ros:iron and ros:humble (cam2image as publisher)
Result
Subscription running on Raspi4 keeps receiving image data, and publisher does not freeze when the subscriber is initiated.
Console Output
root@tomoyafujita:/# ros2 run image_tools cam2image --ros-args -p burger_mode:=true -p frequency:=10. -p reliability:=best_effort
...<snip>
[INFO] [1694125748.789014474] [cam2image]: Publishing image #3647
[INFO] [1694125748.889001996] [cam2image]: Publishing image #3648
[INFO] [1694125748.989015319] [cam2image]: Publishing image #3649
[INFO] [1694125749.089007292] [cam2image]: Publishing image #3650
[INFO] [1694125749.189011144] [cam2image]: Publishing image #3651
^C[INFO] [1694125749.261280833] [rclcpp]: signal_handler(signum=2)
root@raspi4-1:/# ros2 run image_tools showimage --ros-args -p show_image:=false -p reliability:=best_effort
...<snip>
[INFO] [1694125747.596249367] [showimage]: Received image #camera_frame
Received image #camera_frame
[INFO] [1694125747.696080180] [showimage]: Received image #camera_frame
Received image #camera_frame
[INFO] [1694125747.795356557] [showimage]: Received image #camera_frame
Received image #camera_frame
[INFO] [1694125747.896058764] [showimage]: Received image #camera_frame
Received image #camera_frame
[INFO] [1694125747.996081833] [showimage]: Received image #camera_frame
Received image #camera_frame
It has been quite a while, that's right, but unfortunately the issue persists. Publishing a sensor_msgs/Image message (same behavior on ROS 2 Foxy and Iron) without any subscribers works at ~20 Hz, which is the expected value. Using the image viewer in rqt does not in fact slow down the publication rate, which is verified by ros2 topic hz /image_raw and by the video smoothness in a locally run rqt.
Just creating a subscription on another PC (MATLAB 2024a with ROS Humble) located in the same local network does in fact slow down the publication rate, which can again be verified using the same two methods: the frequency drops to about 1.5 Hz and there is a significant drop in smoothness.
Deleting the subscriber restores the smoothness and the original frequency. The QoS reliability on both the publisher and the subscriber is set to best effort; however, I also checked another option where the publisher is configured using rclcpp::SensorDataQoS, with the queue size set to 1 or 100 on both ends, without any effect.
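Since the earlier analysis pointed at the link being overloaded, it may also help to measure how much bandwidth the image topic actually consumes; a sketch, using the topic name mentioned above:
ros2 topic bw /image_raw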