rmw_fastrtps
CPU spikes of existing nodes when starting new node
Bug report
Required Info:
- Operating System: Ubuntu 22.04
- Installation type: binaries
- Version or commit hash: Humble
  - ros-humble-fastrtps/now 2.6.7-1jammy.20240125.204216 amd64 [installed,local]
  - ros-humble-rmw-fastrtps-cpp/now 6.2.6-1jammy.20240125.215950 amd64 [installed,local]
- DDS implementation: FastDDS
Steps to reproduce issue
- Use default XML configuration (FASTRTPS_DEFAULT_PROFILES_FILE not set)
- I have my robot bringup running, consuming a decent amount of CPU, with about 70 nodes (spread across several Docker containers sharing network and IPC with the host, if that is relevant)
- Launch some node on the side (e.g. teleop, ros2 topic echo, etc.)
- Witness a CPU spike
Expected behavior
No considerable CPU spike for the existing nodes
Actual behavior
CPU usage spikes for a few seconds: all existing nodes roughly double their consumption! I'm guessing it has to do with discovery?
Additional information
I quickly tried with Cyclone DDS and did not witness the CPU spike, but I would like to fix it with Fast DDS if possible (otherwise I will have to switch).
With some experimentation I also noticed that the more existing nodes there are, the higher the CPU rise when an extra node is added to the network.
@tonynajjar thanks for creating the issue. We have been seeing a similar situation...
A couple of things:
- Initial Announcements can be related to the CPU usage spike during the discovery process. Depending on network resources, reliability, and your discovery-latency requirements, tuning this could mitigate the CPU spike during the initial discovery phase. (I believe this setting is also applied to Endpoint Discovery.)
- Do you happen to use ROS 2 Security Enclaves? Enabling security adds more work, such as handshaking, during the discovery process.
I am not sure whether the ROS 2 Fast DDS Discovery Server is acceptable for you, since it changes the architecture, but it would reduce the discovery cost significantly.
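In case it is useful, here is a minimal, untested sketch of how the Discovery Server could be wired into a launch file. The fastdds discovery CLI and the ROS_DISCOVERY_SERVER environment variable are documented parts of Fast DDS / rmw_fastrtps, but the port, address, and node choice below are my own assumptions.

from launch import LaunchDescription
from launch.actions import ExecuteProcess, SetEnvironmentVariable
from launch_ros.actions import Node


def generate_launch_description():
    return LaunchDescription([
        # Start a Fast DDS Discovery Server on localhost (server id 0, port 11811).
        ExecuteProcess(
            cmd=['fastdds', 'discovery', '-i', '0', '-l', '127.0.0.1', '-p', '11811'],
            output='screen',
        ),
        # Point every node launched after this action at the server instead of
        # simple (multicast) discovery. Nodes started outside this launch file
        # need the same environment variable set in their shell.
        SetEnvironmentVariable('ROS_DISCOVERY_SERVER', '127.0.0.1:11811'),
        # Example client node; replace with the actual bringup nodes.
        Node(package='demo_nodes_cpp', executable='talker'),
    ])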
CC: @MiguelCompany @EduPonz
Thanks for your answer @fujitatomoya.
- We do not use security enclaves.
- I did quickly try to get the Discovery Server working, to confirm it was a discovery issue, but failed to do so for some reason; maybe because of my Docker setup, I'm not sure. In any case, in the long run I'd like to avoid the Discovery Server (no strong reason, but it feels like going back to the ROS 1 centralized approach, which was criticized and changed in ROS 2).
- Regarding the Initial Announcements, are you proposing testing out something with the config? I'm really not a DDS configuration expert (like most roboticists), so you'll have to spell it out for me 😅
Depending on network resources, reliability, and your discovery-latency requirements
Your comment reminded me to clarify that all the nodes are running on one machine, so I guess the issue can't be caused by a suboptimal network.
Regarding the Initial Announcements, are you proposing testing out something with the config?
I think you can create DEFAULT_FASTRTPS_PROFILES.xml in the directory from which you issue ros2 run xxx, and it should be loaded by Fast DDS. (In the profile below, the initial announcement count is reduced from 5 to 1 and the period is changed from 100 msec to 500 msec.)
<?xml version="1.0" encoding="UTF-8" ?>
<profiles xmlns="http://www.eprosima.com/XMLSchemas/fastRTPS_Profiles">
    <participant profile_name="participant_profile_simple_discovery" is_default_profile="true">
        <rtps>
            <builtin>
                <discovery_config>
                    <initialAnnouncements>
                        <count>1</count>
                        <period>
                            <nanosec>500000000</nanosec>
                        </period>
                    </initialAnnouncements>
                </discovery_config>
            </builtin>
        </rtps>
    </participant>
</profiles>
My expectation here is:
- With 70 existing ROS 2 contexts (70 participants), a new ROS 2 node (context) sends its initial discovery announcement 5 times at 100 msec intervals by default, and each of the 70 receivers that gets these packets sends back its own participant information every time. This could generate the CPU usage spike (reliable, low-latency discovery, but expensive?). See the rough arithmetic sketch below.
- If all nodes are on localhost, the network is reliable enough, so we could send just a single-shot initial announcement for each participant.
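As a rough, hypothetical back-of-envelope (my own simplification, not a Fast DDS formula): if every initial announcement from the new participant draws one reply from each existing participant, the size of the discovery burst scales roughly like this.

def discovery_burst(existing_participants: int, announcement_count: int) -> int:
    # Announcements sent by the new participant, plus one reply per existing
    # participant per announcement (ignores endpoint discovery on top of this).
    return announcement_count + existing_participants * announcement_count


# Default settings: 5 initial announcements every 100 msec.
print(discovery_burst(70, 5))  # 355 participant-discovery messages
# Proposed settings: a single initial announcement.
print(discovery_burst(70, 1))  # 71 messages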
Anyway, I would like to have eProsima's opinion on this. Hopefully this helps.
Thanks @fujitatomoya, this is indeed what I would have suggested to try out as well. Please @tonynajjar do let us know how it goes.
Thank you for your recommendation. Unfortunately it did not work. All the nodes in my localhost network are running this configuration:
<?xml version="1.0" encoding="UTF-8" ?>
<profiles xmlns="http://www.eprosima.com/XMLSchemas/fastRTPS_Profiles">
<participant profile_name="participant_profile_simple_discovery" is_default_profile="true">
<rtps>
<builtin>
<discovery_config>
<initialAnnouncements>
<count>1</count>
<period>
<nanosec>500000000</nanosec>
</period>
</initialAnnouncements>
</discovery_config>
</builtin>
</rtps>
</participant>
</profiles>
I still get the CPU spike
@tonynajjar I am curious, what command did you use for this verification? e.g. ros2 topic xxx without the daemon running?
I just started a custom teleop node, but I think ros2 topic echo xxx would also cause the spike; it has in the past.
Any alternative solutions I could try? Could one of the maintainers try to reproduce this so that we at least know for sure that it is not a local/configuration issue? If we can confirm that, I think this bug deserves some high-priority attention: for applications already close to the limits of their CPU budget, it would be a deal breaker for using Fast DDS.
@tonynajjar CC: @EduPonz
I still get the CPU spike
I think there is still a spike after the configuration is applied, but I would expect the spike period to be shorter and the CPU consumption to come down quicker than before. If you are seeing no difference at all, maybe the configuration is not applied; make sure that DEFAULT_FASTRTPS_PROFILES.xml is in the directory from which you issue ros2 run xxx (or point FASTRTPS_DEFAULT_PROFILES_FILE at it explicitly).
Something else I would try is disabling the shared memory transport. Our experience is that the shared memory transport provides good performance and latency, but uses more CPU in the application. If the shared memory transport is disabled, traffic goes through the network interface (UDP) instead.
<?xml version="1.0" encoding="UTF-8" ?>
<profiles xmlns="http://www.eprosima.com/XMLSchemas/fastRTPS_Profiles">
    <transport_descriptors>
        <transport_descriptor>
            <transport_id>udp_transport</transport_id>
            <type>UDPv4</type>
        </transport_descriptor>
    </transport_descriptors>
    <participant profile_name="UDPParticipant" is_default_profile="true">
        <rtps>
            <userTransports>
                <transport_id>udp_transport</transport_id>
            </userTransports>
            <useBuiltinTransports>false</useBuiltinTransports>
        </rtps>
    </participant>
</profiles>
If none of the above works, that is out of my league...
Thank you for your answer. I'm pretty sure the configuration was applied; I verified this by introducing a typo and seeing errors when launching the nodes. I didn't see much difference. Maybe I didn't look in great detail, but even if the spike goes away quicker than before, having it in the first place is not really acceptable for my application.
Regarding disabling shared memory, I think I already tried that, but I can't remember for sure; I'll give it another shot in the next few days.
I'd appreciate it if someone could try reproducing this. I'll try to create a minimal reproducible launch file, e.g. launching 40 talkers and 40 listeners.
from launch import LaunchDescription
from launch_ros.actions import Node


def generate_launch_description():
    # Initialize an empty list to hold all the nodes
    nodes = []

    # Define the number of talkers and listeners
    num = 40

    # Create talker nodes
    for i in range(num):
        talker_node = Node(
            package='demo_nodes_cpp',
            executable='talker',
            namespace='talker_' + str(i),  # Use namespace to avoid conflicts
            name='talker_' + str(i)
        )
        nodes.append(talker_node)

    # Create listener nodes
    for i in range(num):
        listener_node = Node(
            package='demo_nodes_cpp',
            executable='listener',
            namespace='listener_' + str(i),  # Use namespace to avoid conflicts
            name='listener_' + str(i),
            remappings=[
                (f"/listener_{str(i)}/chatter", f"/talker_{str(i)}/chatter"),
            ],
        )
        nodes.append(listener_node)

    # Create the launch description with all the nodes
    return LaunchDescription(nodes)
Here is a launch file to reproduce the issue. After it is launched, run ros2 run demo_nodes_cpp listener in another terminal and watch in htop as the CPU usage of all the nodes gets multiplied by 2-3.
Because the initial CPU usage of these nodes is low, the jump is not very noticeable here, but from what I tested earlier it scales up when the initial CPU usage is already high.
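If it helps, here is a small, hypothetical measurement script (my own sketch, not part of the original report) that sums the CPU usage of the demo talker/listener processes with psutil, so the jump shows up as numbers instead of eyeballing htop. It assumes psutil is installed and that the processes are named talker and listener.

import time

import psutil


def demo_processes():
    # Collect all running demo_nodes_cpp talker/listener processes.
    return [p for p in psutil.process_iter(['name'])
            if p.info['name'] in ('talker', 'listener')]


def main():
    procs = demo_processes()
    # Prime cpu_percent(); its first call always returns 0.0.
    for p in procs:
        p.cpu_percent()
    while True:
        time.sleep(1.0)
        total = 0.0
        for p in procs:
            try:
                total += p.cpu_percent()
            except psutil.NoSuchProcess:
                pass
        print(f'total CPU of {len(procs)} demo processes: {total:.1f}%')


if __name__ == '__main__':
    main()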
@fujitatomoya or @EduPonz, were you able to reproduce the issue with the example I provided? It would already be useful to confirm whether this is a bug or a suboptimal configuration on my side.
@tonynajjar sorry for being late to get back to you. We are aware of this situation; I did not use your example, but having more than 100 nodes generates a CPU spike for a few seconds. As we already know, this is because of participant discovery.
I am not sure any other configuration would mitigate this transient CPU load...