rclpy
rclpy copied to clipboard
Humble has higher CPU usage compared to Foxy
Bug report
Required Info:
- Operating System:
- Nvidia Jetson Xavier NX, Ubuntu 18.04
- Installation type:
- from source
- Version or commit hash:
- foxy and humble
- rclpy(Foxy): 1.0.13
- rclpy(Humble): 3.3.9
- DDS implementation:
- Fast-RTPS
- Client library (if applicable):
- rclpy
Steps to reproduce issue
ros2 topic pub -r 100 /testt std_msgs/msg/String "{data: hello}"
ros2 topic hz /testt
Expected behavior
Humble has the same cpu usage as Foxy
Actual behavior
Humble has higher CPU usage
Humble:
Foxy:
Additional information
If I use C++ to receive, the CPU usage remains largely unchanged, with Humble even slightly lower. Hence, I am more suspicious that the issue lies with rclpy. It is likely that ros2 topic hz
is also implemented using Python.
Humble:
Foxy:
At the same time, I conducted tests on an X86 computer, and the performance of Humble was decent, with only a slight increase in CPU usage compared to Foxy.
But if I test with the Docker image arm64v8/ros:humble-ros-core, the CPU usage is normal. Could it be because there's an issue with the versions of rclpy libraries I compiled?
Nvidia Jetson Xavier NX, Ubuntu 18.04
Are you running the container image on top of this host system?
rclpy(Foxy): 1.0.13
with Ubuntu 20.04? (btw, Foxy is E.O.L)
rclpy(Humble): 3.3.9
and with Ubuntu 22.04?
Could it be because there's an issue with the versions of rclpy libraries I compiled?
unlikely, both versions are almost HEAD, I do not think so. there are many difference between them, i am not sure which fix generates this performance difference though.
Nvidia Jetson Xavier NX, Ubuntu 18.04
Are you running the container image on top of this host system?
container image
rclpy(Foxy): 1.0.13
with Ubuntu 20.04? (btw, Foxy is E.O.L)
Ubuntu 18.04.
rclpy(Humble): 3.3.9
and with Ubuntu 22.04?
Ubuntu 18.04, This image is the same as Foxy's image.This is built based on the Nvidia L4T image.
Could it be because there's an issue with the versions of rclpy libraries I compiled?
unlikely, both versions are almost HEAD, I do not think so. there are many difference between them, i am not sure which fix generates this performance difference though.
This is my dockerfile:
FROM nvcr.io/nvidia/l4t-base:r32.5.0
ENV OPENBLAS_CORETYPE=ARMV8
ENV PYTHONDONTWRITEBYTECODE=1
ENV PYTHONUNBUFFERED=1
ENV RMW_IMPLEMENTATION=rmw_cyclonedds_cpp
ARG DEBIAN_FRONTEND=noninteractive
ARG TOOLS=" \
build-essential \
fish \
sudo \
git \
vim \
python3-pip \
gnupg \
supervisor \
iputils-ping \
iproute2 \
net-tools \
locales \
lrzsz \
curl \
wget \
v4l-utils \
libopenblas-dev \
nvidia-tensorrt \
nvidia-cudnn8 \
nvidia-cuda \
nvidia-opencv \
python3-libnvinfer \
python3-opencv"
ADD assets/cmake-3.20.0-linux-aarch64.tar.gz /usr/local/
RUN sed -i 's@http://.*.ubuntu.com@http://mirrors.tuna.tsinghua.edu.cn@g' /etc/apt/sources.list \
&& localedef -c -f UTF-8 -i zh_CN zh_CN.UTF-8 \
&& ln -s /usr/local/cmake-3.20.0-linux-aarch64/bin/* /usr/local/bin/ \
&& apt-get update \
&& apt-get install --no-install-recommends -y software-properties-common dirmngr gpg-agent \
&& apt-key adv --fetch-key http://repo.download.nvidia.com/jetson/jetson-ota-public.asc \
&& echo "deb https://repo.download.nvidia.com/jetson/common r32.5 main" > /etc/apt/sources.list.d/nvidia-l4t-apt-source.list \
&& echo "deb https://repo.download.nvidia.com/jetson/t194 r32.5 main" >> /etc/apt/sources.list.d/nvidia-l4t-apt-source.list \
&& apt-add-repository -y ppa:fish-shell/release-3 \
&& apt-get install --no-install-recommends -y ${TOOLS} \
&& rm -rf /var/lib/apt/lists/*
# Install python package
RUN git config --global http.sslVerify "false" \
&& pip3 install -i https://pypi.tuna.tsinghua.edu.cn/simple --upgrade pip \
&& pip3 config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple \
&& cd /tmp \
&& wget -m ftp://...........................:12221/home/Drive/wheels/nano \
&& pip3 install --ignore-requires-python --ignore-installed /tmp/nas.oca-teh.cn:12221/home/Drive/wheels/nano/* \
&& rm -rf /tmp/nas.oca-teh.cn:12221
# ROS2
RUN locale-gen zh_CN zh_CN.UTF-8 \
&& update-locale LC_ALL=zh_CN.UTF-8 LANG=zh_CN.UTF-8
ENV LANG=en_US.UTF-8
RUN add-apt-repository universe \
&& curl -sSL https://raw.githubusercontent.com/ros/rosdistro/master/ros.key -o /usr/share/keyrings/ros-archive-keyring.gpg \
&& echo "deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/ros-archive-keyring.gpg] http://packages.ros.org/ros2/ubuntu $(. /etc/os-release && echo $UBUNTU_CODENAME) main" | tee /etc/apt/sources.list.d/ros2.list > /dev/null \
&& add-apt-repository -y -n ppa:borglab/gtsam-release-4.1 \
&& apt-get update \
&& apt-get install -y --no-install-recommends \
python3-rosinstall-generator \
python3-colcon-common-extensions \
python3-rosdep \
libpython3-dev \
libflann-dev \
libboost-all-dev \
libgtsam-dev \
libgtsam-unstable-dev \
libtbb-dev \
&& pip3 install -U vcstool argcomplete
RUN mkdir -p /opt/ros/foxy/src \
&& cd /opt/ros/foxy \
&& vcs import --input https://........................:9091/-/snippets/855/raw/master/ros2.repo src \
&& rosdep init \
&& rosdep update \
&& rosdep install --from-paths src --ignore-src -y --skip-keys "fastcdr rti-connext-dds-5.3.1 urdfdom_headers libpcl-all-dev" \
&& cd /tmp \
&& wget -nv https://github.com/PointCloudLibrary/pcl/releases/download/pcl-1.13.1/source.tar.gz \
&& tar -xzf source.tar.gz \
&& mkdir pcl/build \
&& cd pcl/build \
&& cmake -DCMAKE_BUILD_TYPE=Release .. \
&& make -j30 \
&& make install \
&& rm -rf /tmp/source.tar.gz \
&& rm -rf /tmp/pcl \
&& cd /opt/ros/foxy \
&& colcon build --merge-install \
&& rm -rf build log src \
&& rm -rf /var/lib/apt/lists/*
WORKDIR /root
If I compile the ROS Humble on Ubuntu 22.04, the CPU usage returns to the same level as Foxy. But it still remains higher than arm64v8/ros:humble-ros-core
.
any advice ?
unfortunately i do not have any clue on this, you might need to use profiler or process space analyzer which call stack increases the complexity to narrow down, i guess.
@fujitatomoya Through an open-source analysis tool, VizTracer library, I discovered that within each spin_once
iteration, the add_node
function in the Humble takes longer compared to Foxy. Subsequently, when I commented out the line self._guard.trigger()
within the add_node
function of the Humble, CPU usage noticeably decreased by 10%. Therefore, is there any modification in the garbage collection aspect causing this performance degradation?
add_node function in the Humble takes longer compared to Foxy.
i am not sure what exactly leads to this performance comparison, but from foxy to humble, there were huge changes including WaitSet
with pybind11. that might be some performance impact, and i do not think we could roll back these changes to foxy...
Subsequently, when I commented out the line self._guard.trigger() within the add_node function of the Humble
this explains since collecting entities is the work.
but i think you do not want to do this in the application. this trigger()
will wake up the WaitSet
to go to the next loop to collect entities to wait. (note, that is not garbage collection, guard condition.)
Why is add_node called first in each spin_once iteration, followed by remove_node? Can using StaticSingleThreadedExecutor
help avoid this situation?
Why is add_node called first in each spin_once iteration, followed by remove_node? Can using StaticSingleThreadedExecutor help avoid this situation?
I'm not sure if rclpy::spin() is you want.
https://github.com/ros2/rclpy/blob/6cd6fd19e8f27c4e55a1fd082db98dc99b1f085d/rclpy/rclpy/init.py#L225-L242
So, this means that ros2 topic hz
is essentially looping through calls to rclpy.spin_once(node)
, and each spin_once
involves both add_node
and remove_node
. Since the performance of add_node
is slightly worse in Humble compared to Foxy, is this the cause of the performance difference?
Next, I will test calling spin(node)
to avoid redundant add_node
calls and observe the differences between Foxy and Humble.
So, this means that ros2 topic hz is essentially looping through calls to rclpy.spin_once(node), and each spin_once involves both add_node and remove_node. Since the performance of add_node is slightly worse in Humble compared to Foxy, is this the cause of the performance difference?
Refer to code
https://github.com/ros2/ros2cli/blob/419afea1282ae4daa43ea446a2c0bcc7394f856c/ros2topic/ros2topic/verb/hz.py#L262-L264
As fujitatomoya said, there are huge changes between foxy and Humble. Performance impacts could introduce from the changes from many different modules.
I continued testing using the spin
function and indeed, the repetitive calls to add_node were eliminated. However, there still exists a difference in CPU usage on the ARMv7 platform. During testing on the ARMv7 platform, the difference in CPU usage exceeded two-fold, with Foxy around 35% and Humble around 70%. Upon analysis, it seems that the prolonged execution of the _take_subscription
and trigger
functions is causing this disparity. I am unsure about how to proceed with further analysis.
I am not sure either how to proceed further right now. If that was our requirement, i would use traceability to check the each function call including implementation to get to the bottle neck.