rclpy icon indicating copy to clipboard operation
rclpy copied to clipboard

Humble has higher CPU usage compared to Foxy

Open rty813 opened this issue 1 year ago • 15 comments

Bug report

Required Info:

  • Operating System:
    • Nvidia Jetson Xavier NX, Ubuntu 18.04
  • Installation type:
    • from source
  • Version or commit hash:
    • foxy and humble
    • rclpy(Foxy): 1.0.13
    • rclpy(Humble): 3.3.9
  • DDS implementation:
    • Fast-RTPS
  • Client library (if applicable):
    • rclpy

Steps to reproduce issue

ros2 topic pub -r 100 /testt std_msgs/msg/String "{data: hello}"
ros2 topic hz /testt

Expected behavior

Humble has the same cpu usage as Foxy

Actual behavior

Humble has higher CPU usage

Humble: image

Foxy: image

Additional information

If I use C++ to receive, the CPU usage remains largely unchanged, with Humble even slightly lower. Hence, I am more suspicious that the issue lies with rclpy. It is likely that ros2 topic hz is also implemented using Python.

Humble: image

Foxy: image


At the same time, I conducted tests on an X86 computer, and the performance of Humble was decent, with only a slight increase in CPU usage compared to Foxy. image

rty813 avatar Oct 23 '23 03:10 rty813

But if I test with the Docker image arm64v8/ros:humble-ros-core, the CPU usage is normal. Could it be because there's an issue with the versions of rclpy libraries I compiled? image

rty813 avatar Oct 23 '23 06:10 rty813

Nvidia Jetson Xavier NX, Ubuntu 18.04

Are you running the container image on top of this host system?

rclpy(Foxy): 1.0.13

with Ubuntu 20.04? (btw, Foxy is E.O.L)

rclpy(Humble): 3.3.9

and with Ubuntu 22.04?

Could it be because there's an issue with the versions of rclpy libraries I compiled?

unlikely, both versions are almost HEAD, I do not think so. there are many difference between them, i am not sure which fix generates this performance difference though.

fujitatomoya avatar Oct 23 '23 23:10 fujitatomoya

Nvidia Jetson Xavier NX, Ubuntu 18.04

Are you running the container image on top of this host system?

container image

rclpy(Foxy): 1.0.13

with Ubuntu 20.04? (btw, Foxy is E.O.L)

Ubuntu 18.04.

rclpy(Humble): 3.3.9

and with Ubuntu 22.04?

Ubuntu 18.04, This image is the same as Foxy's image.This is built based on the Nvidia L4T image.

Could it be because there's an issue with the versions of rclpy libraries I compiled?

unlikely, both versions are almost HEAD, I do not think so. there are many difference between them, i am not sure which fix generates this performance difference though.

rty813 avatar Oct 24 '23 01:10 rty813

This is my dockerfile:

FROM nvcr.io/nvidia/l4t-base:r32.5.0

ENV OPENBLAS_CORETYPE=ARMV8
ENV PYTHONDONTWRITEBYTECODE=1
ENV PYTHONUNBUFFERED=1
ENV RMW_IMPLEMENTATION=rmw_cyclonedds_cpp

ARG DEBIAN_FRONTEND=noninteractive

ARG TOOLS=" \
    build-essential \
    fish \
    sudo \
    git \
    vim \
    python3-pip \
    gnupg \
    supervisor \
    iputils-ping \
    iproute2 \
    net-tools \
    locales \
    lrzsz \
    curl \
    wget \
    v4l-utils \
    libopenblas-dev \
    nvidia-tensorrt \
    nvidia-cudnn8 \
    nvidia-cuda \
    nvidia-opencv \
    python3-libnvinfer \
    python3-opencv"

ADD assets/cmake-3.20.0-linux-aarch64.tar.gz /usr/local/
RUN sed -i 's@http://.*.ubuntu.com@http://mirrors.tuna.tsinghua.edu.cn@g' /etc/apt/sources.list \
    && localedef -c -f UTF-8 -i zh_CN zh_CN.UTF-8 \
    && ln -s /usr/local/cmake-3.20.0-linux-aarch64/bin/* /usr/local/bin/ \
    && apt-get update \
    && apt-get install --no-install-recommends -y software-properties-common dirmngr gpg-agent \
    && apt-key adv --fetch-key http://repo.download.nvidia.com/jetson/jetson-ota-public.asc \
    && echo "deb https://repo.download.nvidia.com/jetson/common r32.5 main" > /etc/apt/sources.list.d/nvidia-l4t-apt-source.list \
    && echo "deb https://repo.download.nvidia.com/jetson/t194 r32.5 main" >> /etc/apt/sources.list.d/nvidia-l4t-apt-source.list \
    && apt-add-repository -y ppa:fish-shell/release-3 \
    && apt-get install --no-install-recommends -y ${TOOLS} \
    && rm -rf /var/lib/apt/lists/*

# Install python package
RUN git config --global http.sslVerify "false" \
    && pip3 install -i https://pypi.tuna.tsinghua.edu.cn/simple --upgrade pip \
    && pip3 config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple \
    && cd /tmp \
    && wget -m ftp://...........................:12221/home/Drive/wheels/nano \
    && pip3 install --ignore-requires-python --ignore-installed /tmp/nas.oca-teh.cn:12221/home/Drive/wheels/nano/* \
    && rm -rf /tmp/nas.oca-teh.cn:12221

# ROS2
RUN locale-gen zh_CN zh_CN.UTF-8 \
    && update-locale LC_ALL=zh_CN.UTF-8 LANG=zh_CN.UTF-8
ENV LANG=en_US.UTF-8
RUN add-apt-repository universe \
    && curl -sSL https://raw.githubusercontent.com/ros/rosdistro/master/ros.key -o /usr/share/keyrings/ros-archive-keyring.gpg \
    && echo "deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/ros-archive-keyring.gpg] http://packages.ros.org/ros2/ubuntu $(. /etc/os-release && echo $UBUNTU_CODENAME) main" | tee /etc/apt/sources.list.d/ros2.list > /dev/null \
    && add-apt-repository -y -n ppa:borglab/gtsam-release-4.1 \
    && apt-get update \
    && apt-get install -y --no-install-recommends \
        python3-rosinstall-generator \
        python3-colcon-common-extensions \
        python3-rosdep \
        libpython3-dev \
        libflann-dev \
        libboost-all-dev \
        libgtsam-dev \
        libgtsam-unstable-dev \
        libtbb-dev \
    && pip3 install -U vcstool argcomplete

RUN mkdir -p /opt/ros/foxy/src \
    && cd /opt/ros/foxy \
    && vcs import --input https://........................:9091/-/snippets/855/raw/master/ros2.repo src \
    && rosdep init \
    && rosdep update \
    && rosdep install --from-paths src --ignore-src -y --skip-keys "fastcdr rti-connext-dds-5.3.1 urdfdom_headers libpcl-all-dev" \
    && cd /tmp \
    && wget -nv https://github.com/PointCloudLibrary/pcl/releases/download/pcl-1.13.1/source.tar.gz \
    && tar -xzf source.tar.gz \
    && mkdir pcl/build \
    && cd pcl/build \
    && cmake -DCMAKE_BUILD_TYPE=Release .. \
    && make -j30 \
    && make install \
    && rm -rf /tmp/source.tar.gz \
    && rm -rf /tmp/pcl \
    && cd /opt/ros/foxy \
    && colcon build --merge-install \
    && rm -rf build log src \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /root

rty813 avatar Oct 24 '23 01:10 rty813

If I compile the ROS Humble on Ubuntu 22.04, the CPU usage returns to the same level as Foxy. But it still remains higher than arm64v8/ros:humble-ros-core.

image

rty813 avatar Oct 25 '23 03:10 rty813

any advice ?

rty813 avatar Nov 01 '23 06:11 rty813

unfortunately i do not have any clue on this, you might need to use profiler or process space analyzer which call stack increases the complexity to narrow down, i guess.

fujitatomoya avatar Nov 01 '23 06:11 fujitatomoya

@fujitatomoya Through an open-source analysis tool, VizTracer library, I discovered that within each spin_once iteration, the add_node function in the Humble takes longer compared to Foxy. Subsequently, when I commented out the line self._guard.trigger() within the add_node function of the Humble, CPU usage noticeably decreased by 10%. Therefore, is there any modification in the garbage collection aspect causing this performance degradation?

image image

rty813 avatar Nov 15 '23 11:11 rty813

add_node function in the Humble takes longer compared to Foxy.

i am not sure what exactly leads to this performance comparison, but from foxy to humble, there were huge changes including WaitSet with pybind11. that might be some performance impact, and i do not think we could roll back these changes to foxy...

Subsequently, when I commented out the line self._guard.trigger() within the add_node function of the Humble

this explains since collecting entities is the work. but i think you do not want to do this in the application. this trigger() will wake up the WaitSet to go to the next loop to collect entities to wait. (note, that is not garbage collection, guard condition.)

fujitatomoya avatar Nov 15 '23 17:11 fujitatomoya

Why is add_node called first in each spin_once iteration, followed by remove_node? Can using StaticSingleThreadedExecutor help avoid this situation?

rty813 avatar Nov 16 '23 01:11 rty813

Why is add_node called first in each spin_once iteration, followed by remove_node? Can using StaticSingleThreadedExecutor help avoid this situation?

I'm not sure if rclpy::spin() is you want.
https://github.com/ros2/rclpy/blob/6cd6fd19e8f27c4e55a1fd082db98dc99b1f085d/rclpy/rclpy/init.py#L225-L242

Barry-Xu-2018 avatar Nov 16 '23 05:11 Barry-Xu-2018

So, this means that ros2 topic hz is essentially looping through calls to rclpy.spin_once(node), and each spin_once involves both add_node and remove_node. Since the performance of add_node is slightly worse in Humble compared to Foxy, is this the cause of the performance difference?

Next, I will test calling spin(node) to avoid redundant add_node calls and observe the differences between Foxy and Humble.

rty813 avatar Nov 16 '23 06:11 rty813

So, this means that ros2 topic hz is essentially looping through calls to rclpy.spin_once(node), and each spin_once involves both add_node and remove_node. Since the performance of add_node is slightly worse in Humble compared to Foxy, is this the cause of the performance difference?

Refer to code

https://github.com/ros2/ros2cli/blob/419afea1282ae4daa43ea446a2c0bcc7394f856c/ros2topic/ros2topic/verb/hz.py#L262-L264

As fujitatomoya said, there are huge changes between foxy and Humble. Performance impacts could introduce from the changes from many different modules.

Barry-Xu-2018 avatar Nov 16 '23 08:11 Barry-Xu-2018

I continued testing using the spin function and indeed, the repetitive calls to add_node were eliminated. However, there still exists a difference in CPU usage on the ARMv7 platform. During testing on the ARMv7 platform, the difference in CPU usage exceeded two-fold, with Foxy around 35% and Humble around 70%. Upon analysis, it seems that the prolonged execution of the _take_subscription and trigger functions is causing this disparity. I am unsure about how to proceed with further analysis.

image image

rty813 avatar Nov 16 '23 09:11 rty813

I am not sure either how to proceed further right now. If that was our requirement, i would use traceability to check the each function call including implementation to get to the bottle neck.

fujitatomoya avatar Nov 16 '23 22:11 fujitatomoya