tf_remap very CPU-intensive
I was wondering what's eating my CPU when replaying bagfiles, and I found out it's tf_remap. In the bagfile, I have TF messages at about 1-2 kHz, and the code I'm running spawns like 10 TF listeners. This is enough to saturate 2 CPU cores on my machine just by running the tf_remap node.
I verified the remapping itself is not the bottleneck - the bottleneck really is rospy.Publisher.publish(), which is probably not able to publish at such a high rate from Python to that many subscribers.
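As a rough illustration of why publish() dominates (a hypothetical toy model, not rospy's actual transport code): the publisher serializes the message once but performs one write per subscriber connection, so the per-message cost scales with the number of listeners. At 1-2 kHz and ~10 listeners, that is 10,000-20,000 writes per second from Python:

```python
import io
import struct

# Hypothetical toy model of topic fan-out: serialize once,
# then pay one write per subscriber connection.
class ToyPublisher:
    def __init__(self, num_subscribers):
        # One in-memory "connection" per subscriber.
        self.connections = [io.BytesIO() for _ in range(num_subscribers)]

    def publish(self, payload):
        # Length-prefixed frame, loosely mimicking TCPROS framing.
        frame = struct.pack('<I', len(payload)) + payload
        # The write cost is paid once per subscriber.
        for conn in self.connections:
            conn.write(frame)

pub = ToyPublisher(num_subscribers=10)
for _ in range(1000):          # roughly one second of 1 kHz TF traffic
    pub.publish(b'tf message')
# Every subscriber connection has now received all 1000 frames.
sizes = [len(c.getvalue()) for c in pub.connections]
```

The model only shows the fan-out structure; the real cost in rospy additionally includes message serialization and socket overhead per connection.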
So I've written an experimental C++ tf_remap, and the CPU load is much lower (about 30-40% of one CPU core). This package also solves a few more pain points of tf_remap, like the inability to work with tf_static, or the inability to remove TF frames.
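The remapping rule itself is cheap. A minimal sketch of the idea (hypothetical pseudocode, not the actual tf_remap source; the convention of using an empty "new" name to drop a frame is an assumption for illustration):

```python
# Hypothetical sketch of frame remapping: rename frames according to
# the mappings table; a frame mapped to '' has its transforms removed.
def remap_frames(transforms, mappings):
    table = {m['old']: m.get('new', '') for m in mappings}
    out = []
    for frame_id, child_frame_id in transforms:
        new_parent = table.get(frame_id, frame_id)
        new_child = table.get(child_frame_id, child_frame_id)
        # Removing a frame: skip transforms that touch a frame mapped to ''.
        if new_parent == '' or new_child == '':
            continue
        out.append((new_parent, new_child))
    return out

# Matches the script below: transforms a->b and d->e, mapping b->c.
result = remap_frames([('a', 'b'), ('d', 'e')],
                      [{'old': 'b', 'new': 'c'}])
# result is [('a', 'c'), ('d', 'e')]
```

A dictionary lookup per frame name is all the remapping takes, which is consistent with the profiling result that publishing, not remapping, dominates.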
Now the question is: do we want to merge this package as a tool to geometry2 (not to geometry, since it's TF2-native)? Or should I rather release it as a standalone package? I assume that completely substituting the current Python node with the C++ one is not an option for many reasons...
Another question, once the previous one is settled: should a warning be added to tf/tf_remap noting that it is known to be inefficient, with a link to the C++ node for users who observe performance problems?
Here's a script that easily reproduces the high-load situation I'm describing (note that tf/static_transform_publisher does not use /tf_static, but periodically publishes to /tf, in this case at 1 kHz):
#!/bin/bash
# Kill the whole process group on exit so roscore and all nodes are cleaned up.
trap "trap - SIGTERM && kill -- -$$" SIGINT SIGTERM EXIT

# Start a roscore on the port taken from ROS_MASTER_URI.
roscore -p $(echo $ROS_MASTER_URI | cut -d':' -f3) &
sleep 5

# Spawn 10 TF listeners (tf_monitor creates a listener internally).
for i in $(seq 1 10); do
  rosrun tf tf_monitor >/dev/null 2>&1 &
done

# Publish two transforms to /tf_old with a 1 ms period, i.e. at 1 kHz each.
rosrun tf static_transform_publisher 0 0 0 0 0 0 a b 1 /tf:=/tf_old &
rosrun tf static_transform_publisher 0 0 0 0 0 0 d e 1 /tf:=/tf_old &

# Remap frame b to c, republishing from /tf_old to /tf.
rosrun tf tf_remap _mappings:='[{"old": "b", "new": "c"}]' &

# Watch the tf_remap process's CPU usage.
htop -p $(ps aux | grep tf_remap | grep -v grep | sed 's/\s\+/ /g' | cut -d' ' -f2)
For completeness, I attach profiler outputs from profiling the original Python node and the C++ node: profilers.zip. In Python, thread.lock.acquire takes almost all the self time, seconded by write_data at 1.5%. In C++, pthread_mutex_lock is also the most time-consuming call, but takes only 17% of the self time.
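For reference, the Python-side numbers can be reproduced with the standard cProfile module around the node's main loop. A sketch with a stand-in workload (repeated lock acquisitions, mimicking the thread.lock.acquire hotspot; the real profile was of course taken on the running tf_remap node):

```python
import cProfile
import io
import pstats
import threading

lock = threading.Lock()

def workload():
    # Stand-in for the node's hot loop: many lock acquisitions,
    # the call that dominated self time in the Python profile.
    for _ in range(10000):
        lock.acquire()
        lock.release()

profiler = cProfile.Profile()
profiler.enable()
workload()
profiler.disable()

# Sort by self time ("tottime"), matching how the numbers above were read.
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream).sort_stats('tottime')
stats.print_stats(5)   # top 5 calls by self time
report = stream.getvalue()
```

Sorting by tottime (self time, excluding sub-calls) rather than cumtime is what makes lock.acquire stand out in the report.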