
tf_remap very CPU-intensive

peci1 opened this issue 5 years ago · 4 comments

I was wondering what was eating my CPU when replaying bagfiles, and it turned out to be tf_remap. The bagfile contains TF messages at about 1-2 kHz, and the code I'm running spawns about 10 TF listeners. That alone is enough to saturate 2 CPU cores on my machine just by running the tf_remap node.
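A back-of-envelope calculation makes the load plausible. The numbers below are the figures from this report (2 kHz as the upper bound on the incoming rate, 10 listeners); they are assumptions for illustration, not measurements:

```python
# Fan-out estimate for the remapped /tf topic, using the rates
# mentioned above (assumed figures, not measured ones).
tf_rate_hz = 2000    # incoming /tf_old messages per second (upper bound)
subscribers = 10     # TF listeners subscribed to the remapped /tf

# rospy serializes a message once per publish() but writes it to every
# subscriber connection, so the node performs this many per-connection
# socket writes per second, all from Python:
writes_per_s = tf_rate_hz * subscribers
print(writes_per_s)  # 20000
```

20,000 per-connection writes per second in a single Python process is consistent with the publish() call dominating the profile.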

I verified that the remapping itself is not the bottleneck; the bottleneck really is rospy.Publisher.publish(), which probably cannot sustain such a high rate from Python with that many subscribers.
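To see why the remapping step is cheap, here is a minimal sketch of what the remap itself amounts to (a hypothetical stand-in, not the actual node code; the dict-based "transforms" and the remap_frames helper are illustration only):

```python
# Hypothetical sketch: the remap is just a dict lookup per frame name.
# The real node then republishes the message, which is the costly part.

mappings = {"b": "c"}  # old frame name -> new frame name

def remap_frames(transforms, mappings):
    """Rename frame_id/child_frame_id in a list of transform-like dicts."""
    for t in transforms:
        t["frame_id"] = mappings.get(t["frame_id"], t["frame_id"])
        t["child_frame_id"] = mappings.get(t["child_frame_id"],
                                           t["child_frame_id"])
    return transforms

msg = [{"frame_id": "a", "child_frame_id": "b"}]
print(remap_frames(msg, mappings))
# [{'frame_id': 'a', 'child_frame_id': 'c'}]
```

Two O(1) lookups per transform is negligible next to serializing and writing the message to every subscriber.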

So I've written an experimental C++ tf_remap, and its CPU load is much lower (about 30-40% of one CPU core). The package also addresses a few more pain points of tf_remap, such as its inability to work with tf_static or to remove TF frames.

Now the question is: do we want to merge this package as a tool into geometry2 (not geometry, since it's TF2-native)? Or should I rather release it as a standalone package? I assume that completely replacing the current Python node with the C++ one is not an option, for many reasons...

Once that is decided, a follow-up question: should tf/tf_remap get a warning that it is known to be inefficient, with a link to the C++ node for users who observe performance problems?


Here's a script that easily reproduces the high-load situation I'm describing (note that tf/static_transform_publisher does not use /tf_static; it periodically publishes to /tf, in this case at 1 kHz):

#!/bin/bash

trap "trap - SIGTERM && kill -- -$$" SIGINT SIGTERM EXIT

roscore -p $(echo $ROS_MASTER_URI | cut -d':' -f3) &

sleep 5

for i in `seq 1 10`; do
	rosrun tf tf_monitor 2>/dev/null 1>/dev/null &
done

rosrun tf static_transform_publisher 0 0 0 0 0 0 a b 1 /tf:=/tf_old &
rosrun tf static_transform_publisher 0 0 0 0 0 0 d e 1 /tf:=/tf_old &

rosrun tf tf_remap _mappings:='[{"old": "b", "new": "c"}]' &

htop -p $(pgrep -d',' -f 'tf_remap')


For completeness, I attach profiler outputs from profiling the original Python node and the C++ node: profilers.zip . In Python, thread.lock.acquire takes almost all the self time, followed by write_data at 1.5%. In C++, pthread_mutex_lock is also the most time-consuming call, but it accounts for only 17% of the self time.

peci1 avatar Sep 05 '18 16:09 peci1