geometry2 Testing the water for possible improvements

Hey,

I've noticed that we have a lot of spinning and busy waiting in this lib. I did a "prove of concept" project fast_tf (forgive me the cringy name) to see how these short comings might be addressed.

I've implemented the lib as follows:

Check first if the source and target frames are connected.
If they aren't connected, a condition-variable will block the thread and will be notified if a new connection is formed.
Once the frames are connected, I'm traversing starting form the "deeper" branch (I had to add a depth flag to every frame) up the tree, until both branches meet. If a link is not available (wrong time stamp), a condition variable will block until the stalled link receives new data.
I've also swapped the "backend" lib for Eigen, but this is highly optional
I've added checks if the link is static or dynamic to prevent a "remapping" - this is also highly optional.
I've swapped ros::time with std::chrono (as done in the ros2 version), but again, this is highly optional.

The results are interesting: I could measure a 75% improvement over the current version (have a look at the perf folder and the image on the README). Based on this, I would like to try to integrate the ideas from the fast_tf into this lib.

However, given the magnitude of changes and the work required for porting the ideas, I wanted first to check the water - and ask if you're interested in the changes.

If yes, we might discuss which changes make sense and how you would like to receive the PRs (all at once or step by step).

Best, Dima

Jul 12 '21 16:07 dorezyuk

I'm not sure where you meant to file this. This ticket is on ROS 2 but your talking about things that are already in this implementation. Overall improvements to the ROS 2 version are definitely of interest. The ROS 1 versions are already released into their last distros so we won't want to significantly change things there.

The only place I'm aware of looping is when you've called the API to wait, and yes the latency and efficiency there can definitely be improved, however that API is generally just for low performance low throughput interfaces and is focused on not adding significantly to the latency of a data processing pipeline which is blocked waiting for incoming data. In particular your charts are testing it with 200 parallel threads in blocking calls for which it's definitely not currently optimized. There's definitely room for improvement on the waiting and retry logic, however in particular i see that you haven't implemented the message filters which are do a lot of the queuing and triggering which was designed to take advantage of the asynchronous and structured nature of the data arrivals and not keeping N threads alive for every input but queuing them for processing.

I'd like to see some of these changes teased out and understood separately. And there are many different metrics for performance that should be considered. Off the top of my head there's lookup latency, lookup throughput, lookup overhead, insertion latency, insertion throughput, insertion overhead. As well as other metrics such as memory footprint and layout and context/thread switching costs.

Jul 13 '21 04:07 tfoote

I'm not sure where you meant to file this. This ticket is on ROS 2 but your talking about things that are already in this implementation. Overall improvements to the ROS 2 version are definitely of interest.

Ouch, wrong geometry. I wanted to test the water for ros/geometry2...

The ROS 1 versions are already released into their last distros so we won't want to significantly change things there.

I think there is (still) a huge community of ROS1 users. Given how central the geometry2 lib is for the ROS eco-system, I also think it still makes sense to put work into it. Would there be a way how we might incorporate some improvements into the ROS1 lib?

The only place I'm aware of looping is when you've called the API to wait, and yes the latency and efficiency there can definitely be improved, however that API is generally just for low performance low throughput interfaces and is focused on not adding significantly to the latency of a data processing pipeline which is blocked waiting for incoming

In the ROS1 lib (and I think also here) every method of the buffer_interface accepts a timeout and will spin until it elapses. I wasn't aware that they are "low performance and low throughput" as it always seemed that these APIs would be preverable as they are compatible with other types of TF Buffers (like the buffer_client.h or the tf_service).

In particular your charts are testing it with 200 parallel threads in blocking calls for which it's definitely not currently optimized

I think its not about the 200 threads, as the CPU consumption for both implementations increases linearly for the number of threads.

message filters which are do a lot of the queuing and triggering which was designed to take advantage of the asynchronous and structured nature of the data arrivals and not keeping N threads alive for every input but queuing them for processing

True, and maybe comparing the message filter interface will yield other results. However, at least to me, it would be nice to have also an efficient implementation which complies to the buffer_interface.h

I'd like to see some of these changes teased out and understood separately. And there are many different metrics for performance that should be considered. Off the top of my head there's lookup latency, lookup throughput, lookup overhead, insertion latency, insertion throughput, insertion overhead. As well as other metrics such as memory footprint and layout and context/thread switching costs

There are some other benchmarks in the perf folder focusing on insertion and lookup throughput - but I agree, that the picture is incomplete. Nonetheless, I still wanted first to reach out and check if its worth the time.

Jul 13 '21 08:07 dorezyuk

geometry2 geometry2 copied to clipboard

Testing the water for possible improvements

geometry2
geometry2 copied to clipboard