teb_local_planner

Improve the performance of computing sin and cos functions

Open jcmonteiro opened this issue 3 years ago • 15 comments

Once again, this PR affects quite a lot of files, but the change is actually not that big; besides, it is optional. :nerd_face:

TL;DR

For me, this PR saves 40% CPU without any noticeable impact on convergence. The maximum absolute error between the analytic sin and cos functions and the approximations is 0.003.

Why

In the application I work on, the major performance bottleneck we faced when using TEB was computing three simple functions: sin, cos, and pow. Initially, that came as a surprise to me, but it does make sense, since these functions are called all over the place to compute the graph edges.

Summary

There are basically two ways of speeding up the computation of trigonometric functions: look-up tables and mathematical approximations. I did not benchmark any of the other options, but I am confident the approximation I used is more than enough to eliminate the performance bottleneck caused by the trigonometric functions.

Comparison

As an example, this is the typical flame graph I get without the trigonometric approximations (sorry, but I had to blur the image where proprietary code is called). teb_before

And this one is with approximations for sin and cos. teb_sincos

So, for me, the improvement was roughly 40% less CPU usage.

Parameters

use_sin_cos_approximation, under a new tab called "Performance", toggles the approximations and defaults to false.

Formula

It is a pretty simple approximation: I take a second-order approximation on the 0 to 45 degree interval and then use trigonometric identities to replicate the result in the other quadrants. For comparison, these are the analytic functions and their approximations. image

And these are the absolute errors. image

jcmonteiro avatar Oct 10 '20 13:10 jcmonteiro

The description was already too long, so I decided to add this as a comment. If you find this PR useful, I have one other change where I limit the cost exponent to be an integer in the [1, 4] range. Doing so, I get an extra 10% performance improvement, for a total of 50%.

This is the flame graph with the changes to sin, cos, and pow. teb_all

jcmonteiro avatar Oct 10 '20 13:10 jcmonteiro

This is awesome, thanks a lot. We will test this soon.

amakarow avatar Oct 17 '20 15:10 amakarow

I just noticed that there are some other places that use sin and cos that were outside my use case. I'll update the PR to include them later.

jcmonteiro avatar Oct 23 '20 10:10 jcmonteiro

Sorry for taking so long to get back to this PR. I have exhaustively tested the changes and decided to enable the sin and cos approximations only for computing the distance to footprint models, force-pushing to the branch accordingly. As it turns out, the entire performance bottleneck lies in these computations, and enabling the approximations in the edges hindered convergence in some edge cases (no pun intended).

jcmonteiro avatar Nov 09 '20 18:11 jcmonteiro

Just commenting here as a user of TEB. IMHO it would be nice to keep the same API as the sin/cos provided by math.h/cmath. This would also limit the changes in other parts of the code and is easier to maintain than having the function write the result into a reference. Moreover, can one limit code duplication between sin/cos by exploiting the phase shift of pi/2 between sine and cosine? How would the approximation suggested here compare with others, e.g. https://github.com/kennyalive/fast-sine-cosine/blob/master/src/main.cpp (a random one found via Google)?

RainerKuemmerle avatar Nov 14 '20 10:11 RainerKuemmerle

IMHO it would be nice to keep the same API as the sin/cos provided by math.h/cmath. This would also limit the changes in other parts of the code and is easier to maintain than having the function write the result into a reference.

Yes and no. There is a benefit in computing sin and cos at the same time, so I must pass them as output references. I then used the same interface for computing only the sine, just for consistency.

Moreover, can one limit code duplication between sin/cos by exploiting the phase shift of pi/2 between sine and cosine?

What do you have in mind?

How would the approximation suggested here compare with others, e.g. https://github.com/kennyalive/fast-sine-cosine/blob/master/src/main.cpp (a random one found via Google)?

I have not benchmarked it, since the change cleared the performance bottleneck altogether.

jcmonteiro avatar Nov 16 '20 09:11 jcmonteiro

IMHO it would be nice to keep the same API as the sin/cos provided by math.h/cmath. This would also limit the changes in other parts of the code and is easier to maintain than having the function write the result into a reference.

Yes and no. There is a benefit in computing sin and cos at the same time, so I must pass them as output references. I then used the same interface for computing only the sine, just for consistency.

To me, the API is uncomfortable and very error-prone.

Moreover, can one limit code duplication between sin/cos by exploiting the phase shift of pi/2 between sine and cosine?

What do you have in mind?

image But this may not apply to your approach. It's just that if you have them as functions returning the sine/cosine, one can save some lines of code.

How would the approximation suggested here perform with others, e.g. https://github.com/kennyalive/fast-sine-cosine/blob/master/src/main.cpp (random one found via Google).

I have not benchmarked since the change cleared the performance bottleneck altogether.

Just reasoning about which approximation to use. For instance, the implementation linked above seems very compact, also with respect to the number of operations performed. But I am not sure about the performance.

RainerKuemmerle avatar Nov 16 '20 11:11 RainerKuemmerle

Just reasoning about which approximation to use. For instance, the implementation linked above seems very compact, also with respect to the number of operations performed. But I am not sure about the performance.

It is really compact. Thanks for the reference, btw. I have tested the precision and it is better: the maximum error is 0.001 with the approximation you suggested, compared to 0.003 with the one I implemented. Besides, the error is zero at zero, which I assume is desirable for numerical conditioning. I'll compare the performance just for the sake of it, but it should be equally good (if not better).

jcmonteiro avatar Nov 18 '20 14:11 jcmonteiro

I have changed the approximation method to the one suggested by @RainerKuemmerle. Although the maximum absolute approximation error is smaller, I've noticed a more pronounced convergence deterioration when using the approximation in the edges. Since I have disabled that, and the PR uses the approximations only for distance calculations, I think it doesn't really matter.

Also, the API now matches the one in the standard library, as suggested by @RainerKuemmerle.

jcmonteiro avatar Nov 29 '20 16:11 jcmonteiro

Have you experienced any overhead when using your wrapper for std::sin and std::cos compared to the plain implementation?

amakarow avatar Jan 04 '21 16:01 amakarow

The overhead is unnoticeable, given that the other methods called while solving the optimization are much more expensive. Profiling shows no difference either.

jcmonteiro avatar Jan 05 '21 07:01 jcmonteiro

I forgot about this one for a while. Would this PR still be of interest? Should I consider modifications to improve it (besides resolving the conflicts)?

jcmonteiro avatar Nov 29 '21 08:11 jcmonteiro

I forgot about this one for a while. Would this PR still be of interest? Should I consider modifications to improve it (besides resolving the conflicts)?

If it delivers the performance improvement you describe, then yes, it is definitely interesting. Can you first resolve the conflicts, so I can give it a try and assess the performance gain?

corot avatar Nov 29 '21 09:11 corot

I didn't notice any speedup in the computeVelocityCommands method. These are the times for 100 consecutive calls without and with fast sin/cos on the same route:

std: 0.425189 0.716500 0.751879 0.404824

fast 0.426448 0.613000 0.681991 0.406975

Nor did I notice any reduction in CPU usage, but I measured that with top, which obviously is not a sound method at all.

Where does the speedup get reflected? Also, there are still many usages of std::sin/cos; I wonder what the speedup could be if we replaced all of them with the fast versions. What about making SIN / COS macros that use either the std or the fast version depending on a compilation flag, and using them throughout?

PS: sorry, I broke the PR again with the latest merge :disappointed:

corot avatar Nov 30 '21 02:11 corot

I didn't notice any speedup in the computeVelocityCommands method. These are the times for 100 consecutive calls without and with fast sin/cos on the same route:

std: 0.425189 0.716500 0.751879 0.404824

fast 0.426448 0.613000 0.681991 0.406975

Nor did I notice any reduction in CPU usage, but I measured that with top, which obviously is not a sound method at all.

If I remember correctly, computeVelocityCommands was called thousands of times per second in my application because we treated every single cell in the costmap as an obstacle (we did not use costmap converters).

Where does the speedup get reflected? Also, there are still many usages of std::sin/cos; I wonder what the speedup could be if we replaced all of them with the fast versions. What about making SIN / COS macros that use either the std or the fast version depending on a compilation flag, and using them throughout?

I have left the changes out of the edge calculations because adding them there caused the optimization to oscillate. The macro is also an idea, but it would:

  1. prevent changing the setting at runtime, and
  2. make the feature unavailable to anyone not compiling from source.

PS: sorry, I broke the PR again with the latest merge :disappointed:

That's alright =). I will see if I can make the changes ASAP. Full disclosure: I am not compiling or running the code anymore, since my company uses ROS 2 now.

jcmonteiro avatar Dec 01 '21 07:12 jcmonteiro