odeint-v2 4th order Runke-Kutta performance

Dear Developers,

I am using Odeint in a project solving the physical N-body system. I realized that the Runge Kutta stepper provided could work more efficiently.

When I use the stepper named "runge_kutta4_classic", operator+= of state type is called 7 times and operator *= is called 11 times.

When I use the stepper named "runge_kutta4", operator+= is called 10 times and operator *= is called 14 times.

The following code (below) can do a 4th order Runge Kutta step with only 7 += and 5 *= operator calls. Results using this code seem to be identical with the "runge_kutta4_classic" steppers'.

Maybe this is a point where you could improve the RK4 stepper, or I just got something wrong. :)

Btw, could you tell me the difference between steppers runge_kutta4_classic and runge_kutta4? I could not find any documentation regarding this.

Thanks for Odeint, Roland

Phase F1, F2, F3, F4; Phase xtemp;

void rungeKutta4(Phase &x, double t, double tau, void (*derivsRK)(const Phase &x, Phase &deriv, const double t)) {

//* Evaluate F1 = f(x,t).
(*derivsRK)(x, F1, t);

//* Evaluate F2 = f( x+tau*F1/2, t+tau/2 ).
double half_tau = 0.5 * tau;
double t_half = t + half_tau;

xtemp = x + half_tau * F1;
(*derivsRK)(xtemp, F2, t_half);

//* Evaluate F3 = f( x+tau*F2/2, t+tau/2 ).
xtemp = x + half_tau * F2;
(*derivsRK)(xtemp, F3, t_half);

//* Evaluate F4 = f( x+tau*F3, t+tau ).
double t_full = t + tau;
xtemp = x + tau * F3;
(*derivsRK)(xtemp, F4, t_full);

//* Return x(t+tau) computed from fourth-order R-K.
x += tau / 6. * (F1 + F4 + 2. * (F2 + F3));

}

Oct 25 '13 12:10 loczaj

Dear Roland,

thank you for your hint. The difference between runge_kutta4_classic and runge_kutta4 lies in the implementation of the algorithm. runge_kutta4_classical uses a straight-forward implemetation similar to yours, while runge_kutta4 uses an adavanced and generic template meta-programming algorithm which is also used in the other runge_kutta steppers.

Have you measured any performance? It might be possible that both implementations have nearly equal performance due to some optimizations strategies of the compiler.

Oct 25 '13 12:10 headmyshoulder

Hi Roland,

thanks for your interest in odeint. Indeed, the rk4 method seems to be implemented in a suboptimal way. However, I have done extensive performance tests on that matter of rk4,rk4_classic and the numerical recipes code (which is your code as well, i think?) and I could not find any measurable difference. I think the reason is that our standard rk4 involves several multiplications/additions with zero which might get optimized by the compiler or CPU. However, if you find a performance decrease for our rk4 compared to rk4_classic or the NR code, we would surely look into that and try to match the performance.

Oct 25 '13 13:10 mariomulansky

Dear Headmyshoulder, Mario,

Thanks for your fast answers. Yes, the mentioned code is taken originally from Numerical Recipes. However, I might have modified it slightly. (anyway i am not sure as i have been using this version for ages)

Yes, I did performance measurements, as well. I simulated a 3 dimensional 4 body problem (state vectors of length 24) with 6 gravitational interactions. I iterated 3 million steps using Linux x86_64, boost 1.54.0, gcc 4.7.3. The simulator program is on Github: https://github.com/loczaj/bohrbiter . The measured time consumption is in the same magnitude for each steppers but still there is a significant difference. I made 5-5 tests with each stepper.

Odeint runge_kutta4_classic 11.6s 11.5s 11.7s 11.6s 11.7s

Odeint runge_kutta4 13.0s 12.9s 13.1s 13.0s 13.3s

Numerical Recipes 10.2s 10.1s 10.5s 10.2s 10.2s

There is an avarage 27% performance decrease for Odeint rk4 referenced to Numerical Recipes. Odeint rk4_classic is faster than Odeint rk4 but still around 13% slower than Numerical Recipes.

I know my measurement is not an exact laboratory work but it convinced me that extra operator calls cause a performance decrease in Odeint. The derivative is computed 4 times every step in case of each stepper. So I think the only cause of the performance loss can be the unnecessary operator calls.

Odeint steppers runge_kutta4_classic and runge_kutta4 not only differ regarding performance but also regarding numerical result. When I switched to stepper runge_kutta4_classic the simulated trajectories changed slightly referenced to the case when I used stepper runge_kutta4. Is this difference designed or did we find a bug?

Regards, Roland

Oct 25 '13 18:10 loczaj

Dear Roland,

I had a look on your program. It seems, that in your case the abstraction of odeint has some significant performance overhead. I also run your program and I have a performance difference for the classical RK4 and odeint's default RK4 of 7%.

It might be better to use only header-files. It will give you better performance, because the compiler knows more about your code. The compiler knows everything on instantiation of the stepper and during the call of do_step and therefore can apply better optimization.

The differences in the result in not surprising, they originate from the different implementations and the according numerical deviation.

One further remark: you system under consideration is a Hamiltonian one, so the energy and the phase volume are conserved. You should have a look at the symplectic steppers, they conserve phase volume exactly and the energy oscillates with low amplitude around its mean.

Best regards,

Karsten

On 10/25/2013 08:23 PM, Roland Lohner wrote:

Dear Headmyshoulder, Mario,

Thanks for your fast answers. Yes, the mentioned code is taken originally from Numerical Recipes. However, I might have modified it slightly. (anyway i am not sure as i have been using this version for ages)

Yes, I did performance measurements, as well. I simulated a 3 dimensional 4 body problem (state vectors of length 24) with 6 gravitational interactions. I iterated 3 million steps using Linux x86_64, boost 1.54.0, gcc 4.7.3. The simulator program is on Github: https://github.com/loczaj/bohrbiter . The measured time consumption is in the same magnitude for each steppers but still there is a significant difference. I made 5-5 tests with each stepper.

Odeint runge_kutta4_classic 11.6s 11.5s 11.7s 11.6s 11.7s

Odeint runge_kutta4 13.0s 12.9s 13.1s 13.0s 13.3s

Numerical Recipes 10.2s 10.1s 10.5s 10.2s 10.2s

There is an avarage 27% performance decrease for Odeint rk4 referenced to Numerical Recipes. Odeint rk4_classic is faster than Odeint rk4 but still around 13% slower than Numerical Recipes.

I know my measurement is not an exact laboratory work but it convinced me that extra operator calls cause a performance decrease in Odeint. The derivative is computed 4 times every step in case of each stepper. So I think the only cause of the performance loss can be the unnecessary operator calls.

Odeint steppers runge_kutta4_classic and runge_kutta4 not only differ regarding performance but also regarding numerical result. When I switched to stepper runge_kutta4_classic the simulated trajectories changed slightly referenced to the case when I used stepper runge_kutta4. Is this difference designed or did we find a bug?

Regards, Roland

— Reply to this email directly or view it on GitHub https://github.com/headmyshoulder/odeint-v2/issues/104#issuecomment-27114731.

Oct 25 '13 18:10 headmyshoulder

Dear Karsten,

Thanks for reviewing my program and for the hints. Your availability is amazing. :) I also considered using symplectic_rkn_sb3a_mclachlan stepper for Hamiltonian systems.

Thanks for Odeint. Regards, Roland

Oct 26 '13 13:10 loczaj

Hey Roland,

I had a look at your github repo, but it contains only the code for odeint, not the NR version for comparison, does it? Could you add this as well, then i'd have a look on the performance and spot the problems. Thanks!

Oct 28 '13 08:10 mariomulansky

Hi Mario,

I recently added the NR version of the test 'revolution' under the name revolutionNR.cpp. In order to compile this version you have to modify Jamroot.

The numerical problem, iteration number and time step are the same for both versions. Odeint version uses now runge_kutta4_classic stepper which is faster than runge_kutta4.

Thanks for investigating this issue and thanks for Odeint itself. ;)

Regards, Roland

Oct 28 '13 09:10 loczaj

Hey Roland, I've checked your code and I can reproduce your observation of NR < RK4_classic < RK4. It seems the issue is indeed the sub-optimal formulation of the algorithm in odeint which then is heavily pronounced in your example. E.g. by using a vector of pointers as state type which gives non-continuous memory allocation, and operators +,* on the state type which results in unnecessary temporaries. Nevertheless, this example points out a problem in odeint and we will look into that. thanks a lot for reporting the issue.

Oct 28 '13 19:10 mariomulansky

Hi Mario, Karsten, Thanks for looking into this issue. Yes, bohrbiter project might be worth a refactor. Unfortunately I have to write my diploma thesis at the moment. :)

I recently tagged my project with tag 'odeint-test' so that you can use it for testing later, when I commit some new stuff and bugs. ;)

Regards, Roland

Oct 29 '13 14:10 loczaj

odeint-v2 odeint-v2 copied to clipboard

4th order Runke-Kutta performance

odeint-v2
odeint-v2 copied to clipboard