Looking for speed improvements with large linear ode
Hello, I am trying to accelerate the solution of a large linear ODE (the heat equation in a cube, discretized with finite differences) with CVODE/BDF, using the Conjugate Gradient method (without preconditioning, which strangely seems to be faster than a diagonal preconditioner?). I thought that the computation of the right-hand side was the dominating cost and tried to speed things up by using 8 threads (std::thread) to compute each eighth of the RHS in the cube (tests were done on a 20-core machine). However, I don't see a big improvement in elapsed time, less than 10%. What did I miss?
Thanks for your insights!
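Since the code itself is only attached further down the thread, here is a rough sketch of what a std::thread-split RHS for a finite-difference heat equation could look like. The grid sizes NX/NY/NZ, the coefficient kappa, the spacing h, the slab decomposition, and all function names are illustrative assumptions, not taken from suncube_pt.cpp.

    #include <vector>
    #include <thread>
    #include <sundials/sundials_types.h>
    #include <nvector/nvector_serial.h>   // N_VGetArrayPointer works for serial and Pthreads vectors

    // Illustrative grid parameters -- not taken from the actual program
    static const sunindextype NX = 100, NY = 100, NZ = 100;
    static const realtype kappa = 1.0, h = 1.0 / (NX - 1);

    // 7-point Laplacian applied to the z-slabs [z0, z1)
    static void rhs_slab(const realtype* u, realtype* udot, sunindextype z0, sunindextype z1)
    {
      const realtype c = kappa / (h * h);
      auto idx = [](sunindextype i, sunindextype j, sunindextype k)
      { return (k * NY + j) * NX + i; };
      for (sunindextype k = z0; k < z1; ++k)
        for (sunindextype j = 0; j < NY; ++j)
          for (sunindextype i = 0; i < NX; ++i)
          {
            const sunindextype m = idx(i, j, k);
            realtype lap = -6.0 * u[m];
            lap += (i > 0      ? u[idx(i - 1, j, k)] : u[m]);  // crude boundary treatment,
            lap += (i < NX - 1 ? u[idx(i + 1, j, k)] : u[m]);  // for illustration only
            lap += (j > 0      ? u[idx(i, j - 1, k)] : u[m]);
            lap += (j < NY - 1 ? u[idx(i, j + 1, k)] : u[m]);
            lap += (k > 0      ? u[idx(i, j, k - 1)] : u[m]);
            lap += (k < NZ - 1 ? u[idx(i, j, k + 1)] : u[m]);
            udot[m] = c * lap;
          }
    }

    // CVRhsFn: split the cube into z-slabs, one std::thread per slab
    static int heat_rhs(realtype /*t*/, N_Vector u, N_Vector udot, void* /*user_data*/)
    {
      const realtype* ud  = N_VGetArrayPointer(u);
      realtype*       dud = N_VGetArrayPointer(udot);
      const int nthreads = 8;
      std::vector<std::thread> workers;
      for (int p = 0; p < nthreads; ++p)
      {
        sunindextype z0 = p * NZ / nthreads, z1 = (p + 1) * NZ / nthreads;
        workers.emplace_back(rhs_slab, ud, dud, z0, z1);
      }
      for (auto& w : workers) w.join();
      return 0;
    }

Given the profiles posted below, where the RHS accounts for roughly 10% of the total time, even a perfect 8x speedup of this function can only shave off a few percent of the elapsed time, which is consistent with the small improvement observed.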
How large (in terms of order of magnitude)?
Did you measure the RHS execution time and the overall execution time? You can use the built-in SUNDIALS profiler to get a breakdown of where time is spent inside SUNDIALS, and you can add a timing region to your RHS. See https://sundials.readthedocs.io/en/latest/sundials/Profiling_link.html.
With the profile we could offer better advice.
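As a concrete illustration of the suggestion above, a user timing region can be wrapped around the RHS with the SUNDIALS_MARK_BEGIN/SUNDIALS_MARK_END macros (they compile to no-ops when SUNDIALS is built without profiling). This is only a sketch of the general pattern with placeholder names, not the actual suncube_pt.cpp code.

    #include <cstdio>
    #include <sundials/sundials_context.h>
    #include <sundials/sundials_profiler.h>  // SUNDIALS_MARK_BEGIN / SUNDIALS_MARK_END
    #include <sundials/sundials_nvector.h>

    SUNContext  sunctx;
    SUNProfiler profobj;

    // CVRhsFn with a named timing region around the user work
    int rhs(realtype /*t*/, N_Vector u, N_Vector udot, void* /*user_data*/)
    {
      SUNDIALS_MARK_BEGIN(profobj, "RHS computation");
      N_VScale(1.0, u, udot);  // placeholder: replace with the real RHS computation
      SUNDIALS_MARK_END(profobj, "RHS computation");
      return 0;
    }

    int main()
    {
      SUNContext_Create(NULL, &sunctx);          // 6.x signature: (void* comm, SUNContext*)
      SUNContext_GetProfiler(sunctx, &profobj);  // default profiler, owned by the context
      // ... create vectors, CVodeCreate/CVodeInit, attach the linear solver, call CVode ...
      SUNProfiler_Print(profobj, stdout);        // print the timing breakdown
      SUNContext_Free(&sunctx);                  // also frees the default profiler
      return 0;
    }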
It is a 10^6-equation example (100x100x100 nodes). Thanks for the profiling idea, I will do that this Monday!
Hello, here are the profiling results (tests with SUNDIALS 6.2.0 on an old Linux server):
CVode 99.95% 34.415714s 34.415714s 1
SUNNonlinSolSolve 84.63% 29.141854s 29.141854s 183
SUNLinSolSolve 73.17% 25.194771s 25.194771s 190
N_VLinearSum 34.81% 11.987342s 11.987342s 6948
N_VDotProd 17.28% 5.950969s 5.950969s 4136
N_VWrmsNorm 15.04% 5.179683s 5.179683s 1648
RHS computation 10.16% 3.500166s 3.500166s 1134
N_VScale 9.32% 3.208320s 3.208320s 2334
I realized that, even with domain decomposition, the RHS computation is not the dominant cost. Hence I modified my code to use the PThreads implementation of the NVECTOR module. I obtain the following profile (with 8 threads):
CVode 99.93% 18.155444s 18.155444s 1
SUNNonlinSolSolve 86.32% 15.682584s 15.682584s 237
SUNLinSolSolve 74.00% 13.443851s 13.443851s 257
N_VLinearSum 32.61% 5.924322s 5.924322s 9334
RHS computation 18.64% 3.386912s 3.386912s 1536
N_VDotProd 14.83% 2.694207s 2.694207s 5610
N_VWrmsNorm 12.66% 2.300057s 2.300057s 2171
N_VScale 9.19% 1.669280s 1.669280s 3113
which shows a 2x speedup. However, I am very confused about the different call counts (last column), since I only changed the NVector implementation.
To investigate this, I ran the same test on my macOS machine (which has only 4 physical cores plus hyperthreading). I got the same counts with the serial NVector implementation (but faster; the machine is quite recent compared to the Linux server):
CVode 99.96% 20.328756s 20.328756s 1
SUNNonlinSolSolve 87.33% 17.759588s 17.759588s 183
SUNLinSolSolve 76.40% 15.537267s 15.537267s 190
N_VDotProd 29.91% 6.082517s 6.082517s 4136
N_VLinearSum 23.16% 4.709540s 4.709540s 6948
N_VWrmsNorm 21.76% 4.424661s 4.424661s 1648
RHS computation 10.29% 2.093031s 2.093031s 1134
N_VScale 6.46% 1.314471s 1.314471s 2334
The PThreads NVector is also about twice as fast, but again I got different counts, which also differ from the Linux version:
CVode 99.90% 10.755109s 10.755109s 1
SUNNonlinSolSolve 86.69% 9.332761s 9.332761s 192
SUNLinSolSolve 74.33% 8.001688s 8.001688s 206
N_VLinearSum 30.65% 3.300109s 3.300109s 7413
RHS computation 21.29% 2.291468s 2.291468s 1228
N_VDotProd 13.98% 1.504504s 1.504504s 4480
N_VWrmsNorm 12.87% 1.385081s 1.385081s 1721
N_VScale 9.17% 0.986839s 0.986839s 2488
I have played further with the number of threads and the counts change each time. What could explain these differences ?
How are you decomposing the problem w.r.t. threading? Are you only using threads in the RHS and via the PThreads NVector? Or are you solving multiple ODE systems in different threads by creating multiple CVode instances?
Also, how are you setting up the profiling? Are you creating a SUNProfiler object with SUNProfiler_Create, or are you using the default one that is created and attached to the SUNContext object?
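For reference, the two setups this question distinguishes might look roughly like the sketch below (the SUNProfiler_Create signature shown is the one assumed for the 6.x releases):

    #include <sundials/sundials_context.h>
    #include <sundials/sundials_profiler.h>

    int main()
    {
      // Option 1: the default profiler created by SUNContext_Create (when SUNDIALS
      // is built with profiling enabled); it is owned and freed by the context.
      SUNContext  sunctx;
      SUNProfiler prof_default;
      SUNContext_Create(NULL, &sunctx);
      SUNContext_GetProfiler(sunctx, &prof_default);  // do NOT SUNProfiler_Free this one

      // Option 2: a user-created profiler; the user is responsible for freeing it.
      // Assumed 6.x signature: SUNProfiler_Create(void* comm, const char* title, SUNProfiler*)
      SUNProfiler prof_user;
      SUNProfiler_Create(NULL, "user timings", &prof_user);
      // ... mark regions against prof_user with SUNDIALS_MARK_BEGIN/END ...
      SUNProfiler_Free(&prof_user);

      SUNContext_Free(&sunctx);  // frees prof_default along with the context
      return 0;
    }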
You can get the source here: suncube_pt.cpp.zip
I use threads in the RHS computation (with std::thread) and also via the NVector. A dedicated SUNProfiler is created and attached. I tested further this evening by removing all the profiling code (I also recompiled SUNDIALS without the profiling feature), removed the use of std::thread in the RHS, and just displayed the overall statistics. My conclusion is that using 1 thread gives the same number of solver steps as NVECTOR_SERIAL, while 2, 4, or 8 threads yield a larger number of steps. How can the number of threads change the solver results themselves?
A couple of things I noticed in the code that I doubt are the source of the problem, but may lead to strange behavior:
- It seems you are using the SUNProfiler created automatically by the SUNContext:
      SUNContext_Create(NULL, &sunctx);
      SUNContext_GetProfiler(sunctx, &profobj);
  Since you are doing this, you should not call SUNProfiler_Free at the end. The SUNContext will handle deallocating this SUNProfiler when SUNContext_Free is called.
- On line 111 you call SUNProfiler_End(profobj, "RHS computation"); but this is not needed because the SUNDIALS_MARK_END(profobj, "RHS computation"); on line 39 already handles this.
Now, regarding the seemingly different behavior depending on threading and on the system used: some problems (such as the Roberts problem that comes up in our examples) can be sensitive to the non-determinism associated with threading. This may or may not be the case for your problem. Here are some next steps you may take to understand what is going on:
- Output the solution for all runs so that they can be compared. It is important to understand whether the solutions you are getting differ (and how), and what the solution behavior itself looks like. When you do further runs, compare the solutions as well as the 'number of calls' and other integrator statistics.
- Try using threads only in the RHS, i.e., thread the RHS but use the serial N_Vector.
- Try using the Pthreads N_Vector but without threading in the RHS.
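A sketch of how such comparison runs could be instrumented, assuming the statistics are read back with CVodeGetNumSteps/CVodeGetNumRhsEvals (the same counters that appear as nsteps/nfevals in the follow-up post); the helper names and the neq/nthreads/use_pthreads parameters are illustrative, not from suncube_pt.cpp:

    #include <cstdio>
    #include <cvode/cvode.h>
    #include <nvector/nvector_serial.h>
    #include <nvector/nvector_pthreads.h>

    // Toggle the vector implementation while keeping everything else identical.
    N_Vector make_state(sunindextype neq, int nthreads, bool use_pthreads, SUNContext sunctx)
    {
      return use_pthreads ? N_VNew_Pthreads(neq, nthreads, sunctx)
                          : N_VNew_Serial(neq, sunctx);
    }

    // After CVode() has reached the final time, report the statistics that
    // should be compared between runs.
    void report_stats(void* cvode_mem)
    {
      long int nsteps = 0, nfevals = 0;
      CVodeGetNumSteps(cvode_mem, &nsteps);
      CVodeGetNumRhsEvals(cvode_mem, &nfevals);
      std::printf("nsteps=%ld, nfevals=%ld\n", nsteps, nfevals);
    }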
Thanks for pointing out my mistakes.
As you said, the variability of the results is due to the combination of the non-associativity of IEEE 754 addition and the unpredictability of the thread execution order. To illustrate this, for each state dimension (keeping the same number of threads, 8), I repeated the computation 10 times. The variability of the results (e.g., the number of steps) seems to increase with the dimension of the problem, but the expected precision (rtol=1e-4) seems to be respected (as seen on component NEQ/3 of the state). Threading the RHS computation or not does not seem to change anything, nor does choosing a state dimension that is a multiple of the number of threads (below, results for the Pthreads N_Vector with RHS threading):
NEQ=64000, NTHREADS=8:
nsteps=83, nfevals=107, U[NEQ/3]=1.37213779e+00
nsteps=83, nfevals=107, U[NEQ/3]=1.37213778e+00
nsteps=83, nfevals=107, U[NEQ/3]=1.37213778e+00
nsteps=83, nfevals=107, U[NEQ/3]=1.37213780e+00
nsteps=83, nfevals=107, U[NEQ/3]=1.37213778e+00
nsteps=83, nfevals=107, U[NEQ/3]=1.37213779e+00
nsteps=83, nfevals=107, U[NEQ/3]=1.37213781e+00
nsteps=83, nfevals=107, U[NEQ/3]=1.37213779e+00
nsteps=83, nfevals=107, U[NEQ/3]=1.37213793e+00
nsteps=83, nfevals=107, U[NEQ/3]=1.37213779e+00
NEQ=512000, NTHREADS=8:
nsteps=162, nfevals=217, U[NEQ/3]=1.37600681e+00
nsteps=156, nfevals=205, U[NEQ/3]=1.37591944e+00
nsteps=158, nfevals=209, U[NEQ/3]=1.37581862e+00
nsteps=163, nfevals=212, U[NEQ/3]=1.37565607e+00
nsteps=168, nfevals=221, U[NEQ/3]=1.37559596e+00
nsteps=159, nfevals=212, U[NEQ/3]=1.37592376e+00
nsteps=161, nfevals=217, U[NEQ/3]=1.37627769e+00
nsteps=163, nfevals=210, U[NEQ/3]=1.37544133e+00
nsteps=168, nfevals=223, U[NEQ/3]=1.37564804e+00
nsteps=155, nfevals=204, U[NEQ/3]=1.37634534e+00
NEQ=1124864, NTHREADS=8:
nsteps=195, nfevals=261, U[NEQ/3]=1.37615174e+00
nsteps=195, nfevals=261, U[NEQ/3]=1.37615809e+00
nsteps=209, nfevals=284, U[NEQ/3]=1.37553998e+00
nsteps=194, nfevals=261, U[NEQ/3]=1.37660335e+00
nsteps=189, nfevals=253, U[NEQ/3]=1.37720851e+00
nsteps=184, nfevals=243, U[NEQ/3]=1.37617627e+00
nsteps=187, nfevals=245, U[NEQ/3]=1.37731879e+00
nsteps=186, nfevals=244, U[NEQ/3]=1.37714674e+00
nsteps=192, nfevals=259, U[NEQ/3]=1.37665020e+00
nsteps=186, nfevals=246, U[NEQ/3]=1.37660197e+00
NEQ=2097152, NTHREADS=8:
nsteps=317, nfevals=464, U[NEQ/3]=1.37671663e+00
nsteps=286, nfevals=412, U[NEQ/3]=1.37605685e+00
nsteps=301, nfevals=433, U[NEQ/3]=1.37615913e+00
nsteps=295, nfevals=425, U[NEQ/3]=1.37641266e+00
nsteps=306, nfevals=445, U[NEQ/3]=1.37581009e+00
nsteps=281, nfevals=402, U[NEQ/3]=1.37658148e+00
nsteps=282, nfevals=410, U[NEQ/3]=1.37629313e+00
nsteps=305, nfevals=442, U[NEQ/3]=1.37664843e+00
nsteps=285, nfevals=408, U[NEQ/3]=1.37604755e+00
nsteps=297, nfevals=423, U[NEQ/3]=1.37836600e+00
Source of the program here: suncube_pt.cpp.zip
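To illustrate the non-associativity point above independently of SUNDIALS, here is a tiny self-contained sketch: summing the same numbers in a different grouping (as a threaded dot product or norm reduction effectively does) changes the last few bits of the result, and with threads the grouping can vary from run to run.

    #include <cstdio>
    #include <random>
    #include <vector>

    // Sum the same data serially and in chunked (thread-like) order: the results
    // differ in the last bits because IEEE 754 addition is not associative.
    int main()
    {
      std::mt19937 gen(42);
      std::uniform_real_distribution<double> dist(0.0, 1.0);
      std::vector<double> x(1000000);
      for (double& v : x) v = dist(gen);

      double serial = 0.0;
      for (double v : x) serial += v;

      double chunked = 0.0;  // mimic 8 threads: 8 partial sums, then combine
      const int nchunks = 8;
      for (int c = 0; c < nchunks; ++c)
      {
        double partial = 0.0;
        for (std::size_t i = c * x.size() / nchunks; i < (c + 1) * x.size() / nchunks; ++i)
          partial += x[i];
        chunked += partial;
      }

      std::printf("serial  = %.17g\nchunked = %.17g\ndiff    = %.3g\n",
                  serial, chunked, serial - chunked);
      return 0;
    }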
Closing now, but please reopen if the discussion should be continued.
Hi All
Just one more quick comment on this. The heat equation is parabolic and will need an effective preconditioner for its solution. The profiling shows the largest time in the linear solver and the corresponding vector operations, which is what we would expect for an unpreconditioned heat equation. I strongly suggest you try out hypre (for large problems), using either their PFMG solver for structured-grid problems or AMG if you are running on an unstructured grid. Another option is the Ginkgo library for fast solves with sparse systems; this also assumes an unstructured-grid discretization. Note that both of these libraries will need the Jacobian matrix. We have a parallel heat equation example using hypre somewhere that you can use as a reference, which may help.
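For orientation, a user-supplied preconditioner plugs into CVODE's iterative linear solver through the CVLS interface roughly as sketched below; the actual multigrid solve (hypre PFMG or similar) is only indicated by a placeholder, and the PCG/SUN_PREC_LEFT choice is an assumption about this code, not a description of the hypre example mentioned above.

    #include <cvode/cvode.h>
    #include <cvode/cvode_ls.h>
    #include <sunlinsol/sunlinsol_pcg.h>

    // Preconditioner solve: given the residual r, return z ~= P^{-1} r.
    // For the heat equation, a multigrid V-cycle (e.g. hypre PFMG) applied to
    // I - gamma*J would go here; the body below is only a placeholder.
    int psolve(realtype /*t*/, N_Vector /*y*/, N_Vector /*fy*/, N_Vector r, N_Vector z,
               realtype /*gamma*/, realtype /*delta*/, int /*lr*/, void* /*user_data*/)
    {
      N_VScale(1.0, r, z);  // placeholder: identity preconditioner
      return 0;
    }

    int psetup(realtype /*t*/, N_Vector /*y*/, N_Vector /*fy*/, booleantype /*jok*/,
               booleantype* jcurPtr, realtype /*gamma*/, void* /*user_data*/)
    {
      // ... (re)build the preconditioner for the current gamma if needed ...
      *jcurPtr = SUNFALSE;
      return 0;
    }

    // Attachment, inside the existing setup code (names assumed):
    //   SUNLinearSolver LS = SUNLinSol_PCG(u, SUN_PREC_LEFT, 0, sunctx); // 0 = default maxl
    //   CVodeSetLinearSolver(cvode_mem, LS, NULL);                       // matrix-free
    //   CVodeSetPreconditioner(cvode_mem, psetup, psolve);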
Closing due to inactivity.