
Looking for speed improvements with large linear ode

mottelet opened this issue on Sep 30 '22 · 9 comments

Hello, I am trying to accelerate the solution of a large linear ODE (heat equation in a cube with finite differences) with CVODE/BDF by using the Conjugate Gradient method (without preconditioning, which strangely seems to be faster than a diagonal preconditioner?). I thought that the computation of the right-hand side was the dominating cost and tried to speed things up by using 8 threads (std::thread) to compute each eighth of the RHS in the cube (tests were done on a 20-core machine). However, I don't see a big improvement in elapsed time, less than 10%. What did I miss?
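For illustration, the RHS splitting I mean looks roughly like the sketch below. This is a minimal version only, not the actual code from suncube_pt.cpp: the grid size, zero Dirichlet boundary handling, indexing helper, and thread count are placeholders.

  #include <thread>
  #include <vector>
  #include <nvector/nvector_serial.h>

  static const int N = 100;                  // nodes per direction (placeholder)
  static const realtype h = 1.0 / (N + 1);   // grid spacing (placeholder)

  static inline int idx(int i, int j, int k) { return i + N * (j + N * k); }

  // RHS of the semi-discretized heat equation: each thread handles a slab of z-planes.
  static int rhs(realtype t, N_Vector u, N_Vector udot, void *user_data)
  {
    const realtype *U = N_VGetArrayPointer(u);
    realtype *Udot    = N_VGetArrayPointer(udot);
    const int nthreads = 8;

    auto work = [&](int k0, int k1) {
      for (int k = k0; k < k1; ++k)
        for (int j = 0; j < N; ++j)
          for (int i = 0; i < N; ++i) {
            realtype s = -6.0 * U[idx(i, j, k)];       // zero Dirichlet BCs assumed
            if (i > 0)     s += U[idx(i - 1, j, k)];
            if (i < N - 1) s += U[idx(i + 1, j, k)];
            if (j > 0)     s += U[idx(i, j - 1, k)];
            if (j < N - 1) s += U[idx(i, j + 1, k)];
            if (k > 0)     s += U[idx(i, j, k - 1)];
            if (k < N - 1) s += U[idx(i, j, k + 1)];
            Udot[idx(i, j, k)] = s / (h * h);
          }
    };

    std::vector<std::thread> pool;
    for (int p = 0; p < nthreads; ++p)               // one contiguous slab per thread
      pool.emplace_back(work, p * N / nthreads, (p + 1) * N / nthreads);
    for (auto &th : pool) th.join();
    return 0;
  }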

Thanks for your insights!

mottelet avatar Sep 30 '22 13:09 mottelet

How large (in terms of order of magnitude)?

Did you measure the RHS execution time and the overall execution time? You can use the built-in SUNDIALS profiler to get a breakdown of where time is spent inside SUNDIALS, and you can add a timing region to your RHS. See https://sundials.readthedocs.io/en/latest/sundials/Profiling_link.html.
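For example, a timed region around the RHS body could look like the sketch below. This is only a minimal sketch; it assumes SUNDIALS was built with profiling enabled and that the SUNProfiler is made reachable from the RHS, e.g. through user_data.

  #include <sundials/sundials_types.h>
  #include <sundials/sundials_nvector.h>
  #include <sundials/sundials_profiler.h>

  static int rhs(realtype t, N_Vector u, N_Vector udot, void *user_data)
  {
    // Assumption: the SUNProfiler is passed in via user_data.
    SUNProfiler profobj = static_cast<SUNProfiler>(user_data);

    SUNDIALS_MARK_BEGIN(profobj, "RHS computation");
    /* ... fill udot from u ... */
    SUNDIALS_MARK_END(profobj, "RHS computation");
    return 0;
  }

After CVode() returns, SUNProfiler_Print(profobj, stdout) writes the table of regions, percentages, and call counts.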

With the profile we could offer better advice.

balos1 avatar Sep 30 '22 17:09 balos1

A 10^6-equation example (100x100x100 nodes). Thanks for the profiling idea, I will do that this Monday!

mottelet avatar Sep 30 '22 17:09 mottelet

Hello, with this profiling (tests with SUNDIALS 6.2.0 on an old Linux server):

CVode                                   	  99.95% 	         34.415714s 	 34.415714s 	 1
SUNNonlinSolSolve                       	  84.63% 	         29.141854s 	 29.141854s 	 183
SUNLinSolSolve                          	  73.17% 	         25.194771s 	 25.194771s 	 190
N_VLinearSum                            	  34.81% 	         11.987342s 	 11.987342s 	 6948
N_VDotProd                              	  17.28% 	         5.950969s 	 5.950969s 	 4136
N_VWrmsNorm                             	  15.04% 	         5.179683s 	 5.179683s 	 1648
RHS computation                         	  10.16% 	         3.500166s 	 3.500166s 	 1134
N_VScale                                	   9.32% 	         3.208320s 	 3.208320s 	 2334

I realized that even with domain decomposition the RHS computation is not the dominant cost. Hence I modified my code to use the PThreads implementation of the NVector module. I obtained the following profile (with 8 threads):

CVode                                   	  99.93% 	         18.155444s 	 18.155444s 	 1
SUNNonlinSolSolve                       	  86.32% 	         15.682584s 	 15.682584s 	 237
SUNLinSolSolve                          	  74.00% 	         13.443851s 	 13.443851s 	 257
N_VLinearSum                            	  32.61% 	         5.924322s 	 5.924322s 	 9334
RHS computation                         	  18.64% 	         3.386912s 	 3.386912s 	 1536
N_VDotProd                              	  14.83% 	         2.694207s 	 2.694207s 	 5610
N_VWrmsNorm                             	  12.66% 	         2.300057s 	 2.300057s 	 2171
N_VScale                                	   9.19% 	         1.669280s 	 1.669280s 	 3113

which shows a 2x speedup. However, I am very confused about the different counts (last column), since I only changed the NVector implementation.

To investigate this I ran the same test on my macOS machine (which has only 4 physical cores + hyperthreading). I got the same counts with the serial NVector implementation (but faster runs, as the machine is quite recent compared to the Linux server):

CVode                                   	  99.96% 	         20.328756s 	 20.328756s 	 1
SUNNonlinSolSolve                       	  87.33% 	         17.759588s 	 17.759588s 	 183
SUNLinSolSolve                          	  76.40% 	         15.537267s 	 15.537267s 	 190
N_VDotProd                              	  29.91% 	         6.082517s 	 6.082517s 	 4136
N_VLinearSum                            	  23.16% 	         4.709540s 	 4.709540s 	 6948
N_VWrmsNorm                             	  21.76% 	         4.424661s 	 4.424661s 	 1648
RHS computation                         	  10.29% 	         2.093031s 	 2.093031s 	 1134
N_VScale                                	   6.46% 	         1.314471s 	 1.314471s 	 2334

The PThreads NVector is also twice as fast, but I again got different counts, which also differ from the Linux version:

CVode                                   	  99.90% 	         10.755109s 	 10.755109s 	 1
SUNNonlinSolSolve                       	  86.69% 	         9.332761s 	 9.332761s 	 192
SUNLinSolSolve                          	  74.33% 	         8.001688s 	 8.001688s 	 206
N_VLinearSum                            	  30.65% 	         3.300109s 	 3.300109s 	 7413
RHS computation                         	  21.29% 	         2.291468s 	 2.291468s 	 1228
N_VDotProd                              	  13.98% 	         1.504504s 	 1.504504s 	 4480
N_VWrmsNorm                             	  12.87% 	         1.385081s 	 1.385081s 	 1721
N_VScale                                	   9.17% 	         0.986839s 	 0.986839s 	 2488

I have played further with the number of threads and the counts change each time. What could explain these differences?

mottelet avatar Oct 03 '22 09:10 mottelet

> I have played further with the number of threads and the counts change each time. What could explain these differences?

How are you decomposing the problem w.r.t. threading? Are you only using threads in the RHS and via the PThreads NVector? Or are you solving multiple ODE systems in different threads by creating multiple CVode instances?

Also, how are you setting up the profiling? Are you creating a SUNProfiler object with SUNProfiler_Create or are you using the default one that is created and attached to the SUNContext object?
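In case it helps, here is a minimal sketch of the two setups (SUNDIALS 6.x API assumed; the "suncube" title is just a placeholder, and the ownership comments reflect my understanding of who frees what):

  #include <cstdio>
  #include <sundials/sundials_context.h>
  #include <sundials/sundials_profiler.h>

  void profiler_setups(void)
  {
    SUNContext sunctx;

    // (a) Use the profiler the SUNContext creates by default:
    SUNProfiler profobj;
    SUNContext_Create(NULL, &sunctx);
    SUNContext_GetProfiler(sunctx, &profobj); // borrowed: do NOT call SUNProfiler_Free on it
    // ... set up CVode with sunctx, integrate ...
    SUNProfiler_Print(profobj, stdout);
    SUNContext_Free(&sunctx);                 // frees the default profiler as well

    // (b) Create your own profiler and attach it to the context:
    SUNProfiler myprof;
    SUNProfiler_Create(NULL, "suncube", &myprof);
    SUNContext_Create(NULL, &sunctx);
    SUNContext_SetProfiler(sunctx, myprof);
    // ... integrate ...
    SUNProfiler_Print(myprof, stdout);
    SUNContext_Free(&sunctx);
    SUNProfiler_Free(&myprof);                // you created it, so you free it
  }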

balos1 avatar Oct 03 '22 18:10 balos1

You can get the source here: suncube_pt.cpp.zip

I use threads in the RHS computation (with std::thread) and also via the NVector. A dedicated SUNProfiler is created and attached. I tested further this evening by removing all the profiling code (I also recompiled SUNDIALS without the profiling feature), removing the use of std::thread for the RHS, and just displaying the overall statistics. My conclusion is that using 1 thread gives the same number of solver steps as using NVector_serial, while 2, 4, or 8 threads yield a larger number of steps. How can the number of threads change the solver results themselves?

mottelet avatar Oct 03 '22 18:10 mottelet

A couple of things I noticed in the code that I doubt are the source of the problem, but that may lead to strange behavior:

  1. It seems you are using the SUNProfiler created automatically by the SUNContext

  SUNContext_Create(NULL, &sunctx);
  SUNContext_GetProfiler(sunctx, &profobj);

Since you are doing this, you should not call SUNProfiler_Free at the end. The SUNContext will handle deallocating this SUNProfiler when SUNContext_Free is called.

  2. On line 111 you call SUNProfiler_End(profobj,"RHS computation"); but this is not needed because the SUNDIALS_MARK_END(profobj, "RHS computation"); on line 39 handles this already.

Now, regarding the seemingly different behavior depending on threading and the system used: some problems (such as the Roberts problem that comes up in our examples) can be sensitive to the non-determinism associated with threading. This may or may not be the case for your problem. Here are some next steps you may take to understand what is going on:

  1. Output the solution for all runs so that they can be compared. It is important to understand whether (and how) the solutions differ, as well as the solution behavior itself. When you do further runs, compare solutions as well as the 'number of calls' and other integrator statistics.
  2. Try using threads only in the RHS. I.e., use threads in the RHS, but use the serial N_Vector.
  3. Try using the Pthreads N_Vector but without threading in the RHS (a minimal sketch of both vector setups follows this list).
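For points 2 and 3, the only difference in the vector setup is how the state vector is created (SUNDIALS 6.x API assumed; NEQ and the thread count below are placeholders):

  #include <nvector/nvector_serial.h>
  #include <nvector/nvector_pthreads.h>

  void make_state_vectors(SUNContext sunctx)
  {
    const sunindextype NEQ = 1000000;   // e.g. a 100x100x100 grid (placeholder)
    const int nthreads     = 8;

    // Serial vector: CVode's internal vector operations run on one core.
    N_Vector u_serial = N_VNew_Serial(NEQ, sunctx);

    // Pthreads vector: N_VLinearSum, N_VDotProd, etc. are split across nthreads threads.
    N_Vector u_pthreads = N_VNew_Pthreads(NEQ, nthreads, sunctx);

    // Pass whichever one you are testing to CVodeInit; whether the RHS itself
    // uses threads is independent of this choice.

    N_VDestroy(u_serial);
    N_VDestroy(u_pthreads);
  }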

balos1 avatar Oct 04 '22 18:10 balos1

Thanks for pointing out my mistakes.

As you said, the variability of results is due to the combination of the non-associativity of IEEE 754 addition and the unpredictability of the thread execution order. To illustrate this, for each state dimension (keeping the same number of threads, 8) I repeated the computation 10 times. The variability of the results (e.g. the number of steps) seems to increase with the dimension of the problem, but the expected precision (rtol=1e-4) seems to be respected (as seen on component NEQ/3 of the state). Threading the RHS computation or not does not seem to change anything, nor does taking a state dimension that is a multiple of the number of threads (below are results for the Pthreads N_Vector with RHS threading; a small toy example after the data illustrates the reduction-order effect):

NEQ=64000, NTHREADS=8:
nsteps=83, nfevals=107, U[NEQ/3]=1.37213779e+00
nsteps=83, nfevals=107, U[NEQ/3]=1.37213778e+00
nsteps=83, nfevals=107, U[NEQ/3]=1.37213778e+00
nsteps=83, nfevals=107, U[NEQ/3]=1.37213780e+00
nsteps=83, nfevals=107, U[NEQ/3]=1.37213778e+00
nsteps=83, nfevals=107, U[NEQ/3]=1.37213779e+00
nsteps=83, nfevals=107, U[NEQ/3]=1.37213781e+00
nsteps=83, nfevals=107, U[NEQ/3]=1.37213779e+00
nsteps=83, nfevals=107, U[NEQ/3]=1.37213793e+00
nsteps=83, nfevals=107, U[NEQ/3]=1.37213779e+00
NEQ=512000, NTHREADS=8:
nsteps=162, nfevals=217, U[NEQ/3]=1.37600681e+00
nsteps=156, nfevals=205, U[NEQ/3]=1.37591944e+00
nsteps=158, nfevals=209, U[NEQ/3]=1.37581862e+00
nsteps=163, nfevals=212, U[NEQ/3]=1.37565607e+00
nsteps=168, nfevals=221, U[NEQ/3]=1.37559596e+00
nsteps=159, nfevals=212, U[NEQ/3]=1.37592376e+00
nsteps=161, nfevals=217, U[NEQ/3]=1.37627769e+00
nsteps=163, nfevals=210, U[NEQ/3]=1.37544133e+00
nsteps=168, nfevals=223, U[NEQ/3]=1.37564804e+00
nsteps=155, nfevals=204, U[NEQ/3]=1.37634534e+00
NEQ=1124864, NTHREADS=8:
nsteps=195, nfevals=261, U[NEQ/3]=1.37615174e+00
nsteps=195, nfevals=261, U[NEQ/3]=1.37615809e+00
nsteps=209, nfevals=284, U[NEQ/3]=1.37553998e+00
nsteps=194, nfevals=261, U[NEQ/3]=1.37660335e+00
nsteps=189, nfevals=253, U[NEQ/3]=1.37720851e+00
nsteps=184, nfevals=243, U[NEQ/3]=1.37617627e+00
nsteps=187, nfevals=245, U[NEQ/3]=1.37731879e+00
nsteps=186, nfevals=244, U[NEQ/3]=1.37714674e+00
nsteps=192, nfevals=259, U[NEQ/3]=1.37665020e+00
nsteps=186, nfevals=246, U[NEQ/3]=1.37660197e+00
NEQ=2097152, NTHREADS=8:
nsteps=317, nfevals=464, U[NEQ/3]=1.37671663e+00
nsteps=286, nfevals=412, U[NEQ/3]=1.37605685e+00
nsteps=301, nfevals=433, U[NEQ/3]=1.37615913e+00
nsteps=295, nfevals=425, U[NEQ/3]=1.37641266e+00
nsteps=306, nfevals=445, U[NEQ/3]=1.37581009e+00
nsteps=281, nfevals=402, U[NEQ/3]=1.37658148e+00
nsteps=282, nfevals=410, U[NEQ/3]=1.37629313e+00
nsteps=305, nfevals=442, U[NEQ/3]=1.37664843e+00
nsteps=285, nfevals=408, U[NEQ/3]=1.37604755e+00
nsteps=297, nfevals=423, U[NEQ/3]=1.37836600e+00
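To make the reduction-order effect concrete, here is the small toy program mentioned above (independent of suncube_pt.cpp): the same data summed in one pass and in 8 chunks generally differs in the last few bits, which is enough for CVODE's error test to see slightly different norms and take a different step sequence from run to run.

  #include <cstdio>
  #include <numeric>
  #include <random>
  #include <vector>

  int main()
  {
    std::vector<double> x(1 << 20);
    std::mt19937 gen(42);
    std::uniform_real_distribution<double> dist(0.0, 1.0);
    for (auto &v : x) v = dist(gen);

    // One global left-to-right sum ...
    double s1 = std::accumulate(x.begin(), x.end(), 0.0);

    // ... versus 8 partial sums combined afterwards, the pattern a threaded
    // dot product or norm uses. IEEE 754 addition is not associative, so the
    // two results generally differ in the last few ULPs.
    double partial[8] = {0.0};
    const std::size_t chunk = x.size() / 8;
    for (int p = 0; p < 8; ++p)
      partial[p] = std::accumulate(x.begin() + p * chunk,
                                   x.begin() + (p + 1) * chunk, 0.0);
    double s2 = std::accumulate(partial, partial + 8, 0.0);

    std::printf("one-pass sum = %.17g\nchunked sum  = %.17g\ndifference   = %g\n",
                s1, s2, s1 - s2);
    return 0;
  }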

Source of the program here: suncube_pt.cpp.zip

mottelet avatar Oct 05 '22 09:10 mottelet

Closing now, but please reopen if the discussion should be continued.

balos1 avatar Jun 20 '23 17:06 balos1

Hi All

Just one more quick comment on this. The heat equation is parabolic and will need an effective preconditioner for an efficient solution. The profiling shows the largest times in the linear solver and the corresponding vector operations, which is what we would expect for an unpreconditioned heat equation. I strongly suggest you try out hypre (for large problems), using either their PFMG for structured-grid problems or AMG if you are running on an unstructured grid. Another option is the Ginkgo library for fast solves with sparse systems; this also assumes an unstructured-grid discretization. Note that both of these libraries will need the Jacobian matrix. We have a parallel heat equation example using hypre somewhere that you can use as a reference, which may help.
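For reference, a minimal sketch of the CVODE hook where such a preconditioner would be attached. The solve body below is only a placeholder Jacobi step for the 7-point Laplacian (unit diffusivity and grid spacing h assumed), not the recommended PFMG/AMG or Ginkgo solve, which would replace it:

  #include <cvode/cvode.h>
  #include <cvode/cvode_ls.h>
  #include <sundials/sundials_nvector.h>

  struct HeatData { realtype h; };   // placeholder user data holding the grid spacing

  // Approximately solve P z = r with P = I - gamma*diag(J), diag(J) = -6/h^2
  // for the 7-point Laplacian. A real preconditioner (e.g. one PFMG V-cycle)
  // would go here instead.
  static int PSolve(realtype t, N_Vector u, N_Vector fu, N_Vector r, N_Vector z,
                    realtype gamma, realtype delta, int lr, void *user_data)
  {
    HeatData *data = static_cast<HeatData *>(user_data);
    const realtype d = 1.0 + 6.0 * gamma / (data->h * data->h);
    N_VScale(1.0 / d, r, z);         // z = r / d
    return 0;
  }

  // Attach after CVodeSetLinearSolver(cvode_mem, LS, NULL), where LS is an
  // iterative SUNLinearSolver (PCG, SPGMR, ...) created with PREC_LEFT or
  // PREC_RIGHT; no setup function is needed for this placeholder:
  //   CVodeSetPreconditioner(cvode_mem, NULL, PSolve);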

cswoodward avatar Jun 20 '23 18:06 cswoodward

Closing due to inactivity.

balos1 avatar Mar 29 '24 21:03 balos1