Use deal.II CG solver instead of Trilinos
We use CG solvers for the inner A and S blocks of the preconditioner when doing the expensive iterations. For the GMG preconditioner we use the CG solver provided by deal.II in deal.II/lac/solver_cg.h, but for the AMG preconditioner we use the CG solver provided by Trilinos through deal.II/lac/trilinos_solver.h. When working on #5613 I tried to use the deal.II CG solver for the AMG preconditioner instead. This makes no difference for the number of A or S iterations (they are literally identical) or for the solution (also identical), but the deal.II solver seems to be consistently faster, by about 10-20% of the total Stokes solver time (including the outer GMRES iterations).
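For reference, here is a minimal sketch of what the swap looks like at the deal.II API level; the helper functions, tolerance, and iteration limit below are made up for illustration (this is not ASPECT's actual solver code), but both variants operate on the same Trilinos matrix, vectors, and AMG preconditioner:

#include <deal.II/lac/solver_cg.h>
#include <deal.II/lac/solver_control.h>
#include <deal.II/lac/trilinos_precondition.h>
#include <deal.II/lac/trilinos_solver.h>
#include <deal.II/lac/trilinos_sparse_matrix.h>
#include <deal.II/lac/trilinos_vector.h>

using namespace dealii;

// Hypothetical helper: inner block solve with the Trilinos-provided CG
// (the current AMG code path).
void solve_block_with_trilinos_cg(const TrilinosWrappers::SparseMatrix    &A,
                                  TrilinosWrappers::MPI::Vector           &x,
                                  const TrilinosWrappers::MPI::Vector     &b,
                                  const TrilinosWrappers::PreconditionAMG &amg)
{
  SolverControl control(1000, 1e-8 * b.l2_norm());
  TrilinosWrappers::SolverCG solver(control);
  solver.solve(A, x, b, amg);
}

// Hypothetical helper: the same solve with the templated deal.II CG, which
// also accepts Trilinos matrices/vectors and is what the GMG path already uses.
void solve_block_with_dealii_cg(const TrilinosWrappers::SparseMatrix    &A,
                                TrilinosWrappers::MPI::Vector           &x,
                                const TrilinosWrappers::MPI::Vector     &b,
                                const TrilinosWrappers::PreconditionAMG &amg)
{
  SolverControl control(1000, 1e-8 * b.l2_norm());
  SolverCG<TrilinosWrappers::MPI::Vector> solver(control);
  solver.solve(A, x, b, amg);
}

Since both variants run the same algorithm on the same objects, the iteration counts and solutions should coincide, which matches the observation above; only the implementation of the CG loop differs.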
Here is an example for the cookbook transform_fault_behn_2007/temperature_dependent.prm, run with the AMG preconditioner and only expensive Stokes iterations on Frontera:
With Trilinos CG:
-----------------------------------------------------------------------------
-- This is ASPECT --
-- The Advanced Solver for Planetary Evolution, Convection, and Tectonics. --
-----------------------------------------------------------------------------
-- . version 2.6.0-pre (main, b2e93b7a5)
-- . using deal.II 9.5.2
-- . with 64 bit indices and vectorization level 3 (512 bits)
-- . using Trilinos 13.2.0
-- . using p4est 2.3.2
-- . using Geodynamic World Builder 0.5.0
-- . running in OPTIMIZED mode
-- . running with 56 MPI processes
-----------------------------------------------------------------------------
...
+----------------------------------------------+------------+------------+
| Total wallclock time elapsed since start     |      98.6s |            |
|                                              |            |            |
| Section                          | no. calls |  wall time | % of total |
+----------------------------------+-----------+------------+------------+
| Assemble Stokes system           |        10 |      19.5s |        20% |
| Assemble temperature system      |        10 |      10.3s |        10% |
| Build Stokes preconditioner      |        10 |      15.8s |        16% |
| Build temperature preconditioner |        10 |     0.969s |      0.98% |
| Initialization                   |         1 |     0.159s |      0.16% |
| Postprocessing                   |         4 |     0.264s |      0.27% |
| Setup dof systems                |         1 |     0.154s |      0.16% |
| Setup initial conditions         |         1 |     0.125s |      0.13% |
| Setup matrices                   |         1 |      1.25s |       1.3% |
| Solve Stokes system              |        10 |      48.9s |        50% |
| Solve temperature system         |        10 |     0.731s |      0.74% |
+----------------------------------+-----------+------------+------------+
With this branch:
+----------------------------------------------+------------+------------+
| Total wallclock time elapsed since start     |      91.3s |            |
|                                              |            |            |
| Section                          | no. calls |  wall time | % of total |
+----------------------------------+-----------+------------+------------+
| Assemble Stokes system           |        10 |      19.2s |        21% |
| Assemble temperature system      |        10 |      10.6s |        12% |
| Build Stokes preconditioner      |        10 |      15.8s |        17% |
| Build temperature preconditioner |        10 |     0.967s |       1.1% |
| Initialization                   |         1 |     0.154s |      0.17% |
| Postprocessing                   |         4 |     0.273s |       0.3% |
| Setup dof systems                |         1 |     0.159s |      0.17% |
| Setup initial conditions         |         1 |     0.125s |      0.14% |
| Setup matrices                   |         1 |      1.25s |       1.4% |
| Solve Stokes system              |        10 |      41.8s |        46% |
| Solve temperature system         |        10 |     0.518s |      0.57% |
+----------------------------------+-----------+------------+------------+
To me it looks like the deal.II version is clearly superior and we should use that one. Since this is the version that is also used in the GMG solver, we know it scales well to large problem sizes. Is there any other reason to use the Trilinos solver that I didn't see?
The only thing I could think of is that they might have an implementation that does some tricks to give better scalability (avoiding communication or hiding latency). I am surprised how much faster our implementation is!
This fails on the following tests:
The following tests FAILED:
594 - parallel_output_group_0 (Failed)
595 - parallel_output_group_1 (Failed)
596 - parallel_output_group_2 (Failed)
691 - point_value_02_mpi (Failed)
891 - stokes_solver_fail_A (Failed)
892 - stokes_solver_fail_S (Failed)
and on another tester with:
The following tests FAILED:
27 - advection_solver_fail (Failed)
157 - checkpoint_07_enable_free_surface_resume (Failed)
891 - stokes_solver_fail_A (Failed)
892 - stokes_solver_fail_S (Failed)
There is not very much one can optimize about CG (unlike GMRES, for example). So this looks ok to me, and I'm not overly concerned about it making a difference on large parallel jobs.
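To illustrate that point, here is a bare-bones preconditioned CG loop, a generic sketch on std::vector with hypothetical apply_A and apply_M_inv callables (not the deal.II or Trilinos implementation): each iteration amounts to one matrix-vector product, one preconditioner application, a few dot products, and a few vector updates, so implementations can differ mainly in how those kernels and their global reductions are carried out.

#include <cmath>
#include <cstddef>
#include <vector>

// Generic preconditioned CG sketch: apply_A computes y = A*x and apply_M_inv
// computes y = M^{-1}*x (both hypothetical callables supplied by the caller).
// Returns the number of iterations performed.
template <class MatVec, class PrecApply>
std::size_t pcg(const MatVec &apply_A,
                const PrecApply &apply_M_inv,
                const std::vector<double> &b,
                std::vector<double> &x,
                const double tol = 1e-10,
                const std::size_t max_iter = 1000)
{
  const std::size_t n = b.size();
  std::vector<double> r(n), z(n), p(n), Ap(n);

  const auto dot = [](const std::vector<double> &u, const std::vector<double> &v) {
    double s = 0.0;
    for (std::size_t i = 0; i < u.size(); ++i)
      s += u[i] * v[i];
    return s;
  };

  apply_A(x, Ap);                        // r = b - A*x
  for (std::size_t i = 0; i < n; ++i)
    r[i] = b[i] - Ap[i];
  apply_M_inv(r, z);                     // z = M^{-1}*r
  p = z;
  double rz = dot(r, z);

  for (std::size_t k = 0; k < max_iter; ++k)
    {
      if (std::sqrt(dot(r, r)) < tol)
        return k;                        // converged
      apply_A(p, Ap);                    // the one matvec per iteration
      const double alpha = rz / dot(p, Ap);
      for (std::size_t i = 0; i < n; ++i)
        {
          x[i] += alpha * p[i];
          r[i] -= alpha * Ap[i];
        }
      apply_M_inv(r, z);                 // the one preconditioner apply
      const double rz_new = dot(r, z);
      const double beta = rz_new / rz;
      rz = rz_new;
      for (std::size_t i = 0; i < n; ++i)
        p[i] = z[i] + beta * p[i];
    }
  return max_iter;
}

GMRES, by contrast, builds a growing Krylov basis and orthogonalizes it, which leaves far more room for implementation choices (restart length, classical vs. modified Gram-Schmidt, and so on).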
I rebased the PR and updated the few changed test results. Most of the changed results are only marginally different.
The only significant differences are in the output of stokes_solver_fail_A and stokes_solver_fail_S, where the deal.II solver reports a different exception than the Trilinos solver. I updated that output too.