Classical AMG struggling to converge in distributed mode for singular system

Open joconnor22 opened this issue 4 years ago • 4 comments

First of all, many thanks for building this library. It is very useful, and your hard work and effort are very much appreciated.

I have a Pressure Poisson Equation with pure Neumann boundary conditions (a singular system), for which a solution exists only up to an arbitrary constant. The matrix file (.mtx) and source file (with configuration settings) are attached to reproduce the tests below.

I realise that this probably isn't an application where AmgX would be expected to perform well. Nevertheless, for a single GPU, AmgX converges to a solution with no problems.

Single GPU

         Number of Levels: 2
            LVL         ROWS               NNZ    SPRSTY       Mem (GB)
         --------------------------------------------------------------
           0(D)         2601             56637   0.00837       0.000698
           1(D)          483             13183    0.0565       0.000309
         --------------------------------------------------------------
         Grid Complexity: 1.1857
         Operator Complexity: 1.23276
         Total Memory Usage: 0.00100739 GB
         --------------------------------------------------------------
         Total Iterations: 14
         Avg Convergence Rate: 		         0.1242
         Final Residual: 		   1.012780e-05
         Total Reduction in Residual: 	   2.077555e-13
         Maximum Memory Usage: 		          0.916 GB
         --------------------------------------------------------------
Total Time: 0.026293
    setup: 0.015236 s
    solve: 0.011057 s
    solve(per iteration): 0.000789785 s

However, when testing the same problem with the same configuration settings on two GPUs, AmgX does not converge to a solution.

Two GPUs

         Number of Levels: 2
            LVL         ROWS               NNZ    SPRSTY       Mem (GB)
         --------------------------------------------------------------
           0(D)         2601             56637   0.00837       0.000756
           1(D)          486             13222     0.056       0.000394
         --------------------------------------------------------------
         Grid Complexity: 1.18685
         Operator Complexity: 1.23345
         Total Memory Usage: 0.00115038 GB
         --------------------------------------------------------------
         Total Iterations: 100
         Avg Convergence Rate: 		         0.8936
         Final Residual: 		   6.361270e+02
         Total Reduction in Residual: 	   1.304912e-05
         Maximum Memory Usage: 		          1.345 GB
         --------------------------------------------------------------
Total Time: 0.459247
    setup: 0.0367741 s
    solve: 0.422472 s
    solve(per iteration): 0.00422472 s

After playing around with the configuration settings I can get a converged solution by increasing the number of iterations in the AMG preconditioner step. However, the solve time is then approximately 100x that of the single-GPU case.

Two GPUs (precon:max_iters=100)

         Total Iterations: 4
         Avg Convergence Rate: 		         0.0001
         Final Residual: 		   1.430485e-08
         Total Reduction in Residual: 	   2.934410e-16
         Maximum Memory Usage: 		          1.336 GB
         --------------------------------------------------------------
Total Time: 1.19626
    setup: 0.0366399 s
    solve: 1.15962 s
    solve(per iteration): 0.289904 s

To get around this I can obviously pin the solution at a particular point to make the matrix non-singular (which I have tested). However, this isn't always an ideal solution.
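
For reference, here is a rough sketch of what I mean by pinning, assuming a 0-based CSR matrix held in plain std::vectors (the names are only illustrative, not taken from the attached code):

// Pin the solution at row `pin` so the otherwise singular system has a
// unique solution: replace the row with an identity row and fix b[pin].
#include <vector>

void pinRow(std::vector<int>& rowPtr, std::vector<int>& colInd,
            std::vector<double>& values, std::vector<double>& b, int pin)
{
    for (int k = rowPtr[pin]; k < rowPtr[pin + 1]; ++k)
        values[k] = (colInd[k] == pin) ? 1.0 : 0.0;   // 1 on the diagonal, 0 elsewhere
    b[pin] = 0.0;   // any constant works, since the solution is only defined up to a constant
}

Note that this clears only the row, not the matching column, so the pinned matrix is no longer symmetric; that is fine for GMRES but worth keeping in mind.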

From reading the documentation I can see that the distributed version of classical AMG is slightly weaker due to coupling issues. Also, as I said before, I wouldn't have expected AmgX to perform particularly well for this type of problem. Nevertheless, the difference in performance between the single-GPU and distributed runs seems interesting, and I was wondering whether there are any particular settings that might help with the coupling or the overall performance of the distributed version?

Attachments: main.cpp.txt, System.mtx.txt

joconnor22 avatar Jun 04 '20 16:06 joconnor22

Poisson's equation with Neumann conditions all around the boundary is ill-posed, so I would not try to solve Ax=b for a singular system. The only reason the iterative solver converges is the initial guess you provide; if you were to solve that system directly you would get NaNs. You can make the formulation well-posed by removing the mean value, without needing to fix the solution at a given point. See this paper: https://asmedigitalcollection.asme.org/IMECE/proceedings-abstract/IMECE2018/52101/V007T09A021/276551.
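
Something along these lines is enough (just a sketch with illustrative names; in a distributed run the sum would need an MPI_Allreduce):

// Project out the constant null-space component by subtracting the mean.
// Applying this to the right-hand side (and/or the iterate) keeps the
// pure-Neumann problem consistent without pinning a point.
#include <numeric>
#include <vector>

void removeMean(std::vector<double>& v)
{
    double mean = std::accumulate(v.begin(), v.end(), 0.0) / v.size();
    for (double& x : v) x -= mean;
}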

Jaberh avatar Jun 04 '20 16:06 Jaberh

Thanks for your reply. Yes, the system is ill-posed, which is why I didn't expect AmgX to perform particularly well for this case (and was surprised to find that the single-GPU case seems to work fine). I'm aware of a few techniques for getting around this and will take a look at the link you sent as well.

My question is more about why AmgX seems to solve this system fine when running on a single GPU but struggles in distributed mode across multiple GPUs, and whether there are any specific config settings I'm missing that could improve the convergence or performance of this system in distributed mode.

joconnor22 avatar Jun 05 '20 10:06 joconnor22

@Jaberh thanks for chipping in!

@joconnor22 Sometimes the distributed solver is not identical to the same matrix solved with the same configuration on a single GPU. For example, for aggregation multigrid we don't create aggregates that span multiple ranks. We have previously seen that this can affect convergence, and I can imagine it might be even more noticeable with an ill-posed problem. However, this might not be the only reason for such a drastic change in results when switching to the distributed solver. What config do you use?

marsaev avatar Jul 28 '20 01:07 marsaev

Thanks for your reply. That confirms my initial thoughts.

Here are my config settings. I have played around with these quite a bit, but unless I increase the number of preconditioner iterations (which increases the overall solve time considerably) I can't seem to find a combination that works well in distributed mode.

// AMGX config settings (built up as a single comma-separated option string)
#include <string>

std::string configStr = "";
configStr += "config_version=2,";
configStr += "verbosity_level=3,";
configStr += "determinism_flag=1,";
configStr += "communicator=MPI,";
configStr += "solver(sol)=GMRES,";
configStr += "sol:print_solve_stats=1,";
configStr += "sol:obtain_timings=1,";
configStr += "sol:convergence=RELATIVE_INI_CORE,";
configStr += "sol:monitor_residual=1,";
configStr += "sol:preconditioner(precon)=AMG,";
configStr += "precon:print_grid_stats=1,";
configStr += "precon:interpolator=D2,";
configStr += "precon:max_iters=1,";

Thanks again for your help.

joconnor22 avatar Jul 28 '20 16:07 joconnor22