MPI_Direct benefits?
Describe the issue
I see no performance gain from MPI_Direct during the solve phase; the only improvement is during setup.
Environment information:
- OS: Linux
- CUDA runtime: All
- MPI version (if applicable):All
- AMGX version: 2.5.0
- NVIDIA GPU: All
AMGX solver configuration
AmgX config file
config_version=2
solver(s)=PCG
s:convergence=RELATIVE_INI_CORE
s:tolerance=1.000000e-06
s:preconditioner(p)=AMG
s:use_scalar_norm=1
p:error_scaling=0
p:print_grid_stats=1
p:max_iters=1
p:cycle=V
p:min_coarse_rows=2
p:max_levels=100
p:smoother(smoother)=BLOCK_JACOBI
p:presweeps=1
p:postsweeps=1
p:coarsest_sweeps=1
p:coarse_solver=DENSE_LU_SOLVER
p:dense_lu_num_rows=2
p:algorithm=CLASSICAL
p:selector=PMIS
p:interpolator=D2
p:strength=AHAT
p:strength_threshold=0.25
smoother:relaxation_factor=0.8
s:print_config=1
s:print_solve_stats=1
s:obtain_timings=1
s:store_res_history=1
s:monitor_residual=1
s:max_iters=10000
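For clarity, the only configuration difference between the two runs shown further down is the communicator selection. In the config-file syntax above, the MPI_Direct run corresponds to adding a single default-scope line (matching the default:communicator = MPI_DIRECT entry printed in the right-hand log):

communicator=MPI_DIRECT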
Matrix Data
Constant symmetric matrix
Reproduction steps
Additional context
The MPI library is CUDA-aware (I checked it).
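For reference, one way to confirm CUDA-aware support at run time is shown below (a minimal sketch assuming Open MPI; mpi-ext.h and MPIX_Query_cuda_support() are Open MPI extensions, and other MPI implementations expose this differently):

/* Checks whether the MPI library was built with, and currently reports,
 * CUDA-aware support. Open MPI only; build with mpicc, run under mpirun. */
#include <mpi.h>
#include <stdio.h>
#if defined(OPEN_MPI) && OPEN_MPI
#include <mpi-ext.h>  /* provides MPIX_CUDA_AWARE_SUPPORT and MPIX_Query_cuda_support() */
#endif

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
#if defined(MPIX_CUDA_AWARE_SUPPORT) && MPIX_CUDA_AWARE_SUPPORT
    printf("Compile-time CUDA-aware support: yes; runtime support: %s\n",
           MPIX_Query_cuda_support() ? "yes" : "no");
#else
    printf("This MPI build does not advertise CUDA-aware support.\n");
#endif
    MPI_Finalize();
    return 0;
}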
Here are the logs of two runs: the first (left) without MPI_Direct and the second (right) with MPI_Direct enabled:
Current_scope:parameter_name(new_scope) = parameter_value Current_scope:parameter_name(new_scope) = parameter_value
> default:communicator = MPI_DIRECT
default:exception_handling = 1 default:exception_handling = 1
default:solver(s) = PCG default:solver(s) = PCG
p:algorithm = CLASSICAL p:algorithm = CLASSICAL
p:coarse_solver = DENSE_LU_SOLVER p:coarse_solver = DENSE_LU_SOLVER
p:coarsest_sweeps = 1 p:coarsest_sweeps = 1
p:cycle = V p:cycle = V
p:dense_lu_num_rows = 2 p:dense_lu_num_rows = 2
p:error_scaling = 0 p:error_scaling = 0
p:interpolator = D2 p:interpolator = D2
p:max_iters = 1 p:max_iters = 1
p:max_levels = 100 p:max_levels = 100
p:min_coarse_rows = 2 p:min_coarse_rows = 2
p:postsweeps = 1 p:postsweeps = 1
p:presweeps = 1 p:presweeps = 1
p:print_grid_stats = 1 p:print_grid_stats = 1
p:selector = PMIS p:selector = PMIS
p:smoother(smoother) = BLOCK_JACOBI p:smoother(smoother) = BLOCK_JACOBI
p:strength = AHAT p:strength = AHAT
p:strength_threshold = 0.25 p:strength_threshold = 0.25
s:convergence = RELATIVE_INI_CORE s:convergence = RELATIVE_INI_CORE
s:max_iters = 10000 s:max_iters = 10000
s:monitor_residual = 1 s:monitor_residual = 1
s:obtain_timings = 1 s:obtain_timings = 1
s:preconditioner(p) = AMG s:preconditioner(p) = AMG
s:print_config = 1 s:print_config = 1
s:print_solve_stats = 1 s:print_solve_stats = 1
s:store_res_history = 1 s:store_res_history = 1
s:tolerance = 1e-06 s:tolerance = 1e-06
s:use_scalar_norm = 1 s:use_scalar_norm = 1
smoother:relaxation_factor = 0.8 smoother:relaxation_factor = 0.8
Using Normal MPI (Hostbuffer) communicator... | Using CUDA-Aware MPI (GPU Direct) communicator...
AMG Grid: AMG Grid:
Number of Levels: 6 Number of Levels: 6
LVL ROWS NNZ PARTS SPRS LVL ROWS NNZ PARTS SPRS
----------------------------------------------------- -----------------------------------------------------
0(D) 104190874 1379908634 128 1.27e- 0(D) 104190874 1379908634 128 1.27e-
1(D) 32061612 6265438280 128 6.1e- 1(D) 32061612 6265438280 128 6.1e-
2(D) 6633608 3789277270 128 8.61e- 2(D) 6633608 3789277270 128 8.61e-
3(D) 876879 1916340133 128 0.002 3(D) 876879 1916340133 128 0.002
4(D) 53051 90722549 128 0.03 4(D) 53051 90722549 128 0.03
5(D) 2986 6035356 128 0.6 5(D) 2986 6035356 128 0.6
---------------------------------------------------- ----------------------------------------------------
Grid Complexity: 1.38034 Grid Complexity: 1.38034
Operator Complexity: 9.74537 Operator Complexity: 9.74537
Total Memory Usage: 784.462 GB Total Memory Usage: 784.462 GB
---------------------------------------------------- ----------------------------------------------------
[AmgX] Time to set matrix (copy+setup) on GPU: 54.8983 | [AmgX] Time to set matrix (copy+setup) on GPU: 32.1811
Order of the PETSc matrix : 104190874 (~ 818443 unknowns per Order of the PETSc matrix : 104190874 (~ 818443 unknowns per
iter Mem Usage (GB) residual iter Mem Usage (GB) residual
---------------------------------------------------- ----------------------------------------------------
Ini 19.0435 5.211793e-02 | Ini 19.1255 5.211793e-02
0 19.0435 6.245695e-03 0. | 0 19.1255 6.245695e-03 0.
1 19.0435 1.509051e-03 0. | 1 19.1255 1.509051e-03 0.
2 19.0435 6.103571e-04 0. | 2 19.1255 6.103571e-04 0.
...
149 19.0435 6.567836e-08 0. | 149 19.1255 6.567819e-08 0.
150 19.0435 6.077506e-08 0. | 150 19.1255 6.077435e-08 0.
151 19.0435 5.702194e-08 0. | 151 19.1255 5.701868e-08 0.
152 19.0435 5.471054e-08 0. | 152 19.1255 5.469556e-08 0.
153 19.0435 5.226345e-08 0. | 153 19.1255 5.219956e-08 0.
154 19.0435 5.051133e-08 0. | 154 19.1255 5.022366e-08 0.
---------------------------------------------------- ----------------------------------------------------
Total Iterations: 155 Total Iterations: 155
Avg Convergence Rate: 0.9145 Avg Convergence Rate: 0.9145
Final Residual: 5.051133e-08 | Final Residual: 5.022366e-08
Total Reduction in Residual: 9.691737e-07 | Total Reduction in Residual: 9.636541e-07
Maximum Memory Usage: 19.043 GB | Maximum Memory Usage: 19.125 GB
---------------------------------------------------- | ----------------------------------------------------
Total Time: 41.2655 | Total Time: 18.7393
setup: 39.093 s | setup: 16.5902 s
solve: 2.17245 s | solve: 2.14901 s
solve(per iteration): 0.0140158 s | solve(per iteration): 0.0138646 s
Is it normal that successive solves of this constant-matrix linear system don't benefit from MPI_Direct, or am I missing something?
Thanks,
NB: The problem being solved is incompressible flow on ~220M cells using 128 H100 GPUs. The surrounding code has its own MPI communications, which do benefit from the CUDA-aware MPI library.
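For context, the usage in question follows the standard setup-once / solve-many pattern. Below is a minimal sketch of that pattern with the AmgX C API (handle names and the config path are placeholders, and the distributed matrix upload is omitted; in our case AmgX is driven through a PETSc wrapper, so this is illustrative only, not our actual code):

/* Setup-once / solve-many sketch with the AmgX C API (MPI build).
 * Assumes MPI_Init() and AMGX_initialize() have already been called. */
#include <mpi.h>
#include <amgx_c.h>

void solve_many(MPI_Comm comm, int device, int n,
                const double *rhs, double *sol, int nsolves)
{
    AMGX_config_handle    cfg;
    AMGX_resources_handle rsrc;
    AMGX_matrix_handle    A;
    AMGX_vector_handle    b, x;
    AMGX_solver_handle    solver;

    AMGX_config_create_from_file(&cfg, "pcg_classical_amg.cfg");  /* placeholder path */
    AMGX_config_add_parameters(&cfg, "communicator=MPI_DIRECT");  /* the only change between the two runs */

    AMGX_resources_create(&rsrc, cfg, &comm, 1, &device);
    AMGX_matrix_create(&A, rsrc, AMGX_mode_dDDI);
    AMGX_vector_create(&b, rsrc, AMGX_mode_dDDI);
    AMGX_vector_create(&x, rsrc, AMGX_mode_dDDI);
    AMGX_solver_create(&solver, rsrc, AMGX_mode_dDDI, cfg);

    /* ... distributed upload of the local CSR block into A omitted for brevity ... */

    AMGX_solver_setup(solver, A);             /* done once; this is where MPI_Direct helps */
    for (int i = 0; i < nsolves; ++i) {
        AMGX_vector_upload(b, n, 1, rhs);     /* new right-hand side each solve */
        AMGX_vector_upload(x, n, 1, sol);     /* initial guess */
        AMGX_solver_solve(solver, b, x);      /* repeated; no visible gain observed here */
        AMGX_vector_download(x, sol);
    }

    AMGX_solver_destroy(solver);
    AMGX_vector_destroy(x);
    AMGX_vector_destroy(b);
    AMGX_matrix_destroy(A);
    AMGX_resources_destroy(rsrc);
    AMGX_config_destroy(cfg);
}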