MPI_Direct benefits?
Describe the issue
I see no performance gain from MPI_Direct during the solve phase; the only improvement is during setup.
Environment information:
- OS: Linux
- CUDA runtime: All
- MPI version (if applicable):All
- AMGX version: 2.5.0
- NVIDIA GPU: All
AMGX solver configuration
AmgX config file
config_version=2
solver(s)=PCG
s:convergence=RELATIVE_INI_CORE
s:tolerance=1.000000e-06
s:preconditioner(p)=AMG
s:use_scalar_norm=1
p:error_scaling=0
p:print_grid_stats=1
p:max_iters=1
p:cycle=V
p:min_coarse_rows=2
p:max_levels=100
p:smoother(smoother)=BLOCK_JACOBI
p:presweeps=1
p:postsweeps=1
p:coarsest_sweeps=1
p:coarse_solver=DENSE_LU_SOLVER
p:dense_lu_num_rows=2
p:algorithm=CLASSICAL
p:selector=PMIS
p:interpolator=D2
p:strength=AHAT
p:strength_threshold=0.25
smoother:relaxation_factor=0.8
s:print_config=1
s:print_solve_stats=1
s:obtain_timings=1
s:store_res_history=1
s:monitor_residual=1
s:max_iters=10000
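For clarity, the only configuration difference between the two runs shown further down is the communicator selection. In the config-file syntax above, the MPI_Direct run corresponds to adding a single default-scope line (matching the default:communicator = MPI_DIRECT entry printed in the right-hand log):

communicator=MPI_DIRECT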
Matrix Data
Constant symmetric matrix
Reproduction steps
Additional context
The MPI library is CUDA-aware (I checked it).
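For reference, one way to confirm CUDA-aware support at run time is shown below (a minimal sketch assuming Open MPI; mpi-ext.h and MPIX_Query_cuda_support() are Open MPI extensions, and other MPI implementations expose this differently):

/* Checks whether the MPI library was built with, and currently reports,
 * CUDA-aware support. Open MPI only; build with mpicc, run under mpirun. */
#include <mpi.h>
#include <stdio.h>
#if defined(OPEN_MPI) && OPEN_MPI
#include <mpi-ext.h>  /* provides MPIX_CUDA_AWARE_SUPPORT and MPIX_Query_cuda_support() */
#endif

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
#if defined(MPIX_CUDA_AWARE_SUPPORT) && MPIX_CUDA_AWARE_SUPPORT
    printf("Compile-time CUDA-aware support: yes; runtime support: %s\n",
           MPIX_Query_cuda_support() ? "yes" : "no");
#else
    printf("This MPI build does not advertise CUDA-aware support.\n");
#endif
    MPI_Finalize();
    return 0;
}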
Here are the logs of two runs: the first (left) without MPI_Direct and the second (right) with MPI_Direct enabled:
Current_scope:parameter_name(new_scope) = parameter_value Current_scope:parameter_name(new_scope) = parameter_value
> default:communicator = MPI_DIRECT
default:exception_handling = 1 default:exception_handling = 1
default:solver(s) = PCG default:solver(s) = PCG
p:algorithm = CLASSICAL p:algorithm = CLASSICAL
p:coarse_solver = DENSE_LU_SOLVER p:coarse_solver = DENSE_LU_SOLVER
p:coarsest_sweeps = 1 p:coarsest_sweeps = 1
p:cycle = V p:cycle = V
p:dense_lu_num_rows = 2 p:dense_lu_num_rows = 2
p:error_scaling = 0 p:error_scaling = 0
p:interpolator = D2 p:interpolator = D2
p:max_iters = 1 p:max_iters = 1
p:max_levels = 100 p:max_levels = 100
p:min_coarse_rows = 2 p:min_coarse_rows = 2
p:postsweeps = 1 p:postsweeps = 1
p:presweeps = 1 p:presweeps = 1
p:print_grid_stats = 1 p:print_grid_stats = 1
p:selector = PMIS p:selector = PMIS
p:smoother(smoother) = BLOCK_JACOBI p:smoother(smoother) = BLOCK_JACOBI
p:strength = AHAT p:strength = AHAT
p:strength_threshold = 0.25 p:strength_threshold = 0.25
s:convergence = RELATIVE_INI_CORE s:convergence = RELATIVE_INI_CORE
s:max_iters = 10000 s:max_iters = 10000
s:monitor_residual = 1 s:monitor_residual = 1
s:obtain_timings = 1 s:obtain_timings = 1
s:preconditioner(p) = AMG s:preconditioner(p) = AMG
s:print_config = 1 s:print_config = 1
s:print_solve_stats = 1 s:print_solve_stats = 1
s:store_res_history = 1 s:store_res_history = 1
s:tolerance = 1e-06 s:tolerance = 1e-06
s:use_scalar_norm = 1 s:use_scalar_norm = 1
smoother:relaxation_factor = 0.8 smoother:relaxation_factor = 0.8
Using Normal MPI (Hostbuffer) communicator... | Using CUDA-Aware MPI (GPU Direct) communicator...
AMG Grid: AMG Grid:
Number of Levels: 6 Number of Levels: 6
LVL ROWS NNZ PARTS SPRS LVL ROWS NNZ PARTS SPRS
----------------------------------------------------- -----------------------------------------------------
0(D) 104190874 1379908634 128 1.27e- 0(D) 104190874 1379908634 128 1.27e-
1(D) 32061612 6265438280 128 6.1e- 1(D) 32061612 6265438280 128 6.1e-
2(D) 6633608 3789277270 128 8.61e- 2(D) 6633608 3789277270 128 8.61e-
3(D) 876879 1916340133 128 0.002 3(D) 876879 1916340133 128 0.002
4(D) 53051 90722549 128 0.03 4(D) 53051 90722549 128 0.03
5(D) 2986 6035356 128 0.6 5(D) 2986 6035356 128 0.6
---------------------------------------------------- ----------------------------------------------------
Grid Complexity: 1.38034 Grid Complexity: 1.38034
Operator Complexity: 9.74537 Operator Complexity: 9.74537
Total Memory Usage: 784.462 GB Total Memory Usage: 784.462 GB
---------------------------------------------------- ----------------------------------------------------
[AmgX] Time to set matrix (copy+setup) on GPU: 54.8983 | [AmgX] Time to set matrix (copy+setup) on GPU: 32.1811
Order of the PETSc matrix : 104190874 (~ 818443 unknowns per Order of the PETSc matrix : 104190874 (~ 818443 unknowns per
iter Mem Usage (GB) residual iter Mem Usage (GB) residual
---------------------------------------------------- ----------------------------------------------------
Ini 19.0435 5.211793e-02 | Ini 19.1255 5.211793e-02
0 19.0435 6.245695e-03 0. | 0 19.1255 6.245695e-03 0.
1 19.0435 1.509051e-03 0. | 1 19.1255 1.509051e-03 0.
2 19.0435 6.103571e-04 0. | 2 19.1255 6.103571e-04 0.
...
149 19.0435 6.567836e-08 0. | 149 19.1255 6.567819e-08 0.
150 19.0435 6.077506e-08 0. | 150 19.1255 6.077435e-08 0.
151 19.0435 5.702194e-08 0. | 151 19.1255 5.701868e-08 0.
152 19.0435 5.471054e-08 0. | 152 19.1255 5.469556e-08 0.
153 19.0435 5.226345e-08 0. | 153 19.1255 5.219956e-08 0.
154 19.0435 5.051133e-08 0. | 154 19.1255 5.022366e-08 0.
---------------------------------------------------- ----------------------------------------------------
Total Iterations: 155 Total Iterations: 155
Avg Convergence Rate: 0.9145 Avg Convergence Rate: 0.9145
Final Residual: 5.051133e-08 | Final Residual: 5.022366e-08
Total Reduction in Residual: 9.691737e-07 | Total Reduction in Residual: 9.636541e-07
Maximum Memory Usage: 19.043 GB | Maximum Memory Usage: 19.125 GB
---------------------------------------------------- | ----------------------------------------------------
Total Time: 41.2655 | Total Time: 18.7393
setup: 39.093 s | setup: 16.5902 s
solve: 2.17245 s | solve: 2.14901 s
solve(per iteration): 0.0140158 s | solve(per iteration): 0.0138646 s
Is it normal that successive solves of this constant-matrix linear system don't benefit from MPI_Direct, or am I missing something?
Thanks,
NB: The problem being solved is incompressible flow on ~220M cells using 128 H100 GPUs. The surrounding code has its own MPI communications, which do benefit from the CUDA-aware MPI library.
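For context, the usage in question follows the standard setup-once / solve-many pattern. Below is a minimal sketch of that pattern with the AmgX C API (handle names and the config path are placeholders, and the distributed matrix upload is omitted; in our case AmgX is driven through a PETSc wrapper, so this is illustrative only, not our actual code):

/* Setup-once / solve-many sketch with the AmgX C API (MPI build).
 * Assumes MPI_Init() and AMGX_initialize() have already been called. */
#include <mpi.h>
#include <amgx_c.h>

void solve_many(MPI_Comm comm, int device, int n,
                const double *rhs, double *sol, int nsolves)
{
    AMGX_config_handle    cfg;
    AMGX_resources_handle rsrc;
    AMGX_matrix_handle    A;
    AMGX_vector_handle    b, x;
    AMGX_solver_handle    solver;

    AMGX_config_create_from_file(&cfg, "pcg_classical_amg.cfg");  /* placeholder path */
    AMGX_config_add_parameters(&cfg, "communicator=MPI_DIRECT");  /* the only change between the two runs */

    AMGX_resources_create(&rsrc, cfg, &comm, 1, &device);
    AMGX_matrix_create(&A, rsrc, AMGX_mode_dDDI);
    AMGX_vector_create(&b, rsrc, AMGX_mode_dDDI);
    AMGX_vector_create(&x, rsrc, AMGX_mode_dDDI);
    AMGX_solver_create(&solver, rsrc, AMGX_mode_dDDI, cfg);

    /* ... distributed upload of the local CSR block into A omitted for brevity ... */

    AMGX_solver_setup(solver, A);             /* done once; this is where MPI_Direct helps */
    for (int i = 0; i < nsolves; ++i) {
        AMGX_vector_upload(b, n, 1, rhs);     /* new right-hand side each solve */
        AMGX_vector_upload(x, n, 1, sol);     /* initial guess */
        AMGX_solver_solve(solver, b, x);      /* repeated; no visible gain observed here */
        AMGX_vector_download(x, sol);
    }

    AMGX_solver_destroy(solver);
    AMGX_vector_destroy(x);
    AMGX_vector_destroy(b);
    AMGX_matrix_destroy(A);
    AMGX_resources_destroy(rsrc);
    AMGX_config_destroy(cfg);
}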