HYPRE Struct - problems using GPU-aware MPI
Dear HYPRE developers,
Following on issue #1126, I've been able to implement HYPRE in my code and run it on multiple GPUs. However, when I try to enable GPU-aware MPI in HYPRE, I get the following type of segmentation fault when running the code:

[acn16:283118:0:283118] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x15450e000004)
==== backtrace (tid: 283118) ====
 0 0x0000000000012d20 __funlockfile()  :0
 1 0x00000000009a6891 hypre_FinalizeCommunication()  /scratch/project/open-29-3/hypre-master_paragpu2/src/struct_mv/struct_communication.c:1216
 2 0x00000000009b37de hypre_StructMatrixAssemble()  /scratch/project/open-29-3/hypre-master_paragpu2/src/struct_mv/struct_matrix.c:1436
 3 0x00000000009968c6 HYPRE_StructMatrixAssemble()  /scratch/project/open-29-3/hypre-master_paragpu2/src/struct_mv/HYPRE_struct_matrix.c:323
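For context, the crash happens at the very first matrix assembly. Below is a minimal sketch of the call sequence my code follows, not my actual setup: the grid, stencil, and coefficient values are placeholders, and I'm assuming hypre was configured with `--with-cuda` and `--enable-gpu-aware-mpi`, so the value array handed to hypre lives in device memory.

```c
/* Trimmed sketch (placeholder grid and stencil, not my actual problem).
 * Assumes hypre built with CUDA and GPU-aware MPI enabled; coefficients
 * are passed to hypre from device memory. */
#include <stdlib.h>
#include <mpi.h>
#include <cuda_runtime.h>
#include "HYPRE_utilities.h"
#include "HYPRE_struct_mv.h"

int main(int argc, char *argv[])
{
   int myid;
   MPI_Init(&argc, &argv);
   MPI_Comm_rank(MPI_COMM_WORLD, &myid);

   HYPRE_Init();
   HYPRE_SetMemoryLocation(HYPRE_MEMORY_DEVICE);  /* matrix data on the GPU */
   HYPRE_SetExecutionPolicy(HYPRE_EXEC_DEVICE);

   /* 2D grid, one 10x10 box per rank, boxes stacked along x (placeholder) */
   HYPRE_Int ilower[2] = {10 * myid, 0};
   HYPRE_Int iupper[2] = {10 * myid + 9, 9};

   HYPRE_StructGrid grid;
   HYPRE_StructGridCreate(MPI_COMM_WORLD, 2, &grid);
   HYPRE_StructGridSetExtents(grid, ilower, iupper);
   HYPRE_StructGridAssemble(grid);

   /* Standard 5-point stencil */
   HYPRE_StructStencil stencil;
   HYPRE_Int offsets[5][2] = {{0, 0}, {-1, 0}, {1, 0}, {0, -1}, {0, 1}};
   HYPRE_StructStencilCreate(2, 5, &stencil);
   for (int i = 0; i < 5; i++)
      HYPRE_StructStencilSetElement(stencil, i, offsets[i]);

   HYPRE_StructMatrix A;
   HYPRE_StructMatrixCreate(MPI_COMM_WORLD, grid, stencil, &A);
   HYPRE_StructMatrixInitialize(A);

   /* Fill coefficients on the host, copy them to the device, and pass
    * the device pointer to hypre (required when HYPRE_MEMORY_DEVICE is set) */
   HYPRE_Int stencil_indices[5] = {0, 1, 2, 3, 4};
   size_t    nvalues  = 10 * 10 * 5;
   double   *h_values = (double *) malloc(nvalues * sizeof(double));
   for (size_t i = 0; i < nvalues; i++)
      h_values[i] = (i % 5 == 0) ? 4.0 : -1.0;

   double *d_values;
   cudaMalloc((void **) &d_values, nvalues * sizeof(double));
   cudaMemcpy(d_values, h_values, nvalues * sizeof(double),
              cudaMemcpyHostToDevice);

   HYPRE_StructMatrixSetBoxValues(A, ilower, iupper, 5, stencil_indices, d_values);
   HYPRE_StructMatrixAssemble(A);   /* <-- the segfault occurs in here */

   cudaFree(d_values);
   free(h_values);
   HYPRE_StructMatrixDestroy(A);
   HYPRE_StructStencilDestroy(stencil);
   HYPRE_StructGridDestroy(grid);
   HYPRE_Finalize();
   MPI_Finalize();
   return 0;
}
```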
When HYPRE is not used, my code runs with GPU-aware MPI without problems. Any ideas what could be causing these errors?
Thank you, Ondrej
Hello, I have similar issues with hypre (CG + BoomerAMG, used through PETSc) with GPU-aware MPI.
- OpenMPI 4.x (not GPU-aware) -> OK for all my tests
- OpenMPI 4.x, GPU-aware -> KSP_DIVERGED for some tests
- OpenMPI 5.0.5, GPU-aware -> OK for all my tests!
The KSP_DIVERGED happens with hypre BoomerAMG above a certain number of GPUs and with the CG solver. The issue can be bypassed by switching to the BiCGStab solver...
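To be concrete about the workaround: the only change is the Krylov solver type. A rough sketch with PETSc's C API (recent PETSc, 3.19+ idiom; your setup may differ):

```c
/* Sketch of the workaround: keep hypre BoomerAMG as the preconditioner,
 * but use BiCGStab instead of CG as the Krylov solver (PETSc C API). */
#include <petscksp.h>

static PetscErrorCode setup_solver(KSP ksp)
{
  PC pc;

  PetscFunctionBeginUser;
  PetscCall(KSPGetPC(ksp, &pc));
  PetscCall(PCSetType(pc, PCHYPRE));
  PetscCall(PCHYPRESetType(pc, "boomeramg"));
  /* KSPCG diverges for me with GPU-aware OpenMPI 4.x; KSPBCGS does not */
  PetscCall(KSPSetType(ksp, KSPBCGS));
  PetscCall(KSPSetFromOptions(ksp));  /* still overridable via -ksp_type / -pc_type */
  PetscFunctionReturn(PETSC_SUCCESS);
}
```

Equivalently, from the command line: `-ksp_type bcgs -pc_type hypre -pc_hypre_type boomeramg`.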
Could you check your OpenMPI version and test with 5.0.5?
Thanks
Hi and thanks for your feedback!
I've been using OpenMPI 4.1.6, so I'll try a 5.x version and let you know the result. For me, the problem is not solver-dependent and occurs at the first assembly of the matrix.
Cheers, Ondrej
Hi again,
I've tested with OpenMPI 5.0.5 but I am unfortunately getting the same segfault:
[acn35:1588861:0:1588861] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x153b88e00004)
==== backtrace (tid:1588861) ====
 0 0x0000000000012d10 __funlockfile()  :0
 1 0x00000000009a6851 hypre_FinalizeCommunication()  /scratch/project/open-29-3/hypre-master_paragpu5/src/struct_mv/struct_communication.c:1216
 2 0x00000000009b379e hypre_StructMatrixAssemble()  /scratch/project/open-29-3/hypre-master_paragpu5/src/struct_mv/struct_matrix.c:1436
 3 0x0000000000996886 HYPRE_StructMatrixAssemble()  /scratch/project/open-29-3/hypre-master_paragpu5/src/struct_mv/HYPRE_struct_matrix.c:323
Any other ideas are welcome...
Cheers, Ondrej
Dear HYPRE developers, I would appreciate some additional feedback.
I have been trying to adapt one of the example codes, 'ex3.c', to reproduce the error that occurs on my cluster. The modified source code can be found here: https://github.com/ondrejchrenko/HYPRE_ex3
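In short, the changes follow the same pattern as the sketch in my first post: HYPRE_SetMemoryLocation(HYPRE_MEMORY_DEVICE) at startup, and every value array handed to hypre is copied to the device first. For the right-hand-side vector, for example, the modification looks roughly like the fragment below (simplified, placeholder sizes; 'grid', 'ilower' and 'iupper' are the ones created during grid setup, and the full code is in the repo above):

```c
/* Simplified fragment of the vector setup in the modified ex3.c:
 * values are staged on the host, copied to the device, and the
 * device pointer is passed to hypre (HYPRE_MEMORY_DEVICE is active). */
HYPRE_StructVector b;
HYPRE_StructVectorCreate(MPI_COMM_WORLD, grid, &b);
HYPRE_StructVectorInitialize(b);

int     nvalues = 10 * 10;                         /* placeholder box size */
double *h_rhs   = (double *) malloc(nvalues * sizeof(double));
for (int i = 0; i < nvalues; i++) h_rhs[i] = 1.0;  /* placeholder RHS */

double *d_rhs;
cudaMalloc((void **) &d_rhs, nvalues * sizeof(double));
cudaMemcpy(d_rhs, h_rhs, nvalues * sizeof(double), cudaMemcpyHostToDevice);

HYPRE_StructVectorSetBoxValues(b, ilower, iupper, d_rhs);
HYPRE_StructVectorAssemble(b);
```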
Could you please let me know:
- whether the modifications I've made correctly convert the given example for use on GPU clusters with CUDA-aware MPI
- whether you can reproduce the error when running the example on multiple GPUs with CUDA-aware MPI, or if the code runs fine for you
Cheers, Ondrej
@ondrejchrenko I apologize for the delay with this.
There was a bug in hypre for the scenario you described. Could you please test the PR linked to this issue?
Dear @victorapm, thank you for letting me know. I can't seem to get GPU-aware MPI working in my HYPRE application. It certainly works on the cluster (for my other codes), but not when used with HYPRE. It would really help me to have a simple example that has been successfully tested with GPU-aware MPI. Then I could ask the cluster support to work out possible compiler issues, etc.
For instance, earlier in this thread I linked an example which I think should run with GPU-aware MPI (because it does work on multiple GPUs with standard MPI). Could you check that example and let me know if I am correct, or whether additional changes are needed to make it compatible with GPU-aware MPI?
Cheers, Ondrej