
HYPRE Struct - problems using GPU-aware MPI

Open ondrejchrenko opened this issue 1 year ago • 4 comments

Dear HYPRE developers,

Following up on issue #1126, I've been able to implement HYPRE in my code and run it on multiple GPUs. However, when I try to enable GPU-aware MPI in HYPRE, I get the following type of segmentation fault when running the code:

[acn16:283118:0:283118] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x15450e000004)
==== backtrace (tid: 283118) ====
 0 0x0000000000012d20 __funlockfile() :0
 1 0x00000000009a6891 hypre_FinalizeCommunication() /scratch/project/open-29-3/hypre-master_paragpu2/src/struct_mv/struct_communication.c:1216
 2 0x00000000009b37de hypre_StructMatrixAssemble() /scratch/project/open-29-3/hypre-master_paragpu2/src/struct_mv/struct_matrix.c:1436
 3 0x00000000009968c6 HYPRE_StructMatrixAssemble() /scratch/project/open-29-3/hypre-master_paragpu2/src/struct_mv/HYPRE_struct_matrix.c:323

When HYPRE is not used, my code runs with GPU-aware MPI without problems. Any ideas what could be causing these errors?
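For reference, this is roughly how I initialize hypre for the GPU at startup (a minimal sketch, not my exact code; I assume hypre was configured with CUDA support and, as far as I understand, with the GPU-aware MPI option enabled at build time):

#include <mpi.h>
#include "HYPRE_utilities.h"

int main(int argc, char *argv[])
{
   MPI_Init(&argc, &argv);

   /* initialize hypre (sets up its GPU backend when built with CUDA) */
   HYPRE_Init();

   /* keep matrix/vector data in GPU memory and run kernels on the device */
   HYPRE_SetMemoryLocation(HYPRE_MEMORY_DEVICE);
   HYPRE_SetExecutionPolicy(HYPRE_EXEC_DEVICE);

   /* ... build the Struct grid/stencil/matrix and solve here ... */

   HYPRE_Finalize();
   MPI_Finalize();
   return 0;
}

With GPU-aware MPI disabled in hypre, this setup runs on multiple GPUs without problems; the segfault appears only once GPU-aware MPI is enabled.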

Thank you, Ondrej

ondrejchrenko avatar Sep 17 '24 20:09 ondrejchrenko

Hello, I have similar issues with Hypre (CG + BoomerAMG, used through PETSc) with GPU-aware MPI.

  • OpenMPI 4.x (not GPU-aware) -> OK for all my tests
  • OpenMPI 4.x, GPU-aware -> KSP_DIVERGED for some tests
  • OpenMPI 5.0.5, GPU-aware -> OK for all my tests!

The KSP_DIVERGED happens with Hypre BoomerAMG above a certain number of GPUs when using the CG solver. The issue can be bypassed by switching to the BiCGStab solver...
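For what it's worth, the workaround looks roughly like this with PETSc's C API (a minimal sketch for a recent PETSc; the function name is illustrative, and the KSP is assumed to be created with its operators already set):

#include <petscksp.h>

/* Sketch: switch the Krylov method from CG to BiCGStab while keeping
 * hypre BoomerAMG as the preconditioner. */
PetscErrorCode UseBiCGStabWithBoomerAMG(KSP ksp)
{
   PC pc;

   PetscFunctionBeginUser;
   PetscCall(KSPSetType(ksp, KSPBCGS));        /* BiCGStab instead of KSPCG */
   PetscCall(KSPGetPC(ksp, &pc));
   PetscCall(PCSetType(pc, PCHYPRE));
   PetscCall(PCHYPRESetType(pc, "boomeramg"));
   PetscFunctionReturn(PETSC_SUCCESS);
}

The same switch can be made from the command line with -ksp_type bcgs -pc_type hypre -pc_hypre_type boomeramg.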

Could you check your OpenMPI version and test with 5.0.5?

Thanks

pledac avatar Oct 16 '24 19:10 pledac

Hi and thanks for your feedback!

I've been using OpenMPI 4.1.6, so I'll try a 5.x version and let you know the result. For me, the problem is not solver-dependent and occurs at the first assembly of the matrix.

Cheers, Ondrej

ondrejchrenko avatar Oct 17 '24 14:10 ondrejchrenko

Hi again,

I've tested with OpenMPI 5.0.5 but I am unfortunately getting the same segfault:

[acn35:1588861:0:1588861] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x153b88e00004)
==== backtrace (tid:1588861) ====
 0 0x0000000000012d10 __funlockfile() :0
 1 0x00000000009a6851 hypre_FinalizeCommunication() /scratch/project/open-29-3/hypre-master_paragpu5/src/struct_mv/struct_communication.c:1216
 2 0x00000000009b379e hypre_StructMatrixAssemble() /scratch/project/open-29-3/hypre-master_paragpu5/src/struct_mv/struct_matrix.c:1436
 3 0x0000000000996886 HYPRE_StructMatrixAssemble() /scratch/project/open-29-3/hypre-master_paragpu5/src/struct_mv/HYPRE_struct_matrix.c:323

Any other ideas are welcome...

Cheers, Ondrej

ondrejchrenko avatar Oct 18 '24 12:10 ondrejchrenko

Dear HYPRE developers, I would appreciate some additional feedback.

I have been trying to adapt one of the example codes 'ex3.c' to reproduce the error occurring on my cluster. The modified source code can be found here: https://github.com/ondrejchrenko/HYPRE_ex3

Could you please let me know:

  • whether the modifications I've made correctly convert the given example for use on GPU clusters with CUDA-aware MPI (the core change is sketched below)
  • whether you can reproduce the error when running the example on multiple GPUs with CUDA-aware MPI, or whether the code runs fine for you
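For context, the core change in my version of ex3.c is along these lines (a minimal sketch with illustrative names, not the exact code; the matrix, box extents and stencil entries are assumed to be created and initialized as in the original example):

#include <cuda_runtime.h>
#include "HYPRE_struct_mv.h"

/* Sketch: pass stencil coefficients to hypre from GPU memory.
 * h_values holds the nvalues coefficients assembled on the host. */
void SetMatrixValuesOnDevice(HYPRE_StructMatrix A,
                             HYPRE_Int *ilower, HYPRE_Int *iupper,
                             HYPRE_Int nentries, HYPRE_Int *entries,
                             HYPRE_Int nvalues, double *h_values)
{
   double *d_values;

   /* copy the coefficient array to the GPU ... */
   cudaMalloc((void **) &d_values, nvalues * sizeof(double));
   cudaMemcpy(d_values, h_values, nvalues * sizeof(double),
              cudaMemcpyHostToDevice);

   /* ... and hand the device pointer to hypre */
   HYPRE_StructMatrixSetBoxValues(A, ilower, iupper, nentries, entries,
                                  d_values);

   /* the segfault reported above happens inside this call */
   HYPRE_StructMatrixAssemble(A);

   cudaFree(d_values);
}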

Cheers, Ondrej

ondrejchrenko avatar Oct 23 '24 20:10 ondrejchrenko

@ondrejchrenko I apologize for the delay with this.

There was a bug in hypre for the scenario you described. Could you please test the PR linked to this issue?

victorapm avatar Jan 27 '25 16:01 victorapm

Dear @victorapm, thank you for letting me know. I can't seem to get GPU-aware MPI working in my HYPRE application. It certainly works on the cluster (for my other codes), but not when used with HYPRE. It would really help me to have a simple example that has been successfully tested with GPU-aware MPI; then I can ask the cluster support to work out possible compiler issues, etc.

For instance, earlier in this thread I linked an example which I think should run with GPU-aware MPI (because it does work on multiple GPUs with standard MPI). Could you check the example and let me know whether I am correct, or whether additional changes have to be made to make it compatible with GPU-aware MPI?
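In the meantime, this is the standalone check I plan to give the cluster support to confirm that the MPI build itself advertises CUDA awareness (a minimal sketch using OpenMPI's mpi-ext.h extension header; it exercises only MPI, not HYPRE):

#include <stdio.h>
#include <mpi.h>
#include <mpi-ext.h>   /* OpenMPI extensions: MPIX_CUDA_AWARE_SUPPORT, MPIX_Query_cuda_support() */

int main(int argc, char *argv[])
{
   MPI_Init(&argc, &argv);

#if defined(MPIX_CUDA_AWARE_SUPPORT) && MPIX_CUDA_AWARE_SUPPORT
   printf("compile-time CUDA-aware support: yes\n");
#else
   printf("compile-time CUDA-aware support: no / unknown\n");
#endif

#if defined(MPIX_CUDA_AWARE_SUPPORT)
   /* run-time check: returns 1 if the loaded library has CUDA support */
   printf("run-time CUDA-aware support: %d\n", MPIX_Query_cuda_support());
#endif

   MPI_Finalize();
   return 0;
}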

Cheers, Ondrej

ondrejchrenko avatar Apr 02 '25 10:04 ondrejchrenko