
amgx_mpi_poisson5pt and amgx_mpi_poisson7

Open pasquadambra opened this issue 4 years ago • 6 comments

Hi, I just installed AmgX on Piz Daint and I need to run the Laplace examples both in 2d and in 3d.

I am not able to run the code on more nodes by using increasing dimensions. For example, I run the following command, which uses 4 MPI tasks with a fixed grid of 256x256 grid nodes per MPI task:

srun -C gpu -n 4 examples/amgx_mpi_poisson5pt -p 256 256 -c ../core/configs/PCG_AGGREGATION_JACOBI.json

and I get the following error:

srun: job 26918173 queued and waiting for resources
srun: job 26918173 has been allocated resources
Process 0 selecting device 0
AMGX version 2.1.0.131-opensource
Built on Nov 12 2020, 17:58:51
Compiled with CUDA Runtime 10.2, using CUDA driver 10.2
Warning: No mode specified, using dDDI by default.
Cannot read file as JSON object, trying as AMGX config
Converting config string to current config version
Parsing configuration string: exception_handling=1 ;
Caught amgx exception: Cannot allocate pinned memory
at: /users/pdambra/AMGX/base/src/global_thread_handle.cu:374
Stack trace:
/users/pdambra/AMGX/build/libamgxsh.so : amgx::memory::PinnedMemoryPool::PinnedMemoryPool()+0xef
/users/pdambra/AMGX/build/libamgxsh.so : amgx::allocate_resources(unsigned long, unsigned long, unsigned long, unsigned long, unsigned long)+0x3b
/users/pdambra/AMGX/build/libamgxsh.so : amgx::Resources::Resources(amgx::AMG_Configuration*, void*, int, int const*)+0xb5e
/users/pdambra/AMGX/build/libamgxsh.so : AMGX_resources_create()+0xa4
examples/amgx_mpi_poisson5pt() [0x402835]
/lib64/libc.so.6 : __libc_start_main()+0xea
examples/amgx_mpi_poisson5pt() [0x4035ca]

Caught signal 11 - SIGSEGV (segmentation violation)
Process 1 selecting device 0
Warning: No mode specified, using dDDI by default.
Caught amgx exception: Cannot allocate pinned memory
at: /users/pdambra/AMGX/base/src/global_thread_handle.cu:374
Stack trace:
/users/pdambra/AMGX/build/libamgxsh.so : amgx::memory::PinnedMemoryPool::PinnedMemoryPool()+0xef
/users/pdambra/AMGX/build/libamgxsh.so : amgx::allocate_resources(unsigned long, unsigned long, unsigned long, unsigned long, unsigned long)+0x3b
/users/pdambra/AMGX/build/libamgxsh.so : amgx::Resources::Resources(amgx::AMG_Configuration*, void*, int, int const*)+0xb5e
/users/pdambra/AMGX/build/libamgxsh.so : AMGX_resources_create()+0xa4
examples/amgx_mpi_poisson5pt() [0x402835]
/lib64/libc.so.6 : __libc_start_main()+0xea
examples/amgx_mpi_poisson5pt() [0x4035ca]

/users/pdambra/AMGX/build/libamgxsh.so : amgx::handle_signals(int)+0x9a
/apps/daint/UES/xalt/xalt2/software/xalt/2.8.10/lib64/libpthread.so.0 : ()+0x132d0
/users/pdambra/AMGX/build/libamgxsh.so : amgx::CWrapHandle<AMGX_resources_handle_struct*, amgx::Resources>::CWrapHandle(AMGX_resources_handle_struct*)+0x37
/users/pdambra/AMGX/build/libamgxsh.so : AMGX_matrix_create_impl()+0x46
/users/pdambra/AMGX/build/libamgxsh.so : AMGX_matrix_create()+0x3a
examples/amgx_mpi_poisson5pt() [0x40284b]
/lib64/libc.so.6 : __libc_start_main()+0xea
examples/amgx_mpi_poisson5pt() [0x4035ca]
/users/pdambra/AMGX/build/libamgxsh.so : amgx::handle_signals(int)+0x9a
/apps/daint/UES/xalt/xalt2/software/xalt/2.8.10/lib64/libpthread.so.0 : ()+0x132d0
/users/pdambra/AMGX/build/libamgxsh.so : amgx::CWrapHandle<AMGX_resources_handle_struct*, amgx::Resources>::CWrapHandle(AMGX_resources_handle_struct*)+0x37
/users/pdambra/AMGX/build/libamgxsh.so : AMGX_matrix_create_impl()+0x46
/users/pdambra/AMGX/build/libamgxsh.so : AMGX_matrix_create()+0x3a
examples/amgx_mpi_poisson5pt() [0x40284b]
/lib64/libc.so.6 : __libc_start_main()+0xea
examples/amgx_mpi_poisson5pt() [0x4035ca]
Process 2 selecting device 0
Warning: No mode specified, using dDDI by default.
Caught amgx exception: Cannot allocate pinned memory
at: /users/pdambra/AMGX/base/src/global_thread_handle.cu:374
Stack trace:
/users/pdambra/AMGX/build/libamgxsh.so : amgx::memory::PinnedMemoryPool::PinnedMemoryPool()+0xef
/users/pdambra/AMGX/build/libamgxsh.so : amgx::allocate_resources(unsigned long, unsigned long, unsigned long, unsigned long, unsigned long)+0x3b
/users/pdambra/AMGX/build/libamgxsh.so : amgx::Resources::Resources(amgx::AMG_Configuration*, void*, int, int const*)+0xb5e
/users/pdambra/AMGX/build/libamgxsh.so : AMGX_resources_create()+0xa4
examples/amgx_mpi_poisson5pt() [0x402835]
/lib64/libc.so.6 : __libc_start_main()+0xea
examples/amgx_mpi_poisson5pt() [0x4035ca]

/users/pdambra/AMGX/build/libamgxsh.so : amgx::handle_signals(int)+0x9a
/apps/daint/UES/xalt/xalt2/software/xalt/2.8.10/lib64/libpthread.so.0 : ()+0x132d0
/users/pdambra/AMGX/build/libamgxsh.so : amgx::CWrapHandle<AMGX_resources_handle_struct*, amgx::Resources>::CWrapHandle(AMGX_resources_handle_struct*)+0x37
/users/pdambra/AMGX/build/libamgxsh.so : AMGX_matrix_create_impl()+0x46
/users/pdambra/AMGX/build/libamgxsh.so : AMGX_matrix_create()+0x3a
examples/amgx_mpi_poisson5pt() [0x40284b]
/lib64/libc.so.6 : __libc_start_main()+0xea
examples/amgx_mpi_poisson5pt() [0x4035ca]
srun: error: nid02525: tasks 0,2: Segmentation fault
srun: Terminating job step 26918173.0
slurmstepd: error: *** STEP 26918173.0 ON nid02525 CANCELLED AT 2020-11-13T11:26:37 ***
srun: error: nid02525: task 3: Terminated
srun: error: nid02525: task 1: Segmentation fault (core dumped)
srun: Force Terminated job step 26918173.0

Similar errors occur for the 3d example:

srun -C gpu -N 2 examples/amgx_mpi_poisson7 -p 64 64 64 -c ../core/configs/PCG_AGGREGATION_JACOBI.json

srun: job 26918336 queued and waiting for resources
srun: job 26918336 has been allocated resources
Process 0 selecting device 0
AMGX version 2.1.0.131-opensource
Built on Nov 12 2020, 17:58:51
Compiled with CUDA Runtime 10.2, using CUDA driver 10.2
Warning: No mode specified, using dDDI by default.
Cannot read file as JSON object, trying as AMGX config
Converting config string to current config version
Parsing configuration string: exception_handling=1 ;
Caught signal 11 - SIGSEGV (segmentation violation)
/users/pdambra/AMGX/build/libamgxsh.so : amgx::handle_signals(int)+0x9a
/apps/daint/UES/xalt/xalt2/software/xalt/2.8.10/lib64/libpthread.so.0 : ()+0x132d0
/lib64/libc.so.6 : ()+0x3deb0
examples/amgx_mpi_poisson7() [0x4029ab]
/lib64/libc.so.6 : __libc_start_main()+0xea
examples/amgx_mpi_poisson7() [0x4031aa]
Process 1 selecting device 0
Warning: No mode specified, using dDDI by default.
/users/pdambra/AMGX/build/libamgxsh.so : amgx::handle_signals(int)+0x9a
/apps/daint/UES/xalt/xalt2/software/xalt/2.8.10/lib64/libpthread.so.0 : ()+0x132d0
/lib64/libc.so.6 : ()+0x3deb0
examples/amgx_mpi_poisson7() [0x4029ab]
/lib64/libc.so.6 : __libc_start_main()+0xea
examples/amgx_mpi_poisson7() [0x4031aa]

Could you help me? Thanks

pasquadambra, Nov 13 '20 10:11

The error message suggests that the library is not able to allocate pinned memory.

There is a call to cudaMallocHost inside the initial resource allocation that is returning a null pointer. Presumably this is the first time that pinned memory has been requested for allocation, so it's hard to say exactly why this is happening.

Did you ever discover the root cause?

mattmartineau, Feb 01 '21 11:02

On several clusters, I got the error message "Caught amgx exception: Cannot allocate pinned memory", plus hangs in AmgX, when I linked our code against the static AmgX library instead of the shared one. But according to the OP's log (/users/pdambra/AMGX/build/libamgxsh.so), the shared library is already in use there, so that is not the cause here.

pledac, Mar 22 '21 15:03

@pasquadambra As Matt mentioned, the first error, regarding pinned memory allocation, is a fatal one.

> I am not able to run the code on more nodes by using increasing dimensions.

Does it work with some dimensions? With other examples/configurations?

Can you successfully allocate some pinned memory using cudaMallocHost in a standalone CUDA app? AMGX allocates 100MB of pinned memory during resource allocation. I assume that, in general, GPU allocation works correctly with your Slurm command and that GPUs are available to each process. I also note that each process uses GPU #0, but I assume that Slurm makes a single distinct GPU visible to each process.
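
A minimal sketch of such a standalone check, written in Python with ctypes rather than as a compiled CUDA program so that it needs no build step. It assumes the CUDA runtime library (libcudart) can be found by the dynamic loader, and that Slurm exports SLURM_PROCID and, where it restricts devices, CUDA_VISIBLE_DEVICES. The function names and the script name check_pinned.py are illustrative and not part of AMGX. Each rank reports which devices it sees and whether a 100 MB cudaMallocHost call succeeds:

```python
import ctypes
import ctypes.util
import os

PINNED_BYTES = 100 * 1024 * 1024  # 100 MB, the size mentioned above


def rank_info(env=os.environ):
    """Describe this task using variables that Slurm is assumed to export."""
    rank = env.get("SLURM_PROCID", "?")
    devices = env.get("CUDA_VISIBLE_DEVICES", "<unset>")
    return f"rank {rank}, CUDA_VISIBLE_DEVICES={devices}"


def load_cudart():
    """Return a handle to the CUDA runtime library, or None if it cannot be loaded."""
    name = ctypes.util.find_library("cudart") or "libcudart.so"
    try:
        return ctypes.CDLL(name)
    except OSError:
        return None


def try_pinned_alloc(nbytes=PINNED_BYTES):
    """Try to allocate and free nbytes of pinned host memory; return a status string."""
    cudart = load_cudart()
    if cudart is None:
        return "skipped: CUDA runtime library not found"

    # Declare the C signatures of the runtime calls used below.
    cudart.cudaMallocHost.argtypes = [ctypes.POINTER(ctypes.c_void_p), ctypes.c_size_t]
    cudart.cudaMallocHost.restype = ctypes.c_int
    cudart.cudaFreeHost.argtypes = [ctypes.c_void_p]
    cudart.cudaFreeHost.restype = ctypes.c_int
    cudart.cudaGetErrorString.argtypes = [ctypes.c_int]
    cudart.cudaGetErrorString.restype = ctypes.c_char_p

    ptr = ctypes.c_void_p()
    err = cudart.cudaMallocHost(ctypes.byref(ptr), nbytes)
    if err != 0 or not ptr.value:
        msg = cudart.cudaGetErrorString(err).decode()
        return f"failed: cudaMallocHost returned {err} ({msg})"

    cudart.cudaFreeHost(ptr)
    return f"ok: allocated and freed {nbytes} bytes of pinned memory"


if __name__ == "__main__":
    print(rank_info())
    print(try_pinned_alloc())
```

Launched with the same options as the failing job, for example `srun -C gpu -n 4 python3 check_pinned.py`, a "failed" line from any rank would suggest a problem with the node or CUDA setup rather than with AMGX itself.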

marsaev, Mar 22 '21 19:03

@pledac does it happen only with static library?

marsaev, Mar 22 '21 19:03

> @pledac does it happen only with static library?

Yes, once I switched to libamgxsh.so, the two issues (on two different clusters) vanished. I should add that the hangs in AmgX didn't happen for all my test cases (different configurations), and they were also random for the same configuration. As the amgx_mpi_poisson7 binary was working well on the two clusters with the same configurations as our code, I looked at the differences between the binaries and found that the linked AmgX library was not the same.
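
A comparison of this kind can be scripted. The sketch below assumes a Linux system where `ldd` is available and that the shared AmgX library has "amgx" in its file name, as libamgxsh.so does. The helper names are illustrative. A binary linked against the static library would be expected to list no AmgX entry at all:

```python
import subprocess


def amgx_libs_from_ldd(ldd_output):
    """Return the resolved targets of libraries whose name contains 'amgx'."""
    libs = []
    for line in ldd_output.splitlines():
        name, sep, target = line.strip().partition(" => ")
        if sep and "amgx" in name.lower():
            # target looks like "/path/to/lib.so (0x...)" or "not found";
            # drop the load address if present
            libs.append(target.split(" (")[0])
    return libs


def linked_amgx_libs(binary):
    """Run ldd on a binary and report the AmgX shared libraries it resolves."""
    result = subprocess.run(
        ["ldd", binary], capture_output=True, text=True, check=True
    )
    return amgx_libs_from_ldd(result.stdout)
```

Calling `linked_amgx_libs("examples/amgx_mpi_poisson7")` and then the same function on the application binary gives two lists to compare. An empty list for one of them is a hint that it was linked statically.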

pledac, Mar 22 '21 20:03

Hmmm, I cannot blame anything inside AMGX off the top of my head, but this is definitely suspicious. To be honest, I haven't used the shared library myself in a long time. I will create a tracking item regarding the shared library. Thanks for providing the info.

marsaev, Mar 24 '21 13:03