AMGX 2.2.0 invalid write error with cuda-memcheck

Open guignont opened this issue 3 years ago • 4 comments

Hello, we recently moved from AMGX 2.1.0 to 2.2.0 and noticed that some cases that usually run well sometimes fail with this error:

Thrust failure: after reduction step 2: cudaErrorIllegalAddress: an illegal memory access was encountered
File and line number are not available for this exception.

We checked with cuda-memcheck and discovered an invalid write through a null pointer in amgx::distance2::compute_c_hat_kernel:

AMGX initialization, with GPU kernel...
AMGX version 2.2.0.132-opensource
Built on Apr 25 2022, 07:52:43
Compiled with CUDA Runtime 11.6, using CUDA driver 11.2
Cannot read file as JSON object, trying as AMGX config
AMGX: AmgX use gpu 0
AMGX: nrings = 2
========= Invalid global write of size 4
=========     at 0x00001a30 in void amgx::distance2::compute_c_hat_kernel<int=8, int=256, int=128, int=32>(int, int const *, int const , int const , bool const , int const , int, bool const *, bool const *, int, bool const *, bool const *, bool const *)
=========     by thread (227,0,0) in block (1,0,0)
=========     Address 0x00000000 is out of bounds

Note that when there is no invalid write, our solver does not converge, whereas it did with 2.1.0. We looked for uninitialized memory in both the CPU and GPU parts but found nothing. Also, the problem is not specific to a particular matrix.

Running with cuda-gdb gives a similar error:

CUDA Exception: Warp Illegal Address
The exception was triggered at PC 0x1cd8fa50

Thread 1 "TestSolver.exe" received signal CUDA_EXCEPTION_14, Warp Illegal Address.
[Switching focus to CUDA kernel 0, grid 587, block (0,0,0), thread (192,0,0), device 0, sm 0, warp 6, lane 0]
0x000000001cd90788 in void amgx::distance2::compute_c_hat_kernel<8, 256, 128, 32>(int, int const*, int const*, int const*, bool const*, int const*, int*, int*, int*, int, int*, int*, int*)<<<(128,1,1),(256,1,1)>>> ()

A+ tg

guignont avatar Jun 07 '22 15:06 guignont

Did you happen to try with the main branch? It is possible we fixed this issue already but it's not yet in an official release.

mattmartineau avatar Jun 14 '22 08:06 mattmartineau

Hello, following your remark I took a closer look at which branch we were using, and it was the master branch. I may have been confused by the AmgX output "AMGX version 2.2.0.132-opensource".

Testing with the git branch 2.2.x gives the expected convergence behavior, and there is no illegal memory access. Nevertheless, AmgX still writes "AMGX version 2.2.0.132-opensource".

A+ tg

guignont avatar Jun 15 '22 14:06 guignont

OK thanks for checking. I'm not immediately sure why a null pointer would be present - could you share a matrix and configuration that reproduces the problem?
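
If it helps with sharing a reproducer, here is a minimal sketch of one way to dump a matrix to a text file. It assumes the matrix is available on the host as CSR arrays and that plain MatrixMarket coordinate format is acceptable for the report. The function name `write_matrix_market` and the 3x3 example data are hypothetical, not part of AMGX.

```python
import io


def write_matrix_market(out, n_rows, n_cols, row_offsets, col_indices, values):
    """Write a CSR matrix to `out` in MatrixMarket coordinate format.

    Hypothetical helper: row_offsets, col_indices and values are the usual
    0-based CSR arrays; MatrixMarket indices are 1-based, hence the +1.
    """
    nnz = len(values)
    out.write("%%MatrixMarket matrix coordinate real general\n")
    out.write(f"{n_rows} {n_cols} {nnz}\n")
    for row in range(n_rows):
        for k in range(row_offsets[row], row_offsets[row + 1]):
            # .17g keeps enough digits to round-trip a double exactly
            out.write(f"{row + 1} {col_indices[k] + 1} {values[k]:.17g}\n")


# Made-up 3x3 tridiagonal matrix, only for illustration
row_offsets = [0, 2, 5, 7]
col_indices = [0, 1, 0, 1, 2, 1, 2]
values = [2.0, -1.0, -1.0, 2.0, -1.0, -1.0, 2.0]

buf = io.StringIO()
write_matrix_market(buf, 3, 3, row_offsets, col_indices, values)
text = buf.getvalue()
```

To write to disk instead of an in-memory buffer, pass a file opened with `open("matrix.mtx", "w")` as `out`. The right-hand side and the solver configuration file would still need to be attached separately.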

mattmartineau avatar Jun 15 '22 23:06 mattmartineau

Hello, FYI: with AMGX 2.3.0 we no longer observe the error "Thrust failure: after reduction step 2: cudaErrorIllegalAddress: an illegal memory access was encountered", but we don't get the right convergence behavior.

A+ tg

guignont avatar Jul 19 '23 09:07 guignont