hypre
cuda-memcheck errors with Cuda 11
Hi, I've been running GMRES-BoomerAMG computations through the cuda-memcheck race check tool. I've tried 5 different compiler/cuda combos on 2 different tests (6 GPUs on a small test, 18 GPUs on a bigger test) on Summit.
cuda-memcheck flags: cuda-memcheck --tool racecheck --racecheck-report all
gcc/cuda versions:
- gcc 7.5.0 / cuda 10.2.89
- gcc 9.3.0 / cuda 11.0.3
- gcc 10.2.0 / cuda 11.1.1
- gcc 10.2.0 / cuda 11.3.1
- gcc 10.2.0 / cuda 11.4.2
gcc 7.5.0/cuda 10.2.89 is the only combination that completes both tests cleanly.
Every other combination fails with an unspecified launch error. For 11.0.3/11.1.1, the failures occur at the end of GMRESSetup, via:
========= CUDA-MEMCHECK
========= Internal Memcheck Error: Detected racecheck error buffer overflow. Some records have been dropped.
========= Saved host backtrace up to driver entry point at error
========= Host Frame:/sw/summit/cuda/11.1.1/lib64/libcuda.so.1 [0x2a741c]
========= Host Frame:/lib64/power9/libpthread.so.0 [0x8ae0]
========= Host Frame:/lib64/power9/libc.so.6 (clone + 0x74) [0x12e7c8]
=========
========= Error: process didn't terminate successfully
========= No CUDA-MEMCHECK results found
For 11.3.1 and 11.4.2, the failures occur some way into the GMRES iterations, at memory.c line 660. Prior to that, I see things like ...
========= CUDA-MEMCHECK
========= ERROR: Potential WAW hazard detected at __shared__ 0xb6 in block (2, 0, 0) :
========= Write Thread (29, 0, 0) at 0x00003060 in _ZN8cusparse21load_balancing_kernelILj512ELj4ELm16384EiiNS_7CsrmvOpILi512EdLb1EEEJKiKdS4_didEEEvPKT3_T2_S5_S5_iPKS8_T4_DpPT5_
========= Write Thread (30, 0, 0) at 0x00003060 in _ZN8cusparse21load_balancing_kernelILj512ELj4ELm16384EiiNS_7CsrmvOpILi512EdLb1EEEJKiKdS4_didEEEvPKT3_T2_S5_S5_iPKS8_T4_DpPT5_
========= Current Value : 73, Incoming Value : 32
========= Saved host backtrace up to driver entry point at kernel launch time
========= Host Frame:/sw/summit/cuda/11.3.1/lib64/libcuda.so.1 (cuLaunchKernel + 0x58) [0x1e5cd8]
========= Host Frame:/sw/summit/cuda/11.3.1/lib64/libcusparse.so.11 [0x6ba8f8]
========= Host Frame:/sw/summit/cuda/11.3.1/lib64/libcusparse.so.11 [0x732754]
========= Host Frame:/sw/summit/cuda/11.3.1/lib64/libcusparse.so.11 [0x16c4d0]
========= Host Frame:/sw/summit/cuda/11.3.1/lib64/libcusparse.so.11 [0x17abd0]
========= Host Frame:/sw/summit/cuda/11.3.1/lib64/libcusparse.so.11 (cusparseSpMV + 0x31c) [0xd846c]
========= Host Frame:/gpfs/alpine/cfd116/scratch/mullowne/install/gpu/hypre-master-nouvm-2022-02-17-cuda-11.3.1-gcc-10.2.0/lib/libHYPRE-2.24.0.so (hypre_CSRMatrixMatvecCusparseNewAPI + 0x14c) [0x4949bc]
========= Host Frame:/gpfs/alpine/cfd116/scratch/mullowne/install/gpu/hypre-master-nouvm-2022-02-17-cuda-11.3.1-gcc-10.2.0/lib/libHYPRE-2.24.0.so (hypre_CSRMatrixMatvecDevice + 0x170) [0x494d30]
========= Host Frame:/gpfs/alpine/cfd116/scratch/mullowne/install/gpu/hypre-master-nouvm-2022-02-17-cuda-11.3.1-gcc-10.2.0/lib/libHYPRE-2.24.0.so (hypre_CSRMatrixMatvecOutOfPlace + 0xcc) [0x2b516c]
========= Host Frame:/gpfs/alpine/cfd116/scratch/mullowne/install/gpu/hypre-master-nouvm-2022-02-17-cuda-11.3.1-gcc-10.2.0/lib/libHYPRE-2.24.0.so (hypre_CSRMatrixMatvec + 0x20) [0x2b51d0]
========= Host Frame:/gpfs/alpine/cfd116/scratch/mullowne/install/gpu/hypre-master-nouvm-2022-02-17-cuda-11.3.1-gcc-10.2.0/lib/libHYPRE-2.24.0.so (hypre_ParCSRMatrixMatvecT + 0x4a4) [0x46a894]
========= Host Frame:/gpfs/alpine/cfd116/scratch/mullowne/install/gpu/hypre-master-nouvm-2022-02-17-cuda-11.3.1-gcc-10.2.0/lib/libHYPRE-2.24.0.so (hypre_BoomerAMGCycle + 0x11dc) [0x1e1a4c]
========= Host Frame:/gpfs/alpine/cfd116/scratch/mullowne/install/gpu/hypre-master-nouvm-2022-02-17-cuda-11.3.1-gcc-10.2.0/lib/libHYPRE-2.24.0.so (hypre_BoomerAMGSolve + 0x3b4) [0x1ba044]
========= Host Frame:/gpfs/alpine/cfd116/scratch/mullowne/install/gpu/hypre-master-nouvm-2022-02-17-cuda-11.3.1-gcc-10.2.0/lib/libHYPRE-2.24.0.so (HYPRE_BoomerAMGSolve + 0x18) [0x193cc8]
========= Host Frame:/gpfs/alpine/cfd116/scratch/mullowne/install/gpu/hypre-master-nouvm-2022-02-17-cuda-11.3.1-gcc-10.2.0/lib/libHYPRE-2.24.0.so (hypre_GMRESSolve + 0x6b0) [0x17bd20]
========= Host Frame:/gpfs/alpine/cfd116/scratch/mullowne/install/gpu/hypre-master-nouvm-2022-02-17-cuda-11.3.1-gcc-10.2.0/lib/libHYPRE-2.24.0.so (HYPRE_GMRESSolve + 0x18) [0x1719b8]
========= Host Frame:/gpfs/alpine/cfd116/scratch/mullowne/install/gpu/hypre-master-nouvm-2022-02-17-cuda-11.3.1-gcc-10.2.0/lib/libHYPRE-2.24.0.so (HYPRE_ParCSRGMRESSolve + 0x18) [0x198ac8]
========= Host Frame:/gpfs/alpine/cfd116/scratch/mullowne/hypre-mini-app/build_cuda_nouvm_cuda_11.3.1_gcc_10.2.0/hypre_app [0x16e04]
========= Host Frame:/gpfs/alpine/cfd116/scratch/mullowne/hypre-mini-app/build_cuda_nouvm_cuda_11.3.1_gcc_10.2.0/hypre_app [0x95ac]
========= Host Frame:/lib64/power9/libc.so.6 [0x24078]
...
1 5.286120e+04 0.704449 7.044485e-01
2 1.934735e+04 0.366003 2.578302e-01
3 7.291080e+03 0.376852 9.716373e-02
4 2.638435e+03 0.361872 3.516080e-02
5 9.445420e+02 0.357993 1.258733e-02
CUDA ERROR (code = 700, an illegal memory access was encountered) at memory.c:660
CUDA ERROR (code = 700, an illegal memory access was encountered) at memory.c:660
========= Host Frame:/lib64/power9/libc.so.6 (__libc_start_main + 0xb4) [0x24264]
=========
========= LEAK SUMMARY: 0 bytes leaked in 0 allocations
========= RACECHECK SUMMARY: 10000 hazards displayed (9535347 errors, 2693543 warnings)
I was hoping that someone else could try something similar and perhaps replicate what I see. -Paul
@PaulMullowney On Lassen (which has the same GPUs and CPUs as Summit), with gcc and CUDA 11.6, I had the same issue with
mpirun -n 1 mpibind cuda-memcheck --tool racecheck --racecheck-report all ./ij
Errors:
========= ERROR: Potential WAW hazard detected at __shared__ 0xe7 in block (0, 0, 0) :
========= Write Thread (7, 0, 0) at 0x000030c0 in _ZN8cusparse21load_balancing_kernelILj512ELj4ELm16384EiiNS_7CsrmvOpILi512EdLb1EEEJKiKdS4_didEEEvPKT3_T2_S5_S5_iPKS8_T4_DpPT5_
========= Write Thread (8, 0, 0) at 0x000030c0 in _ZN8cusparse21load_balancing_kernelILj512ELj4ELm16384EiiNS_7CsrmvOpILi512EdLb1EEEJKiKdS4_didEEEvPKT3_T2_S5_S5_iPKS8_T4_DpPT5_
========= Current Value : 63, Incoming Value : 0
========= Saved host backtrace up to driver entry point at kernel launch time
========= Host Frame:/usr/tce/packages/cuda/cuda-11.6.0/lib64/libcuda.so.1 (cuLaunchKernel + 0x58) [0x271d78]
========= Host Frame:/usr/tce/packages/cuda/cuda-11.6.0/lib64/libcusparse.so.11 [0x877008]
which points inside cusparse kernels? I guess it's a CUDA issue? @pwang234
With CUDA 10.1, it ran cleanly.
I can reproduce this. It looks like the error is reported when calling cusparseSpMV for the first coarse level. However, I dumped the hypre matrix and wrote a standalone cusparseSpMV test, and that runs fine with no racecheck error. So it's not clear whether this is purely a cusparse issue or due to some interaction with other parts of hypre. I will keep looking.
It looks like the key parameter for triggering the racecheck error is the value of beta: the error is triggered only when beta is zero, and that can be reproduced with the standalone test as well. The first coarse level is where cusparseSpMV is first called with beta=0. So this does look like a cusparse issue. I will file a bug with the cusparse team about this.
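For anyone who wants to poke at this outside of hypre, a minimal sketch of such a standalone test is below. This is not the exact test used above: the matrix is a tiny hard-coded CSR example rather than the dumped hypre matrix, and whether racecheck actually reports the hazard likely depends on the matrix being large enough to exercise cusparse's load-balancing kernel. It assumes the CUDA 11.x generic SpMV API; CUSPARSE_SPMV_ALG_DEFAULT is the post-11.2.1 enum name (older 11.x releases used CUSPARSE_MV_ALG_DEFAULT). Build with something like nvcc spmv_repro.c -lcusparse.

/* spmv_repro.c -- minimal standalone cusparseSpMV racecheck repro sketch.
   NOT the exact test used above: tiny hard-coded CSR matrix instead of the
   dumped hypre matrix. The case of interest is beta = 0.0. */
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <cusparse.h>

#define CHECK_CUDA(x)     do { cudaError_t e = (x); if (e != cudaSuccess) { \
   printf("CUDA error %s at line %d\n", cudaGetErrorString(e), __LINE__); exit(1); } } while (0)
#define CHECK_CUSPARSE(x) do { cusparseStatus_t s = (x); if (s != CUSPARSE_STATUS_SUCCESS) { \
   printf("cuSPARSE error %d at line %d\n", (int) s, __LINE__); exit(1); } } while (0)

int main(void)
{
   /* 4x4 CSR matrix, dense x, output y */
   int    h_rowptr[5] = {0, 2, 4, 6, 8};
   int    h_colind[8] = {0, 1, 1, 2, 2, 3, 0, 3};
   double h_values[8] = {1, 2, 3, 4, 5, 6, 7, 8};
   double h_x[4] = {1, 1, 1, 1}, h_y[4] = {0, 0, 0, 0};
   double alpha = 1.0, beta = 0.0;   /* beta = 0.0 is what triggers the racecheck report */

   int *d_rowptr, *d_colind;  double *d_values, *d_x, *d_y;
   CHECK_CUDA( cudaMalloc((void **) &d_rowptr, 5 * sizeof(int)) );
   CHECK_CUDA( cudaMalloc((void **) &d_colind, 8 * sizeof(int)) );
   CHECK_CUDA( cudaMalloc((void **) &d_values, 8 * sizeof(double)) );
   CHECK_CUDA( cudaMalloc((void **) &d_x, 4 * sizeof(double)) );
   CHECK_CUDA( cudaMalloc((void **) &d_y, 4 * sizeof(double)) );
   CHECK_CUDA( cudaMemcpy(d_rowptr, h_rowptr, 5 * sizeof(int),    cudaMemcpyHostToDevice) );
   CHECK_CUDA( cudaMemcpy(d_colind, h_colind, 8 * sizeof(int),    cudaMemcpyHostToDevice) );
   CHECK_CUDA( cudaMemcpy(d_values, h_values, 8 * sizeof(double), cudaMemcpyHostToDevice) );
   CHECK_CUDA( cudaMemcpy(d_x, h_x, 4 * sizeof(double), cudaMemcpyHostToDevice) );
   CHECK_CUDA( cudaMemcpy(d_y, h_y, 4 * sizeof(double), cudaMemcpyHostToDevice) );

   cusparseHandle_t handle;
   cusparseSpMatDescr_t matA;
   cusparseDnVecDescr_t vecX, vecY;
   CHECK_CUSPARSE( cusparseCreate(&handle) );
   CHECK_CUSPARSE( cusparseCreateCsr(&matA, 4, 4, 8, d_rowptr, d_colind, d_values,
                                     CUSPARSE_INDEX_32I, CUSPARSE_INDEX_32I,
                                     CUSPARSE_INDEX_BASE_ZERO, CUDA_R_64F) );
   CHECK_CUSPARSE( cusparseCreateDnVec(&vecX, 4, d_x, CUDA_R_64F) );
   CHECK_CUSPARSE( cusparseCreateDnVec(&vecY, 4, d_y, CUDA_R_64F) );

   size_t bufferSize = 0;  void *dBuffer = NULL;
   CHECK_CUSPARSE( cusparseSpMV_bufferSize(handle, CUSPARSE_OPERATION_NON_TRANSPOSE,
                                           &alpha, matA, vecX, &beta, vecY, CUDA_R_64F,
                                           CUSPARSE_SPMV_ALG_DEFAULT, &bufferSize) );
   CHECK_CUDA( cudaMalloc(&dBuffer, bufferSize) );

   /* Run under: cuda-memcheck --tool racecheck --racecheck-report all ./a.out
      With beta == 0.0 this is the call that may show the WAW hazard in
      cusparse's load_balancing_kernel (matrix size permitting). */
   CHECK_CUSPARSE( cusparseSpMV(handle, CUSPARSE_OPERATION_NON_TRANSPOSE,
                                &alpha, matA, vecX, &beta, vecY, CUDA_R_64F,
                                CUSPARSE_SPMV_ALG_DEFAULT, dBuffer) );
   CHECK_CUDA( cudaDeviceSynchronize() );

   CHECK_CUDA( cudaMemcpy(h_y, d_y, 4 * sizeof(double), cudaMemcpyDeviceToHost) );
   printf("y = %g %g %g %g\n", h_y[0], h_y[1], h_y[2], h_y[3]);

   CHECK_CUSPARSE( cusparseDestroySpMat(matA) );
   CHECK_CUSPARSE( cusparseDestroyDnVec(vecX) );
   CHECK_CUSPARSE( cusparseDestroyDnVec(vecY) );
   CHECK_CUSPARSE( cusparseDestroy(handle) );
   cudaFree(dBuffer); cudaFree(d_rowptr); cudaFree(d_colind);
   cudaFree(d_values); cudaFree(d_x); cudaFree(d_y);
   return 0;
}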
Thanks @pwang234 !
The cusparse team confirmed that the racecheck error when beta=0 is a false positive: it comes from an optimization and doesn't affect correctness. So this particular racecheck error can be ignored.
Thanks for investigating. This is helpful.
I need to make these memcheck errors go away. @liruipeng do you see anything wrong with my changes below? I'm trying to save the y vector before the SpMV, do a SpMV with beta=1, and then subtract the saved vector out after the SpMV.
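The idea is that if y0 is the saved copy of y, the beta=1 SpMV produces y = alpha*A*x + y0, and subtracting y0 afterwards leaves alpha*A*x, the same result the beta=0 call would have given.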
index 85677d62e..d2e730e18 100644
--- a/src/seq_mv/csr_matvec_device.c
+++ b/src/seq_mv/csr_matvec_device.c
@@ -162,6 +162,12 @@ hypre_CSRMatrixMatvecCusparseNewAPI( HYPRE_Int trans,
cusparseDnVecDescr_t vecX = hypre_VectorToCusparseDnVec(x, 0, x_size_override);
cusparseDnVecDescr_t vecY = hypre_VectorToCusparseDnVec(y, offset, y_size_override - offset);
+ HYPRE_Int ysize = y_size_override - offset;
+ HYPRE_Complex * temp = hypre_CTAlloc(HYPRE_Complex, ysize, HYPRE_MEMORY_DEVICE);
+ hypre_TMemcpy(temp, hypre_VectorData(y)+offset, HYPRE_Complex, ysize,
+ HYPRE_MEMORY_DEVICE, HYPRE_MEMORY_DEVICE);
+ beta=1.0;
+
if (!dBuffer)
{
HYPRE_CUSPARSE_CALL( cusparseSpMV_bufferSize(handle,
@@ -201,6 +207,11 @@ hypre_CSRMatrixMatvecCusparseNewAPI( HYPRE_Int trans,
hypre_SyncComputeStream(hypre_handle());
+ HYPRE_Complex minus_one;
+ HYPRE_THRUST_CALL( transform, temp, temp + ysize, hypre_VectorData(y)+offset,
+ hypre_VectorData(y)+offset, minus_one * _1 + _2 );
+ hypre_TFree(temp,HYPRE_MEMORY_DEVICE);
+
Nevermind, I see my mistake.
The diff below works. Once I added these changes, the cusparse-related race check complaints went away and other compute-sanitizer errors appeared in other parts of the application. Those other problems were impossible to see while the cusparse-related "false positives" were polluting the compute-sanitizer output. I'll also add that compute-sanitizer is able to progress much further into a simulation (i.e. run a lot faster) without all the spurious output.
I think the Cusparse team needs to fix this.
index 85677d62e..a5271484e 100644
--- a/src/seq_mv/csr_matvec_device.c
+++ b/src/seq_mv/csr_matvec_device.c
@@ -162,6 +162,21 @@ hypre_CSRMatrixMatvecCusparseNewAPI( HYPRE_Int trans,
cusparseDnVecDescr_t vecX = hypre_VectorToCusparseDnVec(x, 0, x_size_override);
cusparseDnVecDescr_t vecY = hypre_VectorToCusparseDnVec(y, offset, y_size_override - offset);
+ HYPRE_Int ysize = 0;
+ HYPRE_Complex * temp=NULL;
+ if (beta==0.0) {
+ if (y_size_override - offset >= 0)
+ ysize = y_size_override - offset;
+ else
+ ysize = hypre_VectorSize(y) - offset;
+
+ temp = hypre_CTAlloc(HYPRE_Complex, ysize, HYPRE_MEMORY_DEVICE);
+
+ hypre_TMemcpy(temp, hypre_VectorData(y)+offset, HYPRE_Complex, ysize,
+ HYPRE_MEMORY_DEVICE, HYPRE_MEMORY_DEVICE);
+ beta=1.0;
+ }
+
if (!dBuffer)
{
HYPRE_CUSPARSE_CALL( cusparseSpMV_bufferSize(handle,
@@ -201,6 +216,13 @@ hypre_CSRMatrixMatvecCusparseNewAPI( HYPRE_Int trans,
hypre_SyncComputeStream(hypre_handle());
+ if (temp) {
+ HYPRE_Complex minus_one=-1.0;
+ HYPRE_THRUST_CALL( transform, temp, temp + ysize, hypre_VectorData(y)+offset,
+ hypre_VectorData(y)+offset, minus_one * _1 + _2 );
+ hypre_TFree(temp,HYPRE_MEMORY_DEVICE);
+ }
+
if (trans)
{
hypre_CSRMatrixDestroy(AT);
Here's a simpler fix that works on many GPUs. https://github.com/PaulMullowney/hypre/commit/f9fbf47a71f24f339ce875e9ce097c0022a659e8
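One simpler shape such a workaround could take (sketched below purely as a hypothetical; the linked commit may do something entirely different) is to zero y on the device whenever beta == 0 and then call cusparseSpMV with beta = 1, so cusparse never takes its beta == 0 code path and no temp vector or extra Thrust pass is needed. With ysize and offset computed as in the diff above:

   /* Hypothetical sketch only -- the actual commit may differ.
      If the caller requests beta == 0, zero y explicitly and use beta == 1,
      so cusparse never takes the beta == 0 path that trips racecheck. */
   if (beta == 0.0)
   {
      /* y := 0, then y := alpha*A*x + 1.0*y, which equals the beta == 0 result */
      cudaMemset(hypre_VectorData(y) + offset, 0, ysize * sizeof(HYPRE_Complex));
      beta = 1.0;
   }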