[Bug]: multiple kernels with a 2D domain on remote variables result in internal error

Open jabraham17 opened this issue 1 year ago • 6 comments

Summary of Problem

The following code produces the error "gpu-nvidia.c:292: Error calling CUDA function: an illegal memory access was encountered".

const D = {0..<10, 0..<10};
on here.gpus[0] var A: [D] bool;
on here.gpus[0] var B: [D] bool;
on here.gpus[0] {
  const DD = D; // localize domain
  forall idx in DD do B = A[idx];
  var neq: [DD] bool;
  foreach idx in DD do neq[idx] = A[idx] != B[idx];
}

There are two kernels in this code: the forall and the foreach. Commenting out either one makes the error go away. Also note that D is a 2D domain; if it is 1D, the error does not occur. Lastly, declaring A and B inside the on block (instead of as remote variable declarations) makes the error go away.

Configuration Information

  • Output of chpl --version: 2.2.0 pre-release
  • Output of $CHPL_HOME/util/printchplenv --anonymize:
CHPL_TARGET_PLATFORM: linux64
CHPL_TARGET_COMPILER: llvm
CHPL_TARGET_ARCH: x86_64
CHPL_TARGET_CPU: native
CHPL_LOCALE_MODEL: gpu *
  CHPL_GPU: nvidia *
CHPL_COMM: none
CHPL_TASKS: qthreads
CHPL_LAUNCHER: none
CHPL_TIMERS: generic
CHPL_UNWIND: none
CHPL_MEM: jemalloc
CHPL_ATOMICS: cstdlib
CHPL_GMP: bundled
CHPL_HWLOC: bundled
CHPL_RE2: bundled
CHPL_LLVM: bundled *
CHPL_AUX_FILESYS: none
  • Back-end compiler and version, e.g. gcc --version or clang --version: LLVM 18

jabraham17 avatar Jul 29 '24 23:07 jabraham17

Some other data points/thoughts:

  • making the first loop a foreach also works
  • between forall-as-the-first-loop and foreach-as-the-first-loop, there's an LICM difference in the second loop: the former causes A and B not to be hoisted by LICM, leaving array metadata inside the kernel, whereas in the latter case all the kernel references are ddatas, and that works fine
  • LICM shouldn't impact correctness, so what the real bug is remains a bit curious.
  • I am also curious whether having A and B as remotely declared variables vs. local ones in scope impacts LICM, rather than being the root cause of the issue itself. In other words, the remote-declared-ness of these variables may only matter because it produces a different AST structure.
  • On newer CUDA versions, I actually see "misaligned address" instead, which is much harder to debug. I wonder if we should try to debug this on an older CUDA with cuda-gdb to understand what's wrong.

General info on passing arrays as a whole (the array record) to kernels:

  • Our implementation there has always been a bit shaky, which gets papered over by aggressive LICM for the most part.
  • A GPU array has a CPU-based record that wraps a GPU-based class. We should be able to pass C structs/Chapel records directly to CUDA kernels, since they are stack-allocated; however, my understanding there could be wrong for the CUDA driver API we use. We pass the addresses of parameters to the kernel launch API, and it doesn't make much sense to pass the address of something stack-allocated, like the array record.
  • To address that, we are supposed to pass array records by offload -- we allocate memory on the device, bit-copy the record into it, and use that copy in the kernel. Is something going wrong there?

e-kayrakli avatar Jul 29 '24 23:07 e-kayrakli

On newer CUDA versions, I actually see "misaligned address" instead, which is much harder to debug. I wonder if we should try to debug this on an older CUDA with cuda-gdb to understand what's wrong.

Just noting that I saw this as well: sometimes the runs would fail with "illegal memory access" and sometimes with "misaligned address".

jabraham17 avatar Jul 29 '24 23:07 jabraham17

Are N-dimensional domains still only parallelized over the first dimension on GPUs?

Iainmon avatar Jul 31 '24 06:07 Iainmon

Yes. See https://github.com/chapel-lang/chapel/issues/22152 and https://github.com/chapel-lang/chapel/issues/24331

e-kayrakli avatar Jul 31 '24 14:07 e-kayrakli

This might get lost in a previous comment I made, but based on your recollection (not asking you to rerun anything), @jabraham17, would it be correct to say that using foreach for both loops is an acceptable workaround for the scenario in the OP?

e-kayrakli avatar Aug 15 '24 18:08 e-kayrakli

This might get lost in a previous comment I made, but based on your recollection (not asking you to rerun anything), @jabraham17, would it be correct to say that using foreach for both loops is an acceptable workaround for the scenario in the OP?

Yes, using foreach for both loops is a good workaround for this issue.

jabraham17 avatar Aug 15 '24 18:08 jabraham17