[Bug]: multiple kernels with a 2D domain on remote variables result in internal error
### Summary of Problem
The following code produces the error "gpu-nvidia.c:292: Error calling CUDA function: an illegal memory access was encountered".
```chapel
const D = {0..<10, 0..<10};
on here.gpus[0] var A: [D] bool;
on here.gpus[0] var B: [D] bool;
on here.gpus[0] {
  const DD = D; // localize domain
  forall idx in DD do B = A[idx];
  var neq: [DD] bool;
  foreach idx in DD do neq[idx] = A[idx] != B[idx];
}
```
There are two kernels in this code: the `forall` and the `foreach`. Commenting out one or the other makes the error go away. Also note that `D` is a 2D domain; if it is 1D, the error does not occur. Lastly, declaring `A` and `B` inside the `on` block (instead of as remote variable declarations) makes the error go away.
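For reference, here is a sketch (untested) of the last variant mentioned above, with `A` and `B` declared inside the `on` block rather than as remote variables:

```chapel
const D = {0..<10, 0..<10};
on here.gpus[0] {
  // Declaring the arrays inside the `on` block (no remote variable
  // declarations) makes the error go away.
  var A: [D] bool;
  var B: [D] bool;
  const DD = D; // localize domain
  forall idx in DD do B = A[idx];
  var neq: [DD] bool;
  foreach idx in DD do neq[idx] = A[idx] != B[idx];
}
```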
### Configuration Information
- Output of `chpl --version`: 2.2.0 pre-release
- Output of `$CHPL_HOME/util/printchplenv --anonymize`:

```
CHPL_TARGET_PLATFORM: linux64
CHPL_TARGET_COMPILER: llvm
CHPL_TARGET_ARCH: x86_64
CHPL_TARGET_CPU: native
CHPL_LOCALE_MODEL: gpu *
CHPL_GPU: nvidia *
CHPL_COMM: none
CHPL_TASKS: qthreads
CHPL_LAUNCHER: none
CHPL_TIMERS: generic
CHPL_UNWIND: none
CHPL_MEM: jemalloc
CHPL_ATOMICS: cstdlib
CHPL_GMP: bundled
CHPL_HWLOC: bundled
CHPL_RE2: bundled
CHPL_LLVM: bundled *
CHPL_AUX_FILESYS: none
```

- Back-end compiler and version, e.g. `gcc --version` or `clang --version`: LLVM 18
Some other data points/thoughts:
- Making the first loop a `foreach` also works.
- Between `forall`-as-the-first-loop and `foreach`-as-the-first-loop, there's an LICM difference in the second loop: the former causes `A` and `B` to not be LICM'ed, leaving array metadata inside the kernel, whereas in the latter case all we have is `ddata`s, and that works fine. LICM shouldn't impact correctness, though, so what the real bug is remains curious.
- I am curious whether having `A` and `B` as remote-declared variables vs. local ones in scope is impacting LICM, rather than being the root cause of the issue itself. IOW, the remote-declared-ness of these variables may only matter because it produces a different AST structure.
- On newer CUDAs, I actually see `misaligned address`, which is much harder to debug. I wonder if we should try to debug this on an older CUDA with cuda-gdb to understand what's wrong.
General info on passing arrays as a whole (the array record) to kernels:
- Our implementation there has always been a bit shaky, which gets papered over by aggressive LICM for the most part.
- A GPU array has a CPU-based record that wraps a GPU-based class. We should be able to pass C structs/Chapel records directly to CUDA kernels since they are stack-allocated. However, my understanding could be wrong for the CUDA driver API we use: we pass addresses of parameters to the kernel launch API, and it doesn't make much sense to pass the address of something stack-allocated, like the array record.
- To address that, we are supposed to pass array records by offload: we allocate memory on the device, bit-copy the record, and use that copy. Is something going wrong there?
> On newer CUDAs, I actually see misaligned address, which is much harder to debug. I wonder if we should try to debug this on an older CUDA with cuda-gdb to understand what's wrong.
Just noting that I saw this as well. Sometimes the runs would be "illegal memory access" and sometimes it was "misaligned address"
Are N-dimensional domains still only parallel over the first dimension on GPUs?
Yes. See https://github.com/chapel-lang/chapel/issues/22152 and https://github.com/chapel-lang/chapel/issues/24331
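To illustrate the implication (a hypothetical sketch, not from this issue): since only the first dimension is parallelized across GPU threads, a 2D `foreach` over an `n` x `m` domain gets only `n`-way parallelism. Manually flattening the iteration to a 1D range recovers `n*m`-way parallelism:

```chapel
config const n = 10, m = 10;
on here.gpus[0] {
  var A: [0..<n, 0..<m] int;
  // A 2D `foreach` over A's domain would parallelize only over the first
  // dimension (n threads). Flattening to a 1D range exposes n*m-way
  // parallelism; each thread recovers its 2D index arithmetically.
  foreach i in 0..<n*m {
    const (r, c) = (i / m, i % m);
    A[r, c] = i;
  }
}
```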
This might get lost in a previous comment I made, but based on your recollection (not asking you to rerun anything) @jabraham17 would it be correct to say that using foreach for both loops is the acceptable workaround for the scenario in the OP?
> This might get lost in a previous comment I made, but based on your recollection (not asking you to rerun anything) @jabraham17 would it be correct to say that using foreach for both loops is the acceptable workaround for the scenario in the OP?
Yes, using only foreach for both loops is a good workaround for this issue
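Concretely, the workaround applied to the original reproducer looks like this (a sketch, untested):

```chapel
const D = {0..<10, 0..<10};
on here.gpus[0] var A: [D] bool;
on here.gpus[0] var B: [D] bool;
on here.gpus[0] {
  const DD = D; // localize domain
  // Workaround: use `foreach` for both loops instead of `forall` + `foreach`.
  foreach idx in DD do B = A[idx];
  var neq: [DD] bool;
  foreach idx in DD do neq[idx] = A[idx] != B[idx];
}
```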