SYCL/HIP: "corrupted double-linked list" from hip_piProgramRelease
Describe the bug
At program exit, while all resources are being deinitialized, the process aborts with "corrupted double-linked list".
The error is semi-random. It seems much more likely to be triggered when multiple processes are using the GPU, although it also happens with a single process.
To Reproduce
Build IntelLLVM a1b42aa6037aba9b86d40d8c1c59c0dc2f941481 with HIP (ROCm 5.0.2) and OpenMP. Older versions (ca. March 2022) also suffer from the same problem, but I have not bisected earlier.
Code: https://gist.github.com/al42and/1bbaf3df22d1af5382cf9f40056cc5b2. Just runs a simple kernel a few times in a loop.
- clang++ -fsycl -fsycl-targets=amdgcn-amd-amdhsa -Xsycl-target-backend --offload-arch=gfx906 run.cpp -o run
- for i in $(seq 1 100); do SYCL_DEVICE_FILTER=hip:gpu ./run ; done
- Wait a few seconds.
- "corrupted double-linked list" errors start appearing.
- Running multiple processes in parallel seems to increase the likelihood of a crash, but I have not thoroughly confirmed that:
  for i in $(seq 1 10); do SYCL_DEVICE_FILTER=hip:gpu ./run & done
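Since the crash is semi-random, it may help to count failures instead of watching the terminal. The following is only a sketch, not part of the original report: the helper names count_failures and count_parallel_failures are made up for illustration, and the sketch assumes that an aborted run shows up as a non-zero exit status.

```shell
#!/bin/sh
# Hypothetical helpers (not from the original report) for estimating how
# often a flaky command fails.

# count_failures N CMD [ARGS...]
# Runs CMD N times, one after another, and prints how many runs exited non-zero.
count_failures() {
    runs=$1
    shift
    failures=0
    i=0
    while [ "$i" -lt "$runs" ]; do
        # A process killed by abort() is reported with a non-zero status
        # (typically 134, i.e. 128 + SIGABRT).
        if ! "$@" >/dev/null 2>&1; then
            failures=$((failures + 1))
        fi
        i=$((i + 1))
    done
    echo "$failures"
}

# count_parallel_failures N CMD [ARGS...]
# Starts N copies of CMD at once, waits for each one, and prints how many failed.
count_parallel_failures() {
    runs=$1
    shift
    pids=""
    i=0
    while [ "$i" -lt "$runs" ]; do
        "$@" >/dev/null 2>&1 &
        pids="$pids $!"
        i=$((i + 1))
    done
    failures=0
    for pid in $pids; do
        # "wait PID" returns the exit status of that background process.
        if ! wait "$pid"; then
            failures=$((failures + 1))
        fi
    done
    echo "$failures"
}
```

With the binary from the steps above, the two cases could be compared as count_failures 100 env SYCL_DEVICE_FILTER=hip:gpu ./run and count_parallel_failures 10 env SYCL_DEVICE_FILTER=hip:gpu ./run. Passing the variable through env avoids relying on how the shell handles assignments in front of a function call.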
The GDB stack trace looks as follows:
(gdb) bt
#0 0x00007ffff766003b in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007ffff763f859 in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2 0x00007ffff76aa29e in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#3 0x00007ffff76b232c in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#4 0x00007ffff76b297c in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#5 0x00007ffff76b2adf in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#6 0x00007ffff76b4010 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#7 0x00007ffff694bb69 in ?? () from /opt/tcbsys/rocm/5.0.2/hip/lib/libamdhip64.so.5
#8 0x00007ffff6946c71 in ?? () from /opt/tcbsys/rocm/5.0.2/hip/lib/libamdhip64.so.5
#9 0x00007ffff6946f19 in ?? () from /opt/tcbsys/rocm/5.0.2/hip/lib/libamdhip64.so.5
#10 0x00007ffff694d5d6 in ?? () from /opt/tcbsys/rocm/5.0.2/hip/lib/libamdhip64.so.5
#11 0x00007ffff67c3660 in ?? () from /opt/tcbsys/rocm/5.0.2/hip/lib/libamdhip64.so.5
#12 0x00007ffff6786d8f in ?? () from /opt/tcbsys/rocm/5.0.2/hip/lib/libamdhip64.so.5
#13 0x00007ffff6786fc9 in ?? () from /opt/tcbsys/rocm/5.0.2/hip/lib/libamdhip64.so.5
#14 0x00007ffff68a6ced in ?? () from /opt/tcbsys/rocm/5.0.2/hip/lib/libamdhip64.so.5
#15 0x00007ffff6883a4f in hipModuleUnload () from /opt/tcbsys/rocm/5.0.2/hip/lib/libamdhip64.so.5
#16 0x00007ffff75c69f9 in hip_piProgramRelease () from /nethome/aland/modules/intel-llvm/20220527-a1b42aa6-rocm5.0/lib/libpi_hip.so
#17 0x00007ffff7ae5416 in cl::sycl::detail::KernelProgramCache::~KernelProgramCache() () from /nethome/aland/modules/intel-llvm/20220527-a1b42aa6-rocm5.0/lib/libsycl.so.5
#18 0x00007ffff7aa9a6c in cl::sycl::detail::context_impl::~context_impl() () from /nethome/aland/modules/intel-llvm/20220527-a1b42aa6-rocm5.0/lib/libsycl.so.5
#19 0x00007ffff7ad2c7c in cl::sycl::detail::releaseDefaultContexts() () from /nethome/aland/modules/intel-llvm/20220527-a1b42aa6-rocm5.0/lib/libsycl.so.5
#20 0x00007ffff7ad2d1d in cl::sycl::detail::DefaultContextReleaseHandler::~DefaultContextReleaseHandler() () from /nethome/aland/modules/intel-llvm/20220527-a1b42aa6-rocm5.0/lib/libsycl.so.5
#21 0x00007ffff76638d7 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#22 0x00007ffff7663a90 in exit () from /lib/x86_64-linux-gnu/libc.so.6
#23 0x00007ffff76410ba in __libc_start_main () from /lib/x86_64-linux-gnu/libc.so.6
#24 0x000000000040360e in _start ()
Environment:
- Ubuntu Linux, kernel 5.4.0-99-generic
- Target device and vendor: AMD MI50 GPU
- DPC++ version: a1b42aa6037aba9b86d40d8c1c59c0dc2f941481
- Dependencies version: ROCm 5.0.2
I couldn't reproduce the error. Hopefully, others can. My setup:
- clang++: 9a9a7a4026a0fe9892a275fe70e1c8330af89792 (around 5/13)
- device: gfx908
- ROCm: 4.5.2
The problem still reproduces on the same machine with cc03176dc3c938aa9fef808d57471d540b69931f, under both ROCm 4.5.2 and ROCm 5.0.2, but it is much rarer with the latter.
And with the latest ROCm version?
On a different machine with gfx1032 and ROCm 5.2.0, it does not reproduce. On the original one with gfx906, I will need some time to do the update.
On the original machine (1c3d598 from 2022-10-06, ROCm 5.3.0, gfx906, kernel 5.15.0-48), it does not reproduce anymore. I guess either the ROCm upgrade or the kernel upgrade did the trick.