
SYCL/HIP: "corrupted double-linked list" from hip_piProgramRelease

Open al42and opened this issue 2 years ago • 3 comments

Describe the bug

After the program completes, while all the resources are being deinitialized, the process aborts with "corrupted double-linked list".

The error is semi-random. It seems much more likely to be triggered when multiple processes are using the GPU, although it also happens with a single process.

To Reproduce

Build IntelLLVM a1b42aa6037aba9b86d40d8c1c59c0dc2f941481 with HIP (ROCm 5.0.2) and OpenMP. Older versions (ca. March 2022) also suffer from the same problem, but I have not bisected any earlier.

Code: https://gist.github.com/al42and/1bbaf3df22d1af5382cf9f40056cc5b2. It just runs a simple kernel a few times in a loop.

  • clang++ -fsycl -fsycl-targets=amdgcn-amd-amdhsa -Xsycl-target-backend --offload-arch=gfx906 run.cpp -o run
  • for i in $(seq 1 100); do SYCL_DEVICE_FILTER=hip:gpu ./run ; done
  • Wait a few seconds.
  • "corrupted double-linked list" errors start appearing.
  • Running multiple processes in parallel seems to increase the likelihood of a crash: for i in $(seq 1 10); do SYCL_DEVICE_FILTER=hip:gpu ./run & done. But I have not thoroughly confirmed that.
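
Since the failure is intermittent, it can help to count how often a run fails rather than watch the terminal. The following is a minimal sketch, not part of the original reproducer: count_failures is a hypothetical helper name, and it assumes that a crashing run exits with a non-zero status (a process killed by abort() is reported by the shell as status 134).

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the original report): run a command
# a given number of times and print how many runs exited non-zero.
count_failures() {
    local runs="$1"
    shift
    local failures=0
    local i
    for ((i = 0; i < runs; i++)); do
        # Discard the command's output; only its exit status is inspected.
        if ! "$@" >/dev/null 2>&1; then
            failures=$((failures + 1))
        fi
    done
    echo "$failures"
}
```

Assuming ./run was built as in the first step, `count_failures 100 env SYCL_DEVICE_FILTER=hip:gpu ./run` prints the number of failed runs out of 100. Passing the variable through env keeps the environment handling explicit instead of relying on how the shell exports assignments placed before a function call.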

The GDB stack trace looks as follows:

(gdb) bt
#0  0x00007ffff766003b in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007ffff763f859 in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x00007ffff76aa29e in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#3  0x00007ffff76b232c in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#4  0x00007ffff76b297c in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#5  0x00007ffff76b2adf in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#6  0x00007ffff76b4010 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#7  0x00007ffff694bb69 in ?? () from /opt/tcbsys/rocm/5.0.2/hip/lib/libamdhip64.so.5
#8  0x00007ffff6946c71 in ?? () from /opt/tcbsys/rocm/5.0.2/hip/lib/libamdhip64.so.5
#9  0x00007ffff6946f19 in ?? () from /opt/tcbsys/rocm/5.0.2/hip/lib/libamdhip64.so.5
#10 0x00007ffff694d5d6 in ?? () from /opt/tcbsys/rocm/5.0.2/hip/lib/libamdhip64.so.5
#11 0x00007ffff67c3660 in ?? () from /opt/tcbsys/rocm/5.0.2/hip/lib/libamdhip64.so.5
#12 0x00007ffff6786d8f in ?? () from /opt/tcbsys/rocm/5.0.2/hip/lib/libamdhip64.so.5
#13 0x00007ffff6786fc9 in ?? () from /opt/tcbsys/rocm/5.0.2/hip/lib/libamdhip64.so.5
#14 0x00007ffff68a6ced in ?? () from /opt/tcbsys/rocm/5.0.2/hip/lib/libamdhip64.so.5
#15 0x00007ffff6883a4f in hipModuleUnload () from /opt/tcbsys/rocm/5.0.2/hip/lib/libamdhip64.so.5
#16 0x00007ffff75c69f9 in hip_piProgramRelease () from /nethome/aland/modules/intel-llvm/20220527-a1b42aa6-rocm5.0/lib/libpi_hip.so
#17 0x00007ffff7ae5416 in cl::sycl::detail::KernelProgramCache::~KernelProgramCache() () from /nethome/aland/modules/intel-llvm/20220527-a1b42aa6-rocm5.0/lib/libsycl.so.5
#18 0x00007ffff7aa9a6c in cl::sycl::detail::context_impl::~context_impl() () from /nethome/aland/modules/intel-llvm/20220527-a1b42aa6-rocm5.0/lib/libsycl.so.5
#19 0x00007ffff7ad2c7c in cl::sycl::detail::releaseDefaultContexts() () from /nethome/aland/modules/intel-llvm/20220527-a1b42aa6-rocm5.0/lib/libsycl.so.5
#20 0x00007ffff7ad2d1d in cl::sycl::detail::DefaultContextReleaseHandler::~DefaultContextReleaseHandler() () from /nethome/aland/modules/intel-llvm/20220527-a1b42aa6-rocm5.0/lib/libsycl.so.5
#21 0x00007ffff76638d7 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#22 0x00007ffff7663a90 in exit () from /lib/x86_64-linux-gnu/libc.so.6
#23 0x00007ffff76410ba in __libc_start_main () from /lib/x86_64-linux-gnu/libc.so.6
#24 0x000000000040360e in _start ()

Environment:

  • Ubuntu Linux, kernel 5.4.0-99-generic
  • Target device and vendor: AMD MI50 GPU
  • DPC++ version: a1b42aa6037aba9b86d40d8c1c59c0dc2f941481
  • Dependencies version: ROCm 5.0.2

al42and avatar May 27 '22 16:05 al42and

I couldn't reproduce the error. Hopefully, others can. My setup: clang++ 9a9a7a4026a0fe9892a275fe70e1c8330af89792 (around 5/13), device gfx908, ROCm 4.5.2.

zjin-lcf avatar May 27 '22 23:05 zjin-lcf

The problem still reproduces on the same machine with cc03176dc3c938aa9fef808d57471d540b69931f, with both ROCm 4.5.2 and ROCm 5.0.2, but it is much rarer with the latter.

al42and avatar Sep 16 '22 15:09 al42and

And with the latest ROCm version?

keryell avatar Sep 20 '22 18:09 keryell

On a different machine with gfx1032 and ROCm 5.2.0, it does not reproduce. On the original one with gfx906, I will need some time to do the update.

al42and avatar Sep 26 '22 13:09 al42and

On the original machine with 1c3d598 (2022-10-06), ROCm 5.3.0, gfx906, and kernel 5.15.0-48, it does not reproduce anymore. I guess either the ROCm upgrade or the kernel upgrade did the trick.

al42and avatar Oct 07 '22 10:10 al42and