HIP `hipStreamDestroy` hangs

Hi! I'm trying to understand why hipStreamDestroy hangs while waiting for an event to complete:

:3:hip_stream.cpp           :464 : 2813836027 us: [pid:29129 tid:0x7fec4bfadd00]  hipStreamDestroy ( stream:0x1688950 ) 
:4:command.cpp              :349 : 2813836030 us: [pid:29129 tid:0x7fec4bfadd00] Command (Marker) enqueued: 0x12980d0
:3:rocvirtual.cpp           :455 : 2813836036 us: [pid:29129 tid:0x7fec4bfadd00] Set Handler: handle(0x7febc7a86d80), timestamp(0x1249eb0)
:4:rocvirtual.cpp           :1011: 2813836040 us: [pid:29129 tid:0x7fec4bfadd00] HWq=0x7fea6c200000, BarrierAND Header = 0x1503 (type=3, barrier=1, acquire=2, release=2), dep_signal=[0x0, 0x0, 0x0, 0x0, 0x0], completion_signal=0x7febc7a86d80
:4:command.cpp              :253 : 2813836042 us: [pid:29129 tid:0x7fec4bfadd00] Waiting for event 0x12980d0 to complete, current status 2
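
The header value in the BarrierAND line can be cross-checked by hand. Below is a minimal sketch, assuming the standard HSA AQL packet-header layout (type in bits 0-7, barrier in bit 8, acquire fence scope in bits 9-10, release fence scope in bits 11-12). The names AqlHeader and decode_aql_header are made up for illustration and are not part of ROCclr.

```cpp
#include <cstdint>

// Decoded fields of a 16-bit AQL packet header (layout assumed from the
// HSA specification; names here are illustrative only).
struct AqlHeader {
    unsigned type;     // bits 0-7: packet type (3 = barrier-AND)
    unsigned barrier;  // bit 8: barrier bit
    unsigned acquire;  // bits 9-10: acquire fence scope
    unsigned release;  // bits 11-12: release fence scope
};

// Split the raw header word into its fields with shifts and masks.
constexpr AqlHeader decode_aql_header(std::uint16_t h) {
    return AqlHeader{
        h & 0xFFu,
        (h >> 8) & 0x1u,
        (h >> 9) & 0x3u,
        (h >> 11) & 0x3u,
    };
}
```

Applied to 0x1503 this gives type=3, barrier=1, acquire=2, release=2, which matches the log line, so the marker packet itself looks well-formed and the hang is more likely in its completion than in its encoding.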

This happens consistently on an RX7900XT with ROCm 5.6 and 5.7 on Linux kernels 5.19 and 6.2. An RX6700XT, by contrast, does not show the issue at all.

The MWE (written in Julia using AMDGPU.jl) creates 1 low-priority stream, 4 normal-priority streams and 4 high-priority streams, then destroys them. These are direct ccalls, so nothing Julia-specific should be involved. The same code does not reproduce the hang on Windows: the RX7900XT works fine there and passes all the AMDGPU.jl tests that we have.

MWE:

using AMDGPU
function main()
    sl1 = Ref{AMDGPU.HIP.hipStream_t}()
    AMDGPU.HIP.hipStreamCreateWithPriority(sl1, 0, 1) |> AMDGPU.check

    s1 = Ref{AMDGPU.HIP.hipStream_t}()
    AMDGPU.HIP.hipStreamCreateWithPriority(s1, 0, 0) |> AMDGPU.check
    s2 = Ref{AMDGPU.HIP.hipStream_t}()
    AMDGPU.HIP.hipStreamCreateWithPriority(s2, 0, 0) |> AMDGPU.check
    s3 = Ref{AMDGPU.HIP.hipStream_t}()
    AMDGPU.HIP.hipStreamCreateWithPriority(s3, 0, 0) |> AMDGPU.check
    s4 = Ref{AMDGPU.HIP.hipStream_t}()
    AMDGPU.HIP.hipStreamCreateWithPriority(s4, 0, 0) |> AMDGPU.check

    sh1 = Ref{AMDGPU.HIP.hipStream_t}()
    AMDGPU.HIP.hipStreamCreateWithPriority(sh1, 0, -1) |> AMDGPU.check
    sh2 = Ref{AMDGPU.HIP.hipStream_t}()
    AMDGPU.HIP.hipStreamCreateWithPriority(sh2, 0, -1) |> AMDGPU.check
    sh3 = Ref{AMDGPU.HIP.hipStream_t}()
    AMDGPU.HIP.hipStreamCreateWithPriority(sh3, 0, -1) |> AMDGPU.check
    sh4 = Ref{AMDGPU.HIP.hipStream_t}()
    AMDGPU.HIP.hipStreamCreateWithPriority(sh4, 0, -1) |> AMDGPU.check

    AMDGPU.HIP.hipStreamDestroy(sl1[]) |> AMDGPU.check

    AMDGPU.HIP.hipStreamDestroy(s1[]) |> AMDGPU.check
    AMDGPU.HIP.hipStreamDestroy(s2[]) |> AMDGPU.check
    AMDGPU.HIP.hipStreamDestroy(s3[]) |> AMDGPU.check
    AMDGPU.HIP.hipStreamDestroy(s4[]) |> AMDGPU.check

    AMDGPU.HIP.hipStreamDestroy(sh1[]) |> AMDGPU.check
    AMDGPU.HIP.hipStreamDestroy(sh2[]) |> AMDGPU.check
    AMDGPU.HIP.hipStreamDestroy(sh3[]) |> AMDGPU.check
    AMDGPU.HIP.hipStreamDestroy(sh4[]) |> AMDGPU.check
end
main()

Nov 23 '23 09:11 pxl-th

Here's a C++ reproducer:

#include <hip/hip_runtime.h>
#include <iostream>

// Report the HIP error string instead of a bare "Fail".
void check(hipError_t res) {
    if (res != hipSuccess) {
        std::cerr << "Fail: " << hipGetErrorString(res) << std::endl;
    }
}

int main(int argc, char* argv[]) {
    hipStream_t sl1;
    check(hipStreamCreateWithPriority(&sl1, 0, 1));

    hipStream_t s1;
    check(hipStreamCreateWithPriority(&s1, 0, 0));
    hipStream_t s2;
    check(hipStreamCreateWithPriority(&s2, 0, 0));
    hipStream_t s3;
    check(hipStreamCreateWithPriority(&s3, 0, 0));
    hipStream_t s4;
    check(hipStreamCreateWithPriority(&s4, 0, 0));

    hipStream_t sh1;
    check(hipStreamCreateWithPriority(&sh1, 0, -1));
    hipStream_t sh2;
    check(hipStreamCreateWithPriority(&sh2, 0, -1));
    hipStream_t sh3;
    check(hipStreamCreateWithPriority(&sh3, 0, -1));
    hipStream_t sh4;
    check(hipStreamCreateWithPriority(&sh4, 0, -1));

    check(hipStreamDestroy(sl1));

    check(hipStreamDestroy(s1));
    check(hipStreamDestroy(s2));
    check(hipStreamDestroy(s3));
    check(hipStreamDestroy(s4));

    check(hipStreamDestroy(sh1));
    check(hipStreamDestroy(sh2));
    check(hipStreamDestroy(sh3));
    check(hipStreamDestroy(sh4));

    return 0;
}

Nov 23 '23 10:11 pxl-th

Using different priorities is essential for this MWE, but it may not be the only trigger: the test suite also hangs when I skip the tests that exercise priorities, although that case is harder to narrow down.

Nov 23 '23 10:11 pxl-th

I'll work with the C++ example, since I do not have a Julia compiler with AMDGPU support installed.

I cannot seem to reproduce this. Can you share a bit more information, such as the ROCm version you are using and the GPU name?

Also, can you share the entire log for this sample, along with the log level used? It might help us trace the command that is getting stuck.

Nov 23 '23 10:11 cjatin

Let me know if you need any other info.

GPU: RX7900XT (gfx1100)
ROCm: 5.7.1 (from amdgpu-install script)
OS: Ubuntu 22.04
Kernel: 6.2.0-37-generic

Here's the full log from the C++ MWE: log.txt

Nov 23 '23 10:11 pxl-th

Actually, after rebooting the machine, neither the Julia nor the C++ MWE reproduces the hang. But once the AMDGPU.jl tests are run, they hang in hipStreamDestroy, and after that both MWEs reproduce it again.

Nov 23 '23 10:11 pxl-th

dmesg output is full of page faults after running tests:

[ 2367.046840] amdgpu 0000:2f:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:157 vmid:0 pasid:0, for process  pid 0 thread  pid 0)
[ 2367.046848] amdgpu 0000:2f:00.0: amdgpu:   in page starting at address 0x0000000000000000 from client 10
[ 2367.046851] amdgpu 0000:2f:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000B3A
[ 2367.046853] amdgpu 0000:2f:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x001c address=0xeb1d0000 flags=0x0000]
[ 2367.046854] amdgpu 0000:2f:00.0: amdgpu: 	Faulty UTCL2 client ID: CPC (0x5)
[ 2367.046858] amdgpu 0000:2f:00.0: amdgpu: 	MORE_FAULTS: 0x0
[ 2367.046860] amdgpu 0000:2f:00.0: amdgpu: 	WALKER_ERROR: 0x5
[ 2367.046862] amdgpu 0000:2f:00.0: amdgpu: 	PERMISSION_FAULTS: 0x3
[ 2367.046865] amdgpu 0000:2f:00.0: amdgpu: 	MAPPING_ERROR: 0x1
[ 2367.046867] amdgpu 0000:2f:00.0: amdgpu: 	RW: 0x0
[ 2367.046914] amdgpu 0000:2f:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x001c address=0xeb1d0000 flags=0x0020]
[ 2371.109083] amdgpu 0000:2f:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x001c address=0xeb1d0000 flags=0x0020]
[ 2376.497646] amdgpu 0000:2f:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:157 vmid:0 pasid:0, for process  pid 0 thread  pid 0)
[ 2376.497653] amdgpu 0000:2f:00.0: amdgpu:   in page starting at address 0x0000000000000000 from client 10
[ 2376.497657] amdgpu 0000:2f:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000B3A
[ 2376.497659] amdgpu 0000:2f:00.0: amdgpu: 	Faulty UTCL2 client ID: CPC (0x5)
[ 2376.497662] amdgpu 0000:2f:00.0: amdgpu: 	MORE_FAULTS: 0x0
[ 2376.497664] amdgpu 0000:2f:00.0: amdgpu: 	WALKER_ERROR: 0x5
[ 2376.497667] amdgpu 0000:2f:00.0: amdgpu: 	PERMISSION_FAULTS: 0x3
[ 2376.497669] amdgpu 0000:2f:00.0: amdgpu: 	MAPPING_ERROR: 0x1
[ 2376.497671] amdgpu 0000:2f:00.0: amdgpu: 	RW: 0x0
[ 2376.497833] amdgpu 0000:2f:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x001c address=0xeb1d0000 flags=0x0000]
[ 2376.497846] amdgpu 0000:2f:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x001c address=0xeb1d0000 flags=0x0020]
[ 2392.170827] amdgpu 0000:2f:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x001c address=0xeb1d0000 flags=0x0020]
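
The individual fields in the dmesg output can be re-derived from the raw GCVM_L2_PROTECTION_FAULT_STATUS word. The sketch below assumes a bit layout (MORE_FAULTS in bit 0, WALKER_ERROR in bits 1-3, PERMISSION_FAULTS in bits 4-7, MAPPING_ERROR in bit 8, client ID in bits 9-17, RW in bit 18) that reproduces the values printed above; it should be checked against the register headers for this GPU before being relied on. FaultStatus, bits and decode_fault_status are illustrative names, not driver symbols.

```cpp
#include <cstdint>

// Fields the amdgpu driver prints for a protection fault.
// Bit positions are an assumption that matches the log above.
struct FaultStatus {
    unsigned more_faults;        // bit 0
    unsigned walker_error;       // bits 1-3
    unsigned permission_faults;  // bits 4-7
    unsigned mapping_error;      // bit 8
    unsigned client_id;          // bits 9-17 (0x5 is reported as CPC above)
    unsigned rw;                 // bit 18 (0 = read)
};

// Extract `width` bits starting at bit `shift`.
constexpr unsigned bits(std::uint32_t v, unsigned shift, unsigned width) {
    return (v >> shift) & ((1u << width) - 1u);
}

constexpr FaultStatus decode_fault_status(std::uint32_t s) {
    return FaultStatus{
        bits(s, 0, 1),
        bits(s, 1, 3),
        bits(s, 4, 4),
        bits(s, 8, 1),
        bits(s, 9, 9),
        bits(s, 18, 1),
    };
}
```

For 0x00000B3A this yields WALKER_ERROR=0x5, PERMISSION_FAULTS=0x3, MAPPING_ERROR=0x1, client ID 0x5 and RW=0x0, the same values the driver logged, which makes it easy to compare fault words across runs without reading every line.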

Nov 23 '23 11:11 pxl-th

Rebuilding HIP in debug mode, we get a more detailed trace:

(gdb) bt
#0  0x00007f9791508c9b in sched_yield () at ../sysdeps/unix/syscall-template.S:120
#1  0x00007f96af332cd1 in amd::Os::yield () at /home/pxl-th/code/clr/rocclr/os/os_posix.cpp:418
#2  0x00007f96af33cad0 in amd::Event::awaitCompletion (this=0x28d63f0) at /home/pxl-th/code/clr/rocclr/platform/command.cpp:258
#3  0x00007f96af3443a1 in amd::HostQueue::terminate (this=0x1fba470) at /home/pxl-th/code/clr/rocclr/platform/commandqueue.cpp:67
#4  0x00007f96af36f403 in amd::ReferenceCountedObject::release (this=0x1fba470) at /home/pxl-th/code/clr/rocclr/platform/runtime.cpp:116
#5  0x00007f96af254028 in hip::Stream::Destroy (stream=0x1fba470) at /home/pxl-th/code/clr/hipamd/src/hip_stream.cpp:83
#6  0x00007f96af25bd68 in hipStreamDestroy (stream=0x1fba470) at /home/pxl-th/code/clr/hipamd/src/hip_stream.cpp:496
#7  0x00007f979011a307 in ?? ()
#8  0x0000000001170028 in ?? ()
#9  0x0000000000000000 in ?? ()
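
Frames #0 to #2 show the host thread polling the event and yielding between polls, so a marker that never completes appears as a hang. The following is a simplified standalone model of that pattern with a deadline added. It is not ROCclr's actual code: await_with_timeout and WaitResult are hypothetical names, and the convention that 0 means "complete" is an assumption for illustration.

```cpp
#include <atomic>
#include <chrono>
#include <thread>

enum class WaitResult { Completed, TimedOut };

// Poll `status` until it equals `done_value` or `timeout` elapses,
// yielding the CPU between polls (as sched_yield does in the trace).
inline WaitResult await_with_timeout(const std::atomic<int>& status,
                                     int done_value,
                                     std::chrono::milliseconds timeout) {
    const auto deadline = std::chrono::steady_clock::now() + timeout;
    while (status.load(std::memory_order_acquire) != done_value) {
        if (std::chrono::steady_clock::now() >= deadline) {
            return WaitResult::TimedOut;  // status never changed: report it
        }
        std::this_thread::yield();
    }
    return WaitResult::Completed;
}
```

With a status that stays at 2, as in "current status 2" from the first log, this returns WaitResult::TimedOut instead of spinning forever, which can be handy in a test harness that needs to flag the hang rather than block on it.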

Nov 23 '23 11:11 pxl-th

Just switched back to the RX6700XT: there are no page faults at all and everything works fine. The only message that shows up occasionally is:

[ 130.457568] amdgpu: Runlist is getting oversubscribed. Expect reduced ROCm performance.

Nov 23 '23 12:11 pxl-th

Steps to reproduce with Julia, just in case:

1. Download and unpack Julia 1.10: https://julialang-s3.julialang.org/bin/linux/x64/1.10/julia-1.10.0-rc1-linux-x86_64.tar.gz
2. Launch the Julia REPL with <julia-dir>/bin/julia --threads=auto.
3. Enter package mode with the ] key.
4. Add the AMDGPU.jl package: add AMDGPU#master.
5. Run the AMDGPU tests with test AMDGPU (they usually take ~10-15 minutes to complete).
6. At some point the tests will stop printing anything, which means one of the workers has hung.
7. Run the C++ MWE.

Nov 23 '23 12:11 pxl-th

Running the AMDGPU.jl tests with the HIP debug build gives the following:

┌ Error: Worker(pid=77537, terminated=true, termsignal=6) died running test item "kernelabstractions". Recording test error.
└ @ ReTestItems ~/.julia/packages/ReTestItems/HZCMZ/src/ReTestItems.jl:544

Captured logs for test setup "TSGPUArrays" (dependency of "gpuarrays - indexing find") at test/gpuarrays_tests.jl:1 on worker 77565
julia: /home/pxl-th/code/clr/rocclr/os/os_posix.cpp:310: static void amd::Os::currentStackInfo(unsigned char**, size_t*): Assertion `Os::currentStackPtr() >= *base - *size && Os::currentStackPtr() < *base && "just checking"' failed.

[77568] signal (6.-6): Aborted
in expression starting at /home/pxl-th/.julia/dev/AMDGPU/test/gpuarrays_tests.jl:4
pthread_kill at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
raise at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
abort at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
unknown function (ip: 0x7fd05c82871a)
__assert_fail at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
currentStackInfo at /home/pxl-th/code/clr/rocclr/os/os_posix.cpp:310
HostThread at /home/pxl-th/code/clr/rocclr/thread/thread.cpp:34
init at /home/pxl-th/code/clr/rocclr/thread/thread.cpp:170
init at /home/pxl-th/code/clr/rocclr/os/os_posix.cpp:170
init at /home/pxl-th/code/clr/rocclr/os/os_posix.cpp:136

Nov 23 '23 17:11 pxl-th

@pxl-th Apologies for the lack of response. Can you please test with the latest ROCm 6.1.0 (HIP 6.1)? If resolved, please close the ticket. Thanks!

Apr 29 '24 15:04 ppanchad-amd

Haven't seen hangs in a while, I think this is resolved! Thanks!

Jul 06 '24 10:07 pxl-th