AMDGPU.jl
Error triggered by synchronize()
I think I'm missing something basic with synchronization.
When we launch a simple @roc kernel inside a function, we get an error at the AMDGPU.synchronize() line. The stacktrace can be seen in our CI, which uses a recent AMDGPU.jl v0.8.6 on an MI100 with ROCm 6.
I don't know whether the first AMDGPU.jl frame in the stacktrace, [4] synchronize (repeats 2 times) @ ~/.julia/packages/AMDGPU/rrvsy/src/highlevel.jl:49 [inlined], provides any hints.
Works:
@roc groupsize = threads gridsize = threads * blocks _parallel_for_amdgpu(f, x...)
end
Fails:
@roc groupsize = threads gridsize = threads * blocks _parallel_for_amdgpu(f, x...)
AMDGPU.synchronize()
end
For reference the CUDA code works fine:
CUDA.@sync @cuda threads = threads blocks = blocks _parallel_for_cuda(f, x...)
end
Any help would be appreciated!
It means there's an exception that's triggered by one of the kernels you run.
Sadly, at the moment it doesn't say much (just GPU Kernel Exception). I had to comment out these lines (link, link) because the functions that participate in exception reporting are not inlined, which leads to maximum scratch memory usage, and that caused issues on the MI-series GPUs.
But you can try uncommenting them and running again to trigger the exception and see in detail what's causing it.
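To illustrate why the failure only shows up once synchronize() is added: kernel launches are asynchronous, and the stacktrace shows the exception check (throw_if_exception) running inside synchronize. Below is a minimal sketch, assuming an AMD GPU with a working ROCm install and bounds checking enabled. oob_kernel is a hypothetical kernel made up for illustration; it is not part of JACC.jl or AMDGPU.jl, and the out-of-bounds write is just one way to provoke a device-side exception. Nothing in this thread establishes the actual cause of the failure in the CI run.

```julia
using AMDGPU

# Hypothetical kernel (illustration only): writes out of bounds on purpose.
function oob_kernel(x)
    x[0] = 1f0   # index 0 is invalid for Julia's 1-based arrays
    return nothing
end

x = AMDGPU.zeros(Float32, 4)

# The launch is asynchronous and returns without reporting the fault.
@roc groupsize = 1 gridsize = 1 oob_kernel(x)

# The host checks for device-side exceptions when it synchronizes.
try
    AMDGPU.synchronize()
catch err
    @warn "Kernel exception surfaced at synchronize()" exception = err
end
```

This is likely also why the variant without synchronize() appears to work: the kernel can still fail on the device, but nothing on the host checks for it.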
VectorAddLambda: Error During Test at /home/wfg/github-runners/cousteau-JACC/ci/_work/JACC.jl/JACC.jl/test/tests_amdgpu.jl:10
Got exception outside of a @test
GPU Kernel Exception
Stacktrace:
[1] error(s::String)
@ Base ./error.jl:35
[2] throw_if_exception(dev::AMDGPU.HIP.HIPDevice)
@ AMDGPU ~/.julia/packages/AMDGPU/rrvsy/src/exception_handler.jl:122
[3] synchronize(stm::AMDGPU.HIP.HIPStream*** blocking::Bool, stop_hostcalls::Bool)
@ AMDGPU ~/.julia/packages/AMDGPU/rrvsy/src/highlevel.jl:53
[4] synchronize (repeats 2 times)
@pxl-th thanks for the guidance, I will give it a try and report back.
To make it easier, I've pushed a branch, pxl-th/exception, that has proper exception reporting, so you can use it for debugging.