
2D cumsum throwing GPU Kernel Exception

Open rkierulf opened this issue 9 months ago • 6 comments

Several recent builds for KomaMRI.jl have begun failing with AMDGPU on Julia 1.10. Examples:

https://buildkite.com/julialang/komamri-dot-jl/builds/1418#0195b5f6-5b8f-446e-9800-f59c29ffe098
https://buildkite.com/julialang/komamri-dot-jl/builds/1420#0195ba9c-a682-4918-8cba-97d030849721
https://buildkite.com/julialang/komamri-dot-jl/builds/1417#0195b5d5-72cc-4352-b06e-489fd9865dbf

The line where it fails is here: https://github.com/JuliaHealth/KomaMRI.jl/blob/master/KomaMRICore/src/simulation/SimMethods/BlochDict/BlochDict.jl#L53

This line is just calling cumsum on a 1D ROCArray of Float32 values, and the array is also a view within a larger array. Without having access to an AMD GPU, I can't investigate much further. I wonder if this would be enough to reproduce the issue:

using AMDGPU

A = ROCArray(rand(Float32, 1000))
B = view(A, 500:600)  # the failing array is a view into a larger array
C = cumsum(B)

rkierulf avatar Mar 21 '25 21:03 rkierulf

Hm... Interesting. I cannot reproduce the error with the MWE you provided (my machine also runs as part of the CI).

Can you maybe output the array before cumsum to get the exact values? Maybe right before this line: https://github.com/JuliaHealth/KomaMRI.jl/blob/dc943ed3c657d9d37fbe62d706536dc4c3ea18ec/KomaMRICore/src/simulation/SimMethods/BlochDict/BlochDict.jl#L54

pxl-th avatar Mar 22 '25 20:03 pxl-th

It looks like the array is just Float32[1.0f-14, 1.0f-14]: https://buildkite.com/julialang/komamri-dot-jl/builds/1421#0195c05a-b862-499f-beda-971831f858a9. There is also a warning before about global hostcalls.

rkierulf avatar Mar 23 '25 00:03 rkierulf

Ah... My bad, the error is not actually related to cumsum: GPU exceptions are asynchronous and are only checked before every kernel launch, so the fact that it errors before calling cumsum means the exception happened in an earlier kernel. Can you add the environment variable HIP_LAUNCH_BLOCKING=1 here to synchronize immediately after every kernel launch? Then we'll see what's actually causing it.
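For a local run, something along these lines should do it (just a sketch; the variable only needs to be set before the HIP runtime initializes):

# Sketch: synchronize after every kernel launch so the exception is
# reported at the launch that actually caused it.
ENV["HIP_LAUNCH_BLOCKING"] = "1"

using Pkg
Pkg.test("KomaMRICore")  # or any other code path that reaches the failing kernel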

Additionally, the fact that malloc hostcalls are launched means that some kernels emit exception-related code that captures the original value (e.g. during rounding or conversion). In this case it's better to use GPU-friendly functions to avoid emitting such code. For example:

gpu_floor(T, x) = unsafe_trunc(T, floor(x))  # floor + convert without the InexactError throw path
gpu_ceil(T, x) = unsafe_trunc(T, ceil(x))    # same for ceil
gpu_cld(x, y::T) where T = (x + y - one(T)) ÷ y  # ceiling division for positive integers

You can also compare the @code_llvm output to see how much less code they generate vs the original floor, ceil, cld.
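For instance, a quick sketch (using the gpu_floor helper defined above):

using InteractiveUtils  # provides @code_llvm

gpu_floor(T, x) = unsafe_trunc(T, floor(x))

@code_llvm gpu_floor(Int32, 1.5f0)  # plain truncation of the floored value
@code_llvm floor(Int32, 1.5f0)      # additionally carries the InexactError check/throw path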

pxl-th avatar Mar 23 '25 10:03 pxl-th

Ok, it appears the issue is with a different cumsum here: https://github.com/JuliaHealth/KomaMRI.jl/blob/master/KomaMRIBase/src/timing/TrapezoidalIntegration.jl#L49. This is a 2D cumsum of a matrix across the second dimension. Let me know if you are unable to reproduce on your machine and I can try printing the matrix values beforehand.

rkierulf avatar Mar 23 '25 21:03 rkierulf

Still cannot reproduce... If you can print out the values, maybe that will help.

pxl-th avatar Mar 24 '25 13:03 pxl-th

This build has the values printed: https://buildkite.com/julialang/komamri-dot-jl/builds/1428#0195d002-435b-400a-9d1b-1df5624de035. The matrix before the call to cumsum where it crashes has shape 1 x 548 and consists of all-zero Float32 values. I also noticed the result is assigned to the same matrix the cumsum is computed on: y = cumsum(y, dims=2); not sure if that affects anything. Also, this is happening inside tests, so it could be affected by --check-bounds, which I think is always set to yes in package test environments.
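In case it helps, my guess at a standalone version of that call would be something like this (I can't run it myself without an AMD GPU, so treat it as a sketch):

using AMDGPU

# 1 x 548 matrix of Float32 zeros, matching the values printed in the build log
y = ROCArray(zeros(Float32, 1, 548))

# Same pattern as TrapezoidalIntegration.jl: cumsum along dims=2, assigned back
# to the same variable; running with --check-bounds=yes would mirror the
# package test environment.
y = cumsum(y, dims=2)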

If you still can't reproduce, I don't think this is a major issue for KomaMRI.jl, since it doesn't affect the default Bloch simulation method (@cncastillo feel free to weigh in), so it should be OK to treat as lower priority.

rkierulf avatar Mar 26 '25 04:03 rkierulf

I see the latest KomaMRI AMDGPU tests pass, so I assume this is no longer a problem. Feel free to re-open if needed.

pxl-th avatar Jul 16 '25 21:07 pxl-th

Yes, this does in fact work perfectly now! Thanks!

We are still "hacking" a solution for oneAPI.jl. Related PR: https://github.com/JuliaGPU/GPUArrays.jl/pull/568. But there's no real urgency for that one; most people will probably use CUDA, AMDGPU, or Metal.

cncastillo avatar Jul 16 '25 21:07 cncastillo