AMDGPU.jl

Error triggered by synchronize()

Open williamfgc opened this issue 1 year ago • 3 comments

I think I'm missing something basic with synchronization.

When a simple @roc kernel launch inside a function is followed by AMDGPU.synchronize(), the synchronize call throws an error. The full stacktrace can be seen in our CI, which runs a recent AMDGPU.jl (v0.8.6) on an MI100 with ROCm 6. I don't know whether the first AMDGPU.jl frame in the stacktrace provides any hints: [4] synchronize (repeats 2 times) @ ~/.julia/packages/AMDGPU/rrvsy/src/highlevel.jl:49 [inlined].

Works:

 @roc groupsize = threads gridsize = threads * blocks _parallel_for_amdgpu(f, x...)
end

Fails:

  @roc groupsize = threads gridsize = threads * blocks _parallel_for_amdgpu(f, x...)
  AMDGPU.synchronize()
end

For reference the CUDA code works fine:

  CUDA.@sync @cuda threads = threads blocks = blocks _parallel_for_cuda(f, x...)
end

Any help would be appreciated!

williamfgc avatar Feb 23 '24 20:02 williamfgc

It means that one of the kernels you ran triggered an exception. Sadly, at the moment the error doesn't say much (just GPU Kernel Exception). I had to comment out these lines (link, link) because the functions that participate in exception reporting are not inlined, which leads to maximum scratch memory usage, and that caused issues on the MI-series GPUs.

But you can try uncommenting them and running again to trigger the exception and see in detail what's causing it.

VectorAddLambda: Error During Test at /home/wfg/github-runners/cousteau-JACC/ci/_work/JACC.jl/JACC.jl/test/tests_amdgpu.jl:10
  Got exception outside of a @test
  GPU Kernel Exception
  Stacktrace:
    [1] error(s::String)
      @ Base ./error.jl:35
    [2] throw_if_exception(dev::AMDGPU.HIP.HIPDevice)
      @ AMDGPU ~/.julia/packages/AMDGPU/rrvsy/src/exception_handler.jl:122
    [3] synchronize(stm::AMDGPU.HIP.HIPStream; blocking::Bool, stop_hostcalls::Bool)
      @ AMDGPU ~/.julia/packages/AMDGPU/rrvsy/src/highlevel.jl:53
    [4] synchronize (repeats 2 times)
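
Because kernel launches are asynchronous, an exception raised inside a kernel only surfaces on the host at the next synchronization point, which is why the error appears at AMDGPU.synchronize() and not at the @roc line. Below is a minimal sketch of that behavior. It is not the JACC code: the kernel add_one! and the array x are hypothetical, and it assumes that gridsize counts workgroups, which is worth checking against the installed AMDGPU.jl version.

```julia
using AMDGPU

# Hypothetical kernel for illustration (not from the issue): adds 1 to each element.
function add_one!(x)
    # Global 1-based index of this workitem.
    i = workitemIdx().x + (workgroupIdx().x - 1) * workgroupDim().x
    # Guard against workitems that fall outside the array. An unguarded
    # out-of-range access is one common way to trigger a kernel exception.
    if i <= length(x)
        @inbounds x[i] += 1f0
    end
    return
end

x = ROCArray(zeros(Float32, 1024))
threads = 256
# Assumption: gridsize is the number of workgroups, not the total workitem count.
blocks = cld(length(x), threads)

# The launch returns immediately; kernel errors are not reported here.
@roc groupsize = threads gridsize = blocks add_one!(x)

try
    # Any exception raised by the kernel is rethrown on the host at this point.
    AMDGPU.synchronize()
catch err
    @error "Kernel failed" exception = (err, catch_backtrace())
    rethrow()
end
```

Synchronizing right after each launch, as in this sketch, narrows down which kernel raised the exception when several are queued.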

pxl-th avatar Feb 23 '24 20:02 pxl-th

@pxl-th thanks for the guidance, I will give it a try and report back.

williamfgc avatar Feb 23 '24 21:02 williamfgc

To make it easier, I've pushed a branch, pxl-th/exception, that has proper exception reporting, so you can use it for debugging.

pxl-th avatar Feb 24 '24 08:02 pxl-th