Enzyme.jl

Enzyme + KA Stalls on Error instead of reporting it

Open pxl-th opened this issue 3 years ago • 11 comments

When device = CPU() and Julia is started with more than one thread (e.g. -t16), the program stalls.

MWE:

using Enzyme
using KernelAbstractions
using KernelGradients

linear_threads(::CPU) = Threads.nthreads()
Base.zeros(::CPU, ::Type{T}, shape) where T = zeros(T, shape)
Base.ones(::CPU, ::Type{T}, shape) where T = ones(T, shape)
Base.rand(::CPU, ::Type{T}, shape) where T = rand(T, shape)

function ∇spherical_harmonics!(∂L∂x, ∂L∂y, x, y, device)
    n = size(x, 2)
    ∇! = Enzyme.autodiff(
        spherical_harmonics_kernel!(device, linear_threads(device)))
    wait(∇!(Duplicated(y, ∂L∂y), Duplicated(x, ∂L∂x); ndrange=n))
end

@kernel function spherical_harmonics_kernel!(encodings, @Const(directions))
    i = @index(Global)
    x = directions[1, i]
    y = directions[2, i]
    z = directions[3, i]

    encodings[1, i] = 0.28209479177387814f0
    encodings[2, i] = -0.48860251190291987f0 * y
    encodings[3, i] = 0.48860251190291987f0 * z
    encodings[4, i] = -0.48860251190291987f0 * x
end

function main()
    device = CPU()
    n = 1024

    x = rand(device, Float32, (3, n))
    y = zeros(device, Float32, (4, n))
    ∂L∂y = ones(device, Float32, (4, n))
    ∂L∂x = zeros(device, Float32, (3, n))

    ∇spherical_harmonics!(∂L∂x, ∂L∂y, x, y, device)
end
main()

Details:

  • Julia 1.8.0-rc1
  • ]st
  [7da242da] Enzyme v0.10.1
  [63c18a36] KernelAbstractions v0.8.2
  [e5faadeb] KernelGradients v0.1.2

pxl-th avatar Jun 17 '22 12:06 pxl-th

https://github.com/JuliaGPU/KernelAbstractions.jl/issues/298 seems to be a related issue.

pxl-th avatar Jun 17 '22 12:06 pxl-th

Can you redirect the output to a file and post that?

And you are saying that instead of terminating upon error it's just hanging and waiting?

vchuravy avatar Jun 17 '22 14:06 vchuravy

Ah no, it's stalling on the CPU and erroring on the GPU.

vchuravy avatar Jun 17 '22 14:06 vchuravy

@vchuravy here's the output: error.txt. It is for when device = CUDADevice().

pxl-th avatar Jun 17 '22 15:06 pxl-th

Thanks. For the CPU part, can you try https://docs.julialang.org/en/v1.8.0-rc1/stdlib/Profile/#Triggered-During-Execution

vchuravy avatar Jun 17 '22 15:06 vchuravy

Actually for the CPU I just needed to run Julia with one thread (as opposed to auto): cpu-error.txt

pxl-th avatar Jun 17 '22 15:06 pxl-th

Just as a note, see this line in the output:

@exception9 = private unnamed_addr constant [25 x i8] c"undefined variable error\00", align 1

That means you have an undefined variable in the kernel, likely directions and encodings.
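
One way to spot such lines in a large dump is to scan it for embedded exception strings. The sketch below is only an illustration: `find_exception_messages` is a hypothetical helper, and the `@exceptionN = ... c"...\00"` pattern is assumed from the single line quoted above, so other compiler versions may format these constants differently.

```julia
# Hypothetical helper: collect the messages of `@exceptionN` string
# constants found in dumped LLVM IR (pattern assumed from the line above).
function find_exception_messages(io::IO)
    msgs = String[]
    for line in eachline(io)
        # Capture the text between c" and the trailing \00" terminator.
        m = match(r"^@exception\d*\s*=.*c\"(.*?)\\00\"", line)
        m === nothing || push!(msgs, m.captures[1])
    end
    return msgs
end

# Usage on the line quoted in this thread:
sample = raw"@exception9 = private unnamed_addr constant [25 x i8] c\"undefined variable error\00\", align 1"
println(find_exception_messages(IOBuffer(sample)))
```

For a saved log such as error.txt, `open(find_exception_messages, "error.txt")` would apply the same scan to the file.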

vchuravy avatar Jun 17 '22 15:06 vchuravy

Yikes! That was indeed the problem :)

pxl-th avatar Jun 17 '22 16:06 pxl-th

@vchuravy thanks! :)

pxl-th avatar Jun 17 '22 16:06 pxl-th

Well, we shouldn't stall but actually error... So something dastardly is going on.

vchuravy avatar Jun 17 '22 16:06 vchuravy

I've updated the MWE. Now there are no errors, but when Julia is started with more than one thread on the CPU it stalls. If Julia is started with only one thread, it completes fine.

CUDADevice is fine.

pxl-th avatar Jun 17 '22 22:06 pxl-th

@pxl-th does this still error for you?

wsmoses avatar Aug 18 '23 07:08 wsmoses

Hm... actually yes, just tried it on 1.10-beta2.

pxl-th avatar Sep 02 '23 20:09 pxl-th

My bad, I forgot that you don't need KernelGradients now, so it installed an old version. With the updated code it works.

pxl-th avatar Oct 30 '23 16:10 pxl-th
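
The updated code was not posted in the thread. As a rough sketch of what a KernelGradients-free version of the MWE might look like, the following assumes a recent KernelAbstractions release (0.9 or later, where kernel launches are synchronized with `KernelAbstractions.synchronize` instead of `wait`) whose Enzyme extension supports reverse mode on the CPU backend. The wrapper name `spherical_harmonics_caller!` is hypothetical, and the exact package versions required are an assumption.

```julia
using Enzyme
using KernelAbstractions

@kernel function spherical_harmonics_kernel!(encodings, @Const(directions))
    i = @index(Global)
    x = directions[1, i]
    y = directions[2, i]
    z = directions[3, i]

    encodings[1, i] = 0.28209479177387814f0
    encodings[2, i] = -0.48860251190291987f0 * y
    encodings[3, i] = 0.48860251190291987f0 * z
    encodings[4, i] = -0.48860251190291987f0 * x
end

# Hypothetical wrapper: launch the kernel and block until it finishes,
# so that Enzyme differentiates one ordinary Julia function.
function spherical_harmonics_caller!(y, x, backend)
    n = size(x, 2)
    kernel! = spherical_harmonics_kernel!(backend, Threads.nthreads())
    kernel!(y, x; ndrange=n)
    KernelAbstractions.synchronize(backend)
    return nothing
end

backend = CPU()
n = 1024
x = rand(Float32, 3, n)
y = zeros(Float32, 4, n)
∂L∂y = ones(Float32, 4, n)
∂L∂x = zeros(Float32, 3, n)

# Reverse mode: ∂L∂y seeds the output adjoint, ∂L∂x accumulates the result.
Enzyme.autodiff(Reverse, spherical_harmonics_caller!,
    Duplicated(y, ∂L∂y), Duplicated(x, ∂L∂x), Const(backend))
```

Wrapping the launch and the synchronization in one function keeps the differentiated call self-contained, which is the main difference from the `Enzyme.autodiff(kernel)` form used with KernelGradients in the original MWE.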