Tullio.jl Simple gpu loop on CUDA does not return

Open • pagnani opened this issue 3 years ago • 1 comment

On Julia 1.7.2, in a new environment containing only the packages listed below, the following:

using Tullio, CUDA, LoopVectorization, CUDAKernels, KernelAbstractions
function gpr(N, L)
    Jseq = rand(Float32, N + 2, N + 2, L, L, 2, 2) |> cu
    conditional = rand(Float32, N + 2, N + 2, L, L, 2, 2) |> cu
    @tullio g[nl, nl1, l, xl, xl1] := conditional[ni, nl, i, l, xi, xl] * Jseq[ni, nj, i, j, xi, xj] * conditional[nj, nl1, j, l+1, xj, xl1] * (i <= l) * (j > l) * (j > i + 1)
    
    return g
end
julia> N=5; L=3; gpr(N,L)

never returns (and GPU usage stays at 100%).

Pkg status:

  [052768ef] CUDA v3.8.0
  [72cfdca4] CUDAKernels v0.3.3
  [63c18a36] KernelAbstractions v0.7.2
  [bdcacae8] LoopVectorization v0.12.101
  [bc48ee85] Tullio v0.3.3

CuDevice(0): TITAN RTX CUDA 11.0.0

Thanks a lot!

Feb 14 '22 14:02 pagnani

Thanks for the report. I can reproduce this, but have no idea what causes it.

It works on the CPU, with threads=false (to make it use KernelAbstractions) and verbose=true (to confirm which code path runs):

julia> N=5; L=3; gpr(N,L)
┌ Info: left index ranges
│   nl = Base.OneTo(7)
│   nl1 = Base.OneTo(7)
│   l = 1:2
│   xl = Base.OneTo(2)
└   xl1 = Base.OneTo(2)
┌ Info: reduction index ranges
│   ni = Base.OneTo(7)
│   i = Base.OneTo(3)
│   xi = Base.OneTo(2)
│   nj = Base.OneTo(7)
│   j = Base.OneTo(3)
└   xj = Base.OneTo(2)
[ Info: running KernelAbstractions CPU actor 
7×7×2×2×2 Array{Float32, 5}:
[:, :, 1, 1, 1] =
 19.8817  22.2586  23.2881  20.2121  19.9547  22.5193  20.0603
 ...
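
As a side check on the ranges printed above, here is a minimal sketch in plain Julia (no Tullio involved) of why l comes out as 1:2 and how many loop iterations the whole expression implies. The variable names are illustrative only, and the intersection rule is an assumption about how the range for l is derived from its use as both l and l+1:

```julia
# Sizes from the report.
N, L = 5, 3

# `l` indexes a dimension of length L both as `l` and as `l + 1`,
# so its valid range is the intersection of 1:L and (1:L) .- 1.
l_range = intersect(1:L, (1:L) .- 1)

# Left indices: nl, nl1, l, xl, xl1.
left = (N + 2)^2 * length(l_range) * 2 * 2
# Reduction indices: ni, nj, i, j, xi, xj.
reduction = (N + 2)^2 * L^2 * 2 * 2

total_iterations = left * reduction
println("l range: ", l_range, ", total iterations: ", total_iterations)
```

For these sizes the total is well under a million iterations, so the amount of arithmetic alone would not explain a kernel that never returns.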

On the GPU, it still seems to hang if I comment out * (i <= l) * (j > l) * (j > i + 1).

I wonder if this is just too many loops for KA to handle, or whether it hits some optimisation step whose cost grows, say, factorially with the number of loops. 11 nested loops is quite deep, and it may be that nobody has tested that many. If so, the next step is probably to run it with verbose=2, which will print out the kernel being used, and from that try to reproduce the hang without Tullio.

Feb 19 '22 17:02 mcabbott
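
For anyone trying to narrow this down, a plain-loop CPU reference can help confirm what the expression should compute, independently of Tullio and KernelAbstractions. The following is only a sketch: gpr_reference is a hypothetical helper name, not part of any package, and it assumes both arrays have the shape (N+2, N+2, L, L, 2, 2) used in the report.

```julia
# Hypothetical reference implementation of the contraction in the report,
# written as explicit loops on ordinary CPU arrays.
function gpr_reference(conditional::AbstractArray{T,6}, Jseq::AbstractArray{T,6}) where {T}
    Nn, _, L, _, X, _ = size(Jseq)
    # `l` only runs to L - 1 because the expression indexes with l + 1.
    g = zeros(T, Nn, Nn, L - 1, X, X)
    for xl1 in 1:X, xl in 1:X, l in 1:L-1, nl1 in 1:Nn, nl in 1:Nn
        acc = zero(T)
        for xj in 1:X, xi in 1:X, j in 1:L, i in 1:L
            # Same mask as (i <= l) * (j > l) * (j > i + 1).
            (i <= l && j > l && j > i + 1) || continue
            for nj in 1:Nn, ni in 1:Nn
                acc += conditional[ni, nl, i, l, xi, xl] *
                       Jseq[ni, nj, i, j, xi, xj] *
                       conditional[nj, nl1, j, l + 1, xj, xl1]
            end
        end
        g[nl, nl1, l, xl, xl1] = acc
    end
    return g
end

# Small check with all-ones inputs, using N = 5 and L = 3 as in the report.
conditional = ones(Float32, 7, 7, 3, 3, 2, 2)
Jseq = ones(Float32, 7, 7, 3, 3, 2, 2)
g = gpr_reference(conditional, Jseq)
println(size(g))
```

Because the summation order differs from whatever kernel Tullio generates, results for random inputs should be compared with ≈ rather than ==.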
