Tullio.jl
Simple gpu loop on CUDA does not return
On Julia 1.7.2, in a fresh environment containing only the packages listed below:
using Tullio, CUDA, LoopVectorization, CUDAKernels, KernelAbstractions

function gpr(N, L)
    Jseq = rand(Float32, N + 2, N + 2, L, L, 2, 2) |> cu
    conditional = rand(Float32, N + 2, N + 2, L, L, 2, 2) |> cu
    @tullio g[nl, nl1, l, xl, xl1] := conditional[ni, nl, i, l, xi, xl] * Jseq[ni, nj, i, j, xi, xj] * conditional[nj, nl1, j, l+1, xj, xl1] * (i <= l) * (j > l) * (j > i + 1)
    return g
end
julia> N=5; L=3; gpr(N,L)
This never returns, and GPU usage stays at 100%.
Pkg status:
[052768ef] CUDA v3.8.0
[72cfdca4] CUDAKernels v0.3.3
[63c18a36] KernelAbstractions v0.7.2
[bdcacae8] LoopVectorization v0.12.101
[bc48ee85] Tullio v0.3.3
CuDevice(0): TITAN RTX CUDA 11.0.0
Thanks a lot!
Thanks for the report. I can reproduce this, but have no idea what causes it.
It works on the CPU, with threads=false (to use KA) and verbose=true (to see which code path runs):
julia> N=5; L=3; gpr(N,L)
┌ Info: left index ranges
│ nl = Base.OneTo(7)
│ nl1 = Base.OneTo(7)
│ l = 1:2
│ xl = Base.OneTo(2)
└ xl1 = Base.OneTo(2)
┌ Info: reduction index ranges
│ ni = Base.OneTo(7)
│ i = Base.OneTo(3)
│ xi = Base.OneTo(2)
│ nj = Base.OneTo(7)
│ j = Base.OneTo(3)
└ xj = Base.OneTo(2)
[ Info: running KernelAbstractions CPU actor
7×7×2×2×2 Array{Float32, 5}:
[:, :, 1, 1, 1] =
19.8817 22.2586 23.2881 20.2121 19.9547 22.5193 20.0603
...
On the GPU, it still seems to hang if I comment out * (i <= l) * (j > l) * (j > i + 1).
I wonder if this is just too many loops for KA to handle, or whether it hits some optimisation step whose cost grows, e.g., factorially with the number of loops. 11 nested loops (5 output indices plus 6 reduction indices) is quite deep, and it may be that nobody has tested that many. If so, the next step is probably to run it with verbose=2, which will print out the kernel being used, from which we can try to reproduce this without Tullio.
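For checking results while debugging, the same contraction can be written out as explicit loops on the CPU using only Base Julia. This is a sketch, not the kernel Tullio or KA generates: the function name gpr_loops is made up for illustration, it takes plain Arrays rather than CuArrays, and it skips masked terms instead of multiplying by the Bool factors (equivalent here, since multiplying by false gives exactly zero).

```julia
# Plain-loop CPU reference for
#   g[nl, nl1, l, xl, xl1] := conditional[ni, nl, i, l, xi, xl] *
#       Jseq[ni, nj, i, j, xi, xj] * conditional[nj, nl1, j, l+1, xj, xl1] *
#       (i <= l) * (j > l) * (j > i + 1)
# `gpr_loops` is a hypothetical helper name, not part of Tullio.
function gpr_loops(conditional::AbstractArray{<:Number,6}, Jseq::AbstractArray{<:Number,6})
    n = size(Jseq, 1)   # N + 2 in the original example
    L = size(Jseq, 3)
    q = size(Jseq, 5)   # 2 in the original example
    T = promote_type(eltype(conditional), eltype(Jseq))
    # l only runs to L - 1 because the expression indexes l + 1
    g = zeros(T, n, n, L - 1, q, q)
    # five output loops
    for xl1 in 1:q, xl in 1:q, l in 1:L-1, nl1 in 1:n, nl in 1:n
        acc = zero(T)
        # six reduction loops
        for xj in 1:q, j in 1:L, nj in 1:n, xi in 1:q, i in 1:L, ni in 1:n
            # the three Bool factors act as a mask on (i, j, l)
            (i <= l && j > l && j > i + 1) || continue
            acc += conditional[ni, nl, i, l, xi, xl] *
                   Jseq[ni, nj, i, j, xi, xj] *
                   conditional[nj, nl1, j, l + 1, xj, xl1]
        end
        g[nl, nl1, l, xl, xl1] = acc
    end
    return g
end
```

With N=5 and L=3 the inputs are 7×7×3×3×2×2 and the output is 7×7×2×2×2, matching the ranges printed by verbose=true above. Once a GPU run does return, its result could be copied back with Array(g) and compared to this reference using ≈.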