GPUifyLoops.jl
Example showing 3D stencil calculations with a divergence
This example might be useful to others. I get a speedup of ~130x with this small kernel (compared to a single CPU core, so it's not really a fair comparison), so I think I did it right.
Resolves #12
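# Global (i, j, k) index of the current GPU thread. x comes first so that it
# matches Julia's column-major f[i, j, k] layout with size(f) == (Nx, Ny, Nz),
# and z uses the full block/thread formula so that Tz > 1 also works.
gpuIndex3D() = CartesianIndex(
    (blockIdx().x - 1) * blockDim().x + threadIdx().x,
    (blockIdx().y - 1) * blockDim().y + threadIdx().y,
    (blockIdx().z - 1) * blockDim().z + threadIdx().z
)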
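# Calculate the divergence of f at every point and store it in div_f.
# CartesianIndices(f) (rather than eachindex(f), which yields linear Int
# indices for an Array) gives CartesianIndex values on the CPU too, matching
# what gpuIndex3D() returns on the GPU.
@loop for I in (CartesianIndices(f); gpuIndex3D())
    @inbounds div_f[I] = div(f, Nx, Ny, Nz, Δx, Δy, Δz, I)
end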
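The PR does not show the body of `div`. To check GPU output against a CPU result, a minimal hypothetical reference could look like the sketch below. It assumes (none of this is from the PR) that the vector field is stored as three component arrays `u`, `v`, `w`, that the domain is periodic, and that the stencil uses second-order central differences; `divergence!` is a made-up name chosen to avoid clashing with `Base.div`.

```julia
# Hypothetical CPU reference for a periodic 3D divergence stencil.
# Assumes the vector field is given as component arrays u, v, w of size (Nx, Ny, Nz).
function divergence!(div_f, u, v, w, Δx, Δy, Δz)
    Nx, Ny, Nz = size(u)
    for I in CartesianIndices(u)
        i, j, k = Tuple(I)
        # Periodic neighbours: mod1 wraps N + 1 to 1 and 0 to N.
        ip, im = mod1(i + 1, Nx), mod1(i - 1, Nx)
        jp, jm = mod1(j + 1, Ny), mod1(j - 1, Ny)
        kp, km = mod1(k + 1, Nz), mod1(k - 1, Nz)
        # Second-order central differences of each component.
        du = (u[ip, j, k] - u[im, j, k]) / (2Δx)
        dv = (v[i, jp, k] - v[i, jm, k]) / (2Δy)
        dw = (w[i, j, kp] - w[i, j, km]) / (2Δz)
        div_f[I] = du + dv + dw
    end
    return div_f
end

# Usage: u = sin(x) on a periodic [0, 2π) grid, v = w = 0.
N = 16
Δ = 2π / N
x = [(i - 1) * Δ for i in 1:N]
u = [sin(x[i]) for i in 1:N, j in 1:N, k in 1:N]
v = zeros(N, N, N)
w = zeros(N, N, N)
div_f = similar(u)
divergence!(div_f, u, v, w, Δ, Δ, Δ)

# The central difference of sin(x) is exactly cos(x) * sin(Δ) / Δ,
# so the difference below should be at floating-point round-off level.
expected = [cos(x[i]) * sin(Δ) / Δ for i in 1:N, j in 1:N, k in 1:N]
println(maximum(abs.(div_f .- expected)))
```

Comparing a GPU result copied back to the host against a reference like this is a cheap way to confirm the index mapping before looking at speedups.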
Something like this could work for the index calculation (threads per block and blocks per grid in each dimension):
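maxThreads = 1024  # max threads per block (CUDA compute capability 2.0 and newer)
Nx, Ny, Nz = size(f)
Tx = min(maxThreads, Nx)                # threads per block in x
Ty = min(fld(maxThreads, Tx), Ny)       # remaining thread budget goes to y...
Tz = min(fld(maxThreads, (Tx*Ty)), Nz)  # ...and then to z
Bx, By, Bz = cld(Nx, Tx), cld(Ny, Ty), cld(Nz, Tz) # Blocks in grid.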
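The arithmetic above can be wrapped in a function and checked on the CPU without a GPU. The sketch below is a hypothetical helper set (`launch_config`, `global_index`, `coverage` are illustrative names, not part of GPUifyLoops): it emulates the `(blockIdx - 1) * blockDim + threadIdx` mapping with 1-based indices, as CUDAnative uses, and counts how often each `(i, j, k)` is visited. It assumes the kernel skips threads whose index falls outside the array, which is needed whenever a dimension is not a multiple of its block size.

```julia
# Hypothetical helper wrapping the launch-configuration arithmetic.
function launch_config(Nx, Ny, Nz; maxThreads = 1024)
    Tx = min(maxThreads, Nx)
    Ty = min(fld(maxThreads, Tx), Ny)
    Tz = min(fld(maxThreads, Tx * Ty), Nz)
    threads = (Tx, Ty, Tz)
    blocks = (cld(Nx, Tx), cld(Ny, Ty), cld(Nz, Tz))
    return threads, blocks
end

# CPU emulation of one component of gpuIndex3D (1-based block and thread indices).
global_index(b, t, bdim) = (b - 1) * bdim + t

# Count how often each (i, j, k) is hit by the whole launch.
# Out-of-range threads are skipped, as an in-kernel bounds check would do.
function coverage(Nx, Ny, Nz; maxThreads = 1024)
    (Tx, Ty, Tz), (Bx, By, Bz) = launch_config(Nx, Ny, Nz; maxThreads = maxThreads)
    hits = zeros(Int, Nx, Ny, Nz)
    for bz in 1:Bz, by in 1:By, bx in 1:Bx, tz in 1:Tz, ty in 1:Ty, tx in 1:Tx
        i = global_index(bx, tx, Tx)
        j = global_index(by, ty, Ty)
        k = global_index(bz, tz, Tz)
        if i <= Nx && j <= Ny && k <= Nz
            hits[i, j, k] += 1
        end
    end
    return hits
end

threads, blocks = launch_config(128, 128, 128)
println(threads, " ", blocks)  # prints (128, 8, 1) (1, 16, 128)

# 30 is not a multiple of Ty = 8, so some threads fall outside and are skipped.
println(all(h -> h == 1, coverage(30, 30, 5; maxThreads = 256)))  # prints true
```

Filling x first keeps consecutive threads on consecutive `i`, which is the contiguous direction in Julia's column-major arrays; y and z only receive the leftover thread budget.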