
Looping over high-dimensional arrays

b-fg opened this issue on Nov 16, 2023 · 7 comments

Together with @weymouth, we are trying to write a kernel that loops over an n-dimensional array and applies a function to each element. While we can certainly do so, the speedup we observe when comparing the @kernel ("KA") and non-@kernel ("serial") implementations varies greatly depending on which array slice we access. This is probably related to Julia arrays being column-major, but the difference is striking here, and KA does not perform as well as the serial version.

Here is a simple MWE that demonstrates this. It has been run with `julia -t 1` to force a single thread and allow a fair comparison between the KA and serial implementations. An additional GPU test is included for comparison, where the same issue appears.

```julia
using CUDA
using KernelAbstractions
using BenchmarkTools

N = 256
u_cu = CUDA.zeros(N, N, N, 3);
u = Array(u_cu);
# Slices normal to each of the three spatial dimensions
R1 = CartesianIndices((1:1, 1:N, 1:N))
R2 = CartesianIndices((1:N, 1:1, 1:N))
R3 = CartesianIndices((1:N, 1:N, 1:1))

# Plain serial loop over the slice
test_serial(a, i, R) = for I ∈ R
    a[I, i] = √log1p(Float32(prod(I.I)))
end

# Equivalent KernelAbstractions kernel
@kernel function kern(a, @Const(i))
    I = @index(Global, Cartesian)
    a[I, i] = √log1p(Float32(prod(I.I)))
end
test_kernel(a, i, R) = kern(get_backend(a), 64)(a, i, ndrange=size(R))

@btime test_serial($u, 1, $R1)
@btime test_serial($u, 2, $R2)
@btime test_serial($u, 3, $R3)
@btime test_kernel($u, 1, $R1)
@btime test_kernel($u, 2, $R2)
@btime test_kernel($u, 3, $R3)
@btime CUDA.@sync test_kernel($u_cu, 1, $R1)
@btime CUDA.@sync test_kernel($u_cu, 2, $R2)
@btime CUDA.@sync test_kernel($u_cu, 3, $R3)
```

The timings are:

```
950.046 μs (0 allocations: 0 bytes)    # serial R1
928.321 μs (0 allocations: 0 bytes)    # serial R2
998.689 μs (0 allocations: 0 bytes)    # serial R3
5.500 ms (10 allocations: 256 bytes)   # KA CPU R1
1.045 ms (10 allocations: 256 bytes)   # KA CPU R2
991.688 μs (10 allocations: 256 bytes) # KA CPU R3
217.867 μs (42 allocations: 2.23 KiB)  # KA GPU R1
13.077 μs (42 allocations: 2.23 KiB)   # KA GPU R2
14.328 μs (42 allocations: 2.23 KiB)   # KA GPU R3
```

Is there something wrong with the MWE? Could this be done differently? It would be nice to learn more about this. Thanks!

b-fg · Nov 16, 2023

```julia
@kernel function kern()
    I = @index(Global, Cartesian)
    @show I
end
kern(CPU(), 64)(ndrange=(2, 3, 4))
```
```
I = CartesianIndex(1, 1, 1)
I = CartesianIndex(2, 1, 1)
I = CartesianIndex(1, 2, 1)
I = CartesianIndex(2, 2, 1)
I = CartesianIndex(1, 3, 1)
I = CartesianIndex(2, 3, 1)
I = CartesianIndex(1, 1, 2)
I = CartesianIndex(2, 1, 2)
I = CartesianIndex(1, 2, 2)
I = CartesianIndex(2, 2, 2)
I = CartesianIndex(1, 3, 2)
I = CartesianIndex(2, 3, 2)
I = CartesianIndex(1, 1, 3)
I = CartesianIndex(2, 1, 3)
I = CartesianIndex(1, 2, 3)
I = CartesianIndex(2, 2, 3)
I = CartesianIndex(1, 3, 3)
I = CartesianIndex(2, 3, 3)
I = CartesianIndex(1, 1, 4)
I = CartesianIndex(2, 1, 4)
I = CartesianIndex(1, 2, 4)
I = CartesianIndex(2, 2, 4)
I = CartesianIndex(1, 3, 4)
I = CartesianIndex(2, 3, 4)
```

Seems to follow the same iteration order as:

```julia
for i in CartesianIndices((2, 3, 4))
    @show i
end
```
```
i = CartesianIndex(1, 1, 1)
i = CartesianIndex(2, 1, 1)
i = CartesianIndex(1, 2, 1)
i = CartesianIndex(2, 2, 1)
i = CartesianIndex(1, 3, 1)
i = CartesianIndex(2, 3, 1)
i = CartesianIndex(1, 1, 2)
i = CartesianIndex(2, 1, 2)
i = CartesianIndex(1, 2, 2)
i = CartesianIndex(2, 2, 2)
i = CartesianIndex(1, 3, 2)
i = CartesianIndex(2, 3, 2)
i = CartesianIndex(1, 1, 3)
i = CartesianIndex(2, 1, 3)
i = CartesianIndex(1, 2, 3)
i = CartesianIndex(2, 2, 3)
i = CartesianIndex(1, 3, 3)
i = CartesianIndex(2, 3, 3)
i = CartesianIndex(1, 1, 4)
i = CartesianIndex(2, 1, 4)
i = CartesianIndex(1, 2, 4)
i = CartesianIndex(2, 2, 4)
i = CartesianIndex(1, 3, 4)
i = CartesianIndex(2, 3, 4)
```

vchuravy · Nov 16, 2023

One thing to note: your `kern(CPU(), 64)` is equivalent to a workgroup size of `(64, 1, 1)`.

So I am not surprised that `R1 = CartesianIndices((1:1,1:N,1:N))` is slow. For that I would expect you to need `(1, 64, 1)`.
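To make that concrete, here is a minimal sketch (the `test_kernel_wg` wrapper name is hypothetical, not from the thread) that reuses `kern`, `u`, and the ranges from the MWE above, passing a tuple workgroup size so the 64 work-items run along a dimension that is actually being swept:

```julia
# Hypothetical wrapper: the workgroupsize tuple has the same shape as ndrange,
# so the 64 work-items can be placed along a dimension the range actually sweeps.
test_kernel_wg(a, i, R, wg) = kern(get_backend(a), wg)(a, i, ndrange=size(R))

test_kernel_wg(u, 1, R1, (1, 64, 1))  # R1 sweeps dims 2 and 3
test_kernel_wg(u, 2, R2, (64, 1, 1))  # R2 sweeps dims 1 and 3
test_kernel_wg(u, 3, R3, (64, 1, 1))  # R3 sweeps dims 1 and 2
```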

vchuravy · Nov 16, 2023

Say what now? This isn't split up automatically?

weymouth · Nov 16, 2023

Well, I thought I had documented that clearly, but I can't seem to find it...

Take a look at: https://juliagpu.github.io/KernelAbstractions.jl/stable/examples/performance/

The workgroupsize is also a tuple, in which you provide the dimensions of the workgroup.

vchuravy · Nov 16, 2023

OK, that explains everything. We'll need to redo our macro to make the workgroup size adaptive. While I'm at it: is there a way to predict the best workgroup size to use?

weymouth · Nov 16, 2023

Understood @vchuravy! Thanks for clarifying.

b-fg · Nov 16, 2023

A bit more on this: it looks like if we try to evaluate the required workgroupsize at runtime using

```julia
workgroupsize = ntuple(j -> j == argmax(size(R)) ? 64 : 1, length(size(R)))
```

and then pass it to the kernel constructor, the resulting kernels are much slower. On the other hand, hardcoding it to `64` or `(64, 1, 1)` results in a much faster computation. Is there a specific reason why this might be happening?
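For reference, a quick check (with a hypothetical `wgsize` helper wrapping the expression above) shows that the runtime expression itself does yield the dimension-aligned tuples for the ranges in the MWE:

```julia
# Hypothetical helper around the runtime expression above
wgsize(R) = ntuple(j -> j == argmax(size(R)) ? 64 : 1, length(size(R)))

wgsize(CartesianIndices((1:1, 1:256, 1:256)))  # (1, 64, 1) for R1
wgsize(CartesianIndices((1:256, 1:1, 1:256)))  # (64, 1, 1) for R2
wgsize(CartesianIndices((1:256, 1:256, 1:1)))  # (64, 1, 1) for R3
```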

Edit:

Actually, I have seen that the macro that generates the kernel sometimes fails to produce the expected workgroupsize. For example, when `size(R) = (258, 258, 1)` it results in `(1, 1, 1)` (it should be `(64, 64, 1)`), and this is why it is slow. So I believe this is not a KA problem, but rather the way we are generating the kernels in our macro... cc @weymouth.

b-fg · Jan 15, 2024