KernelAbstractions.jl
Slow simple 2D copy kernel with Metal backend
Hi,
I'm trying KA for the first time and I'm wondering about the performance I get for a simple kernel that copies one 2D Float32 matrix into another (I know I could copy them as flat vectors):
using Metal
using KernelAbstractions
using Random
using BenchmarkTools
@kernel function copy2D_kernel!(b, a)
    i, j = @index(Global, NTuple)
    @inbounds b[i, j] = a[i, j]
end

function copy2D!(b, a)
    backend = get_backend(a)
    groupsize = KernelAbstractions.isgpu(backend) ? 256 : 1024
    kernel! = copy2D_kernel!(backend, groupsize)
    kernel!(b, a, ndrange=size(a))
end
function go()
    res = 2^14
    # creating initial cpu arrays
    a_cpu = rand(Float32, res, res)
    b_cpu = zeros(Float32, res, res)
    @info("size of a,b (GB) :", 2sizeof(a_cpu)/(1.e9))
    # creating initial gpu arrays
    a = MtlArray(a_cpu)
    b = MtlArray(b_cpu)
    backend = get_backend(a)
    gpu_elapsed = @belapsed begin
        copy2D!($b, $a)
        KernelAbstractions.synchronize($backend)
    end
    cpu_elapsed = @belapsed $a_cpu .= $b_cpu
    bandwidth_GBs(res, t, T) = sizeof(T)*res*res*2/(t*1.e9)
    @info(cpu_elapsed, bandwidth_GBs(res, cpu_elapsed, Float32))
    @info(gpu_elapsed, bandwidth_GBs(res, gpu_elapsed, Float32))
    nothing
end
And I obtain (MBP M1 Max) a simple CPU copy that is twice as fast as the KA GPU one...
┌ Info: size of a,b (GB) :
└   (2 * sizeof(a_cpu)) / 1.0e9 = 2.147483648
┌ Info: 0.022282291
└   bandwidth_GBs(res, cpu_elapsed, Float32) = 96.37625000050488
┌ Info: 0.047214875
└   bandwidth_GBs(res, gpu_elapsed, Float32) = 45.48320096156137
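For reference, the raw device copy bandwidth could be measured the same way inside go(), by timing Metal's built-in broadcast copy; comparing that number against the KA kernel would show how much of the gap comes from the kernel itself rather than from the hardware. A minimal sketch, using only the calls already shown above:

baseline_elapsed = @belapsed begin
    $b .= $a                                   # plain GPU broadcast copy
    KernelAbstractions.synchronize($backend)   # wait for the copy to finish
end
@info(baseline_elapsed, bandwidth_GBs(res, baseline_elapsed, Float32))  # place after bandwidth_GBs is defined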
Any hints?
Laurent
How do your benchmarks vary with groupsize and res? Are there regions in that space where the GPU is faster?
It looks rather stable for res in {2^15, 2^16} and groupsize in {128, 256, 512, 1024}.
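For completeness, such a sweep could look roughly like this (a sketch: copy2D_gs! is a hypothetical variant of copy2D! that takes the groupsize as an argument, and the parameter ranges are only illustrative, not the exact ones used):

bandwidth_GBs(res, t, T) = sizeof(T) * res * res * 2 / (t * 1.0e9)   # same helper as in go()

# hypothetical variant of copy2D! with an explicit groupsize
function copy2D_gs!(b, a, groupsize)
    backend = get_backend(a)
    kernel! = copy2D_kernel!(backend, groupsize)
    kernel!(b, a, ndrange=size(a))
end

for res in (2^14, 2^15), groupsize in (128, 256, 512, 1024)
    a = MtlArray(rand(Float32, res, res))
    b = MtlArray(zeros(Float32, res, res))
    backend = get_backend(a)
    t = @belapsed begin
        copy2D_gs!($b, $a, $groupsize)
        KernelAbstractions.synchronize($backend)
    end
    @info("res = $res, groupsize = $groupsize", bandwidth_GBs(res, t, Float32))
end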