Metal.jl
slow broadcast copy in 2D
The following code measures the throughput of copying one 2D square MtlArray (b) into another (a).
Performance is good (about 360 GB/s) with the hand-written kernel (the commented-out line) but poor (46.2 GB/s) with the broadcast expression (a .= b).
Note that since the last bug fix, the broadcast expression performs as well as the kernel copy for 1D arrays.
using Metal

# 2D copy kernel: each thread copies one element.
# There is no bounds check, so n must be a multiple of the threadgroup size.
function kernel_copy!(a, b)
    (i, j) = thread_position_in_grid_2d()
    @inbounds a[i, j] = b[i, j]
    return
end

function device_copy(n=2^14, nsample=10)
    a = MtlArray(rand(Float32, n, n))
    b = MtlArray(rand(Float32, n, n))

    threads = (32, 32)
    grid_size = cld.(n, threads)
    @show threads, grid_size

    ts = zeros(nsample)
    for i ∈ 1:nsample
        ts[i] = @elapsed Metal.@sync begin
            # @metal threads=threads grid=grid_size kernel_copy!(a, b)
            a .= b
        end
    end
    @assert Array(a) == Array(b)
    @show ts

    # Use the fastest sample so that compilation time in the first run is excluded.
    tmin = minimum(ts)
    size_in_bytes = 2 * length(a) * sizeof(Float32)  # 1 read + 1 write
    byte_per_ns = size_in_bytes / (tmin * 1.e9)      # bytes/ns is the same as GB/s
    println("GBs: $(round(byte_per_ns; digits=3))")

    # Clean up memory (is it necessary?)
    finalize(a)
    finalize(b)
end

device_copy()
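To make the reported figures easier to interpret, here is a minimal CPU-only sketch of the bandwidth arithmetic used above. The helper names (copy_bytes, bandwidth_gbs, copy_time_ms) are hypothetical and are not part of Metal.jl. The sketch assumes the same accounting as the benchmark: one read plus one write per element, and 1 GB = 1e9 bytes.

```julia
# Hypothetical helpers (not part of Metal.jl) mirroring the benchmark's accounting.

# Bytes moved by one copy of an n×n array: one read plus one write per element.
copy_bytes(n; T=Float32) = 2 * n^2 * sizeof(T)

# Effective bandwidth in GB/s (1 GB = 1e9 bytes) for a copy that takes `seconds`.
bandwidth_gbs(n, seconds; T=Float32) = copy_bytes(n; T=T) / (seconds * 1e9)

# Time in milliseconds that one copy would take at a given bandwidth in GB/s.
copy_time_ms(n, gbs; T=Float32) = copy_bytes(n; T=T) / (gbs * 1e9) * 1e3

n = 2^14
println("bytes per copy: ", copy_bytes(n))  # 2^31 = 2147483648
println("time at 360 GB/s:  ", round(copy_time_ms(n, 360.0); digits=2), " ms")
println("time at 46.2 GB/s: ", round(copy_time_ms(n, 46.2); digits=2), " ms")
```

For the default n = 2^14, each copy moves 2^31 bytes, so the two reported bandwidths correspond to roughly 6 ms per copy for the kernel and roughly 46 ms for the broadcast. That is a gap of nearly 8x.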