
slow broadcast copy in 2D

Open LaurentPlagne opened this issue 3 years ago • 0 comments

The following code measures the performance of copying between two square 2D MtlArrays, a and b. It performs well (about 360 GB/s) with the kernel version (the commented-out line) but poorly (46.2 GB/s) with the broadcast expression (a .= b).

Note that for 1D arrays the broadcast expression is OK (equivalent to the kernel copy) since the last bug fix.

```julia
using Metal

# Copy one element per thread. There is no bounds check, so this relies on
# the launch configuration covering exactly size(a).
function kernel_copy!(a, b)
    (i,j) = thread_position_in_grid_2d()
    @inbounds a[i,j] = b[i,j]
    return
end

function device_copy(n=2^14, nsample=10)

    a = MtlArray(rand(Float32, n, n))
    b = MtlArray(rand(Float32, n, n))

    threads = (32,32)
    grid_size = cld.(n, threads)
    @show threads,grid_size

    # Time each sample, synchronizing so the GPU work is included
    ts = zeros(nsample)
    for i ∈ 1:nsample
        ts[i] = @elapsed Metal.@sync begin
            # @metal threads=threads grid=grid_size kernel_copy!(a, b)
            a .= b
        end
    end

    @assert Array(a) == Array(b)

    @show ts
    tmin = minimum(ts)

    size_in_bytes = 2*length(a)*sizeof(Float32) # 1 read + 1 write per element
    byte_per_ns = size_in_bytes / (tmin*1.e9)   # bytes per ns, i.e. GB/s

    println("GBs: $(round(byte_per_ns; digits=3))")

    # Cleanup memory (is it necessary?)
    finalize(a)
    finalize(b)
end
device_copy()
```
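
For scale, here is a rough CPU-only sketch of what the two reported figures imply for the default n = 2^14. It assumes the same accounting as the script (one read plus one write per Float32 element) and takes the 360 and 46.2 GB/s values quoted above as inputs. The variable names are illustrative only and nothing here calls Metal.jl.

```julia
# Back-of-the-envelope check of the bandwidth figures (no GPU needed).
n = 2^14
bytes_moved = 2 * n^2 * sizeof(Float32)   # 1 read + 1 write per element

# Copy times implied by the two reported bandwidths (GB/s = 1e9 bytes/s)
t_kernel    = bytes_moved / 360e9         # seconds, kernel version
t_broadcast = bytes_moved / 46.2e9        # seconds, broadcast version

println("bytes moved: ", bytes_moved)
println("kernel:      ", round(t_kernel * 1e3; digits=2), " ms")
println("broadcast:   ", round(t_broadcast * 1e3; digits=2), " ms")
println("slowdown:    ", round(t_broadcast / t_kernel; digits=1), "x")
```

Each copy moves about 2.15 GB, so the gap between the two versions is tens of milliseconds per copy, well above what timing noise would explain.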

LaurentPlagne, Jul 11 '22 16:07