KernelAbstractions version of GPUArrays
Right, tests are passing locally, but there are still a bunch of small things to do:
- [x] Rebase / squash
- [ ] Cleanup (make sure the workflows rely on the appropriate branches for testing)
- [x] Docs
- [x] Corresponding PRs
  - [x] AMDGPU
  - [x] CUDA
  - [x] Metal
  - [x] oneAPI
- [x] Check for performance regressions
This PR supersedes #451 and should be ready for review next week. Just giving everyone a sneak peek now.
This PR breaks the interface to CUDA.jl, so the Buildkite tests will fail unless I point them at a specific CUDA.jl PR (working on that now). How do I do that with Buildkite?
Also: if this is merged, we might want to create a new release.
Maybe temporarily change the Pkg invocation during CI to pick up the accompanying back-end PRs: https://github.com/JuliaGPU/GPUArrays.jl/blob/462322603cae08520a2297798c026f3be259df12/.buildkite/pipeline.yml#L13
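Something along these lines would do, as a rough sketch (the `rev` below is just a placeholder for whatever the CUDA.jl companion branch ends up being called):

```julia
# Hypothetical tweak to the CI setup step: develop the matching CUDA.jl branch
# before running the tests, so the KA-based interface changes are picked up
# instead of the released CUDA.jl. The `rev` value is a placeholder.
using Pkg
Pkg.add(PackageSpec(url = "https://github.com/JuliaGPU/CUDA.jl", rev = "ka-transition"))
Pkg.instantiate()
```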
Ok, can't quite figure out the oneAPI test failures. It is breaking here: https://github.com/leios/GPUArrays.jl/blob/yoyoyo_rebase_time/test/testsuite/statistics.jl#L66 (and at a similar line above).
Here is the compare function:
```julia
function compare(f, AT::Type{<:AbstractGPUArray}, xs...; kwargs...)
    # copy on the CPU, adapt on the GPU, but keep Ref's
    cpu_in = map(x -> isa(x, Base.RefValue) ? x[] : deepcopy(x), xs)
    gpu_in = map(x -> isa(x, Base.RefValue) ? x[] : adapt(ArrayAdaptor{AT}(), x), xs)

    cpu_out = f(cpu_in...)
    gpu_out = f(gpu_in...)

    test_result(cpu_out, gpu_out; kwargs...)
end
```
and here are the relevant test_result functions:
```julia
function test_result(a::AbstractArray{T}, b::AbstractArray{T};
                     kwargs...) where {T<:NTuple{N,<:Number} where {N}}
    ET = eltype(T)
    ≈(reinterpret(ET, collect(a)), reinterpret(ET, collect(b)); kwargs...)
end

function test_result(as::NTuple{N,Any}, bs::NTuple{N,Any}; kwargs...) where {N}
    all(zip(as, bs)) do (a, b)
        test_result(a, b; kwargs...)
    end
end
```
So I took the bodies from these and just pasted them into the test:
```julia
@testset "cov" begin
    s = 100
    @test compare(cov, AT, rand(ET, s))
    @test compare(cov, AT, rand(ET, s, 2))

    # copy of `compare` contents
    f = A->cov(A; dims=2)
    rand_array = rand(ET, s, 2)
    cpu_in = map(x -> isa(x, Base.RefValue) ? x[] : deepcopy(x), rand_array)
    gpu_in = map(x -> isa(x, Base.RefValue) ? x[] : adapt(ArrayAdaptor{AT}(), x), rand_array)
    cpu_out = f(cpu_in)
    gpu_out = f(gpu_in)
    # test_result prints `true` here
    println("cov:", '\t', ET, '\t', AT, '\t', cpu_out == gpu_out, '\t',
            test_result(cpu_out, gpu_out))

    @test compare(A->cov(A; dims=2), AT, rand(ET, s, 2))
    if ET <: Real
        @test compare(cov, AT, rand(ET(1):ET(100), s))
    end
end
```
This returns true.
So now I don't understand why the compare(...) call isn't passing, while the same lines do pass when pasted directly into the test. Another interesting note: the tests all pass when the dummy array size is smaller (like 10).
Also: I am just testing this by throwing it against CI because I couldn't find an Intel GPU...
oneAPI passes the tests now. I don't know why. The best I can figure is that we are somehow no longer triggering https://github.com/JuliaGPU/oneAPI.jl/issues/442 (which seems to be hardware dependent)
Quick note on performance regressions: I ran all the tests on this branch (blue) and plotted them here against master (orange / red). In general, they run at the same speed. I also re-ran the cases where master was faster separately and found that those tests are still generally the same speed.
I can look into this in more detail by automating the process, but I think it might be better to come up with a specific test case where KA is almost certainly slower than the current GPUArrays DSL. It would be a good idea to list all the reasons why KA could be slow in an issue or something so we can tackle them.
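For the targeted test case, I was picturing something as simple as the sketch below, run once on this branch and once on master (BenchmarkTools-based; the array size and operations are arbitrary, and CUDA.jl is just the backend I'd run it on):

```julia
# Minimal hand-rolled comparison: time the same operations on this branch and
# on master, in a fresh environment for each.
using CUDA, BenchmarkTools

x = CUDA.rand(Float32, 2^22)
y = similar(x)

@btime CUDA.@sync $y .= 2f0 .* $x .+ 1f0   # goes through the GPUArrays broadcast machinery
@btime CUDA.@sync map!(abs, $y, $x)        # goes through GPUArrays' map! and the launch heuristic
```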
Ok, it seems like this branch works and is ready for review. There is some overall cleanup left, but I'll do that after.
Oh, I didn't realize this was ready for review. We should get this merged!
About the CUDA.jl branch, https://github.com/leios/CUDA.jl/commit/b085472a546a22ad08982079bf83d89e0f3b2e35, what is the reason this requires a separate CuArrayBackend?
You are right. All the XArrayBackends can be removed. Let me mess around and test locally.
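Just to spell out the direction (a sketch under the assumption that the GPUArrays kernels only need KernelAbstractions.get_backend on the destination array; the actual wiring in CUDA.jl may differ):

```julia
# Sketch of the idea: no dedicated CuArrayBackend type. CUDA.jl only tells KA
# which backend a CuArray lives on, and GPUArrays launches its @kernel code
# through whatever KA.get_backend(dest) returns. (This is the kind of
# definition the back-end package itself would provide.)
import KernelAbstractions as KA
using CUDA

KA.get_backend(::CuArray) = CUDABackend()
```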
The main thing that stalled the PR is that I couldn't figure out the CI on the CUDA side and got swamped with other things.
Now that we're past JuliaCon I should have the time to help out, so feel free to just list issues here.
In parallel, I'll be looking at packaging POCL so that we can hopefully move forwards on an improved CPU back-end for KA.jl too.
I was literally just about to create an issue in KA about that. I'll go ahead and rebase everything for these PRs.
Just rebased (I also had to revert the Enzyme stuff). All tests pass locally on AMDGPU. Could we rerun CI to make sure the errors are consistent against each backend's master / main?
So the main problem is with launch_heuristic and launch_configuration, as well as KernelAbstractions.launch_config. These kind of conflict with each other, and I'm not actually sure how to write the launch_heuristic function in CUDA.jl's gpuarrays.jl.
This version doesn't work because we need an ndrange when computing the config:
```julia
@inline function GPUArrays.launch_heuristic(::CUDABackend, f::F, args::Vararg{Any,N};
                                            elements::Int, elements_per_thread::Int) where {F,N}
    obj = f(CUDABackend())
    ndrange, workgroupsize, iterspace, dynamic = KA.launch_config(obj, nothing, nothing)

    # this might not be the final context, since we may tune the workgroupsize
    ctx = KA.mkcontext(obj, ndrange, iterspace)
    kernel = @cuda launch=false obj.f(ctx, args...)

    # launching many large blocks lowers performance, as observed with broadcast, so cap
    # the block size if we don't have a grid-stride kernel (which would keep the grid small)
    if elements_per_thread > 1
        launch_configuration(kernel.fun)
    else
        launch_configuration(kernel.fun; max_threads=256)
    end
end
```
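One untested idea, reusing only the calls already shown above, would be to hand launch_config the element count as a concrete 1D ndrange and only then build the context and compile:

```julia
# Untested sketch: use `elements` as a 1D ndrange so launch_config has
# something concrete to work with, then build the context and compile the
# kernel as before. This replaces the launch_config/mkcontext/@cuda lines above.
ndrange, workgroupsize, iterspace, dynamic = KA.launch_config(obj, elements, nothing)
ctx = KA.mkcontext(obj, ndrange, iterspace)
kernel = @cuda launch=false obj.f(ctx, args...)
```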
The tests passed earlier because we weren't calling the right launch_heuristic from CUDA.
Something is really wrong with Metal.jl's map! on this PR:
```julia
julia> map!(-, Metal.zeros(Float32, 1), Metal.ones(Float32, 1))
1-element MtlVector{Float32, Private}:
 -1.0

julia> map!(-, Metal.zeros(Float32, 2), Metal.ones(Float32, 2))
2-element MtlVector{Float32, Private}:
 0.0
 0.0

julia> Metal.zeros(Float32, 2) .= .-(Metal.ones(Float32, 2))
2-element MtlVector{Float32, Private}:
 -1.0
 -1.0
```
It doesn't seem launch configuration related because I can see 2 threads being launched here, as expected.
map! goes through launch_heuristic on Metal's side, so maybe I really messed up the logic somewhere? AMDGPU doesn't define launch_heuristic and instead falls back to the default one defined here (in GPUArrays), which seems to be fine:
```julia
julia> map!(-, AMDGPU.zeros(Float32, 1), AMDGPU.ones(Float32, 1))
1-element ROCArray{Float32, 1, AMDGPU.Runtime.Mem.HIPBuffer}:
 -1.0

julia> map!(-, AMDGPU.zeros(Float32, 2), AMDGPU.ones(Float32, 2))
2-element ROCArray{Float32, 1, AMDGPU.Runtime.Mem.HIPBuffer}:
 -1.0
 -1.0

julia> AMDGPU.zeros(Float32, 2) .= .-(AMDGPU.ones(Float32, 2))
2-element ROCArray{Float32, 1, AMDGPU.Runtime.Mem.HIPBuffer}:
 -1.0
 -1.0
```
I might be tired, but those look right to me. I just pushed a commit to Metal that removes launch_heuristic. Could you try running the tests again with that branch? https://github.com/JuliaGPU/Metal.jl/pull/328
> I just pushed a commit to Metal that removes launch_heuristic. Could you try running the tests again with that branch? JuliaGPU/Metal.jl#328
That doesn't help, as the launch configuration seems correct: config = (threads = 2, blocks = 1, elements_per_thread = 1), so the launch heuristic isn't to blame (and to be sure, I made it fall back to the GPUArrays definition, which didn't change a thing).
This looks like a miscompilation in Metal.jl. I'll investigate. As a workaround: removing the for i in 1:nelem seems to help, so maybe we should get rid of the grid-stride processing here already. elements_per_thread is always going to be 1 anyway, and it would simplify the "scary" indexing in the kernel ((J-1)*nelem + j). It would probably be best to add proper grid-stride indexing support to KA.jl before revisiting this.
Actually, found another workaround: add @inbounds to the CartesianIndices constructor.
Just to be clear, there are 2 options:
1. Keep the grid-stride indexing but annotate it: `@inbounds J_c = CartesianIndices(axes(bc))[(J-1)*nelem + j]`
2. Remove the `launch_heuristic` / `elements_per_thread` approach.

You are in favor of 2, with the hope of tackling grid-stride indexing in KA down the road?
Yeah. As I mentioned on Slack, I haven't noticed much performance improvement from grid-stride loops here, and seeing how they uglify the indexing, I'd prefer we just get rid of them now until KA.jl properly supports them.
In the cases where we do need manual CartesianIndex construction, though, it should be annotated with @inbounds, not to avoid the Metal.jl compiler bug but to avoid the unnecessary trap branch.
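Roughly what that would look like in a broadcast-style kernel (a paraphrase of the kernel shape discussed above, not the exact GPUArrays source):

```julia
using KernelAbstractions

# Paraphrased broadcast-style kernel: one output element per work-item (the
# grid-stride loop removed), with the CartesianIndices lookup marked @inbounds
# so the bounds check and its trap branch are elided.
@kernel function broadcast_kernel!(dest, bc)
    j = @index(Global, Linear)
    @inbounds begin
        J = CartesianIndices(axes(bc))[j]
        dest[J] = bc[J]
    end
end
```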
Running out of time to keep debugging this today, so I'll just write everything down.
After removing the launch_heuristic and elements_per_thread approach, I am getting errors with map for certain array sizes, and also in minimum!, maximum! and extrema! for certain shapes. The broadcast-related issue is weird:
```
BoundsError: attempt to access Base.Broadcast.Broadcasted{JLArrays.JLArrayStyle{2}, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}}, typeof(+), Tuple{Base.Broadcast.Extruded{JLArrays.JLDeviceArray{ComplexF64, 2}, Tuple{Bool, Bool}, Tuple{Int64, Int64}}}} at index [1]
```
So it's failing to access a Broadcasted of size (2,2) at index 1. Note that the same lines in the REPL do work:
```julia
map!(+, AMDGPU.zeros(Float32, 10,10), AMDGPU.ones(Float32, 10,1))
```
So the errors only appear when running ] test for broadcasting.
The other issue is with minimum!, maximum! and extrema! for certain array sizes. Again, I am struggling to pinpoint the issue exactly, but it happens when sz = (10, 10, 10) and red = (1, 10, 10). It doesn't break for other array sizes; to be clear, sz = (10, 10, 10) with red = (1, 1, 1) seems to work.
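For reference, a rough standalone reproduction of that failing shape outside the testsuite (shapes follow the sz / red values above; AMDGPU is just the backend I've been testing on):

```julia
# Rough standalone reproduction of the failing reduction shape: reduce the
# first dimension of a (10, 10, 10) array into a (1, 10, 10) destination.
using AMDGPU

A   = AMDGPU.rand(Float32, 10, 10, 10)   # sz  = (10, 10, 10)
dst = AMDGPU.zeros(Float32, 1, 10, 10)   # red = (1, 10, 10)
minimum!(dst, A)
```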
Long story short: there's something wrong with map!. I'm sure it's just a stupid typo that I can't quite pick out right now:
```julia
function Base.map!(f, dest::AnyGPUArray, xs::AbstractArray...)
    # custom broadcast, ignoring the container size mismatches
    # (avoids the reshape + view that our mapreduce impl has to do)
    indices = LinearIndices.((dest, xs...))
    common_length = minimum(length.(indices))
    common_length==0 && return

    bc = Broadcast.instantiate(Broadcast.broadcasted(f, xs...))
    if bc isa Broadcast.Broadcasted
        bc = Broadcast.preprocess(dest, bc)
    end

    # grid-stride kernel
    @kernel function map_kernel(dest, bc)
        j = @index(Global, Linear)
        @inbounds dest[j] = bc[j]
    end

    kernel = map_kernel(get_backend(dest))
    config = KernelAbstractions.launch_config(kernel, common_length, nothing)
    kernel(dest, bc; ndrange = config[1], workgroupsize = config[2])

    if eltype(dest) <: BrokenBroadcast
        throw(ArgumentError("Map operation resulting in $(eltype(eltype(dest))) is not GPU compatible"))
    end

    return dest
end
```
FYI, although I haven't had the time to look into this further, the Metal miscompilation has been fixed.
Oh, great! Tbh, I had to put this on hold for August because our daycare is closed this month and I'm juggling childcare duties. I should be able to pick it back up in September.
I couldn't quite get https://github.com/JuliaGPU/GPUArrays.jl/pull/525/commits/00c8dd4912c5d1c4c4260f0b8baecf71647d52ad to work, so I reverted it to check whether Metal would build.
It seems like CUDA and AMDGPU both pass now (locally), but I'm not sure what's going on with Metal and oneAPI.
The launch_heuristic / launch_configuration situation is not something that can stay, though. I have some time tomorrow and next week, so I can have a look if you want.
https://github.com/JuliaGPU/GPUArrays.jl/pull/559 is looking good (all CI green), so let's close this branch in favor of it. Huge thanks to @leios for doing the grunt work!
Congratulations @leios !