KernelAbstractions version of GPUArrays
Right, tests are passing locally, but there are still a bunch of small things to do:
- [x] Rebase / squash
- [ ] Cleanup (make sure the workflows rely on the appropriate branches for testing)
- [x] Docs
- [x] Corresponding PRs
  - [x] AMDGPU
  - [x] CUDA
  - [x] Metal
  - [x] oneAPI
- [x] Check for performance regressions
This PR supersedes #451 and should be ready for review next week. Just giving everyone a sneak peek now.
This PR breaks the interface to CUDA.jl, so the Buildkite tests will fail unless I point them at a specific CUDA.jl PR (working on that now). How do I do that with Buildkite?
Also: if this is merged, we might want to create a new release.
Maybe temporarily change the Pkg invocation during CI to pick up the accompanying back-end PRs: https://github.com/JuliaGPU/GPUArrays.jl/blob/462322603cae08520a2297798c026f3be259df12/.buildkite/pipeline.yml#L13
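Something along these lines would do, as a rough sketch (the `rev` below is just a placeholder for whatever the CUDA.jl companion branch ends up being called):

```julia
# Hypothetical tweak to the CI setup step: develop the matching CUDA.jl branch
# before running the tests, so the KA-based interface changes are picked up
# instead of the released CUDA.jl. The `rev` value is a placeholder.
using Pkg
Pkg.add(PackageSpec(url = "https://github.com/JuliaGPU/CUDA.jl", rev = "ka-transition"))
Pkg.instantiate()
```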
Ok, can't quite figure out the oneAPI test failures. It is breaking here: https://github.com/leios/GPUArrays.jl/blob/yoyoyo_rebase_time/test/testsuite/statistics.jl#L66 (and at a similar line above).
Here is the compare function:
```julia
function compare(f, AT::Type{<:AbstractGPUArray}, xs...; kwargs...)
    # copy on the CPU, adapt on the GPU, but keep Ref's
    cpu_in = map(x -> isa(x, Base.RefValue) ? x[] : deepcopy(x), xs)
    gpu_in = map(x -> isa(x, Base.RefValue) ? x[] : adapt(ArrayAdaptor{AT}(), x), xs)

    cpu_out = f(cpu_in...)
    gpu_out = f(gpu_in...)

    test_result(cpu_out, gpu_out; kwargs...)
end
```
and here are the relevant test_result functions:
```julia
function test_result(a::AbstractArray{T}, b::AbstractArray{T};
                     kwargs...) where {T<:NTuple{N,<:Number} where {N}}
    ET = eltype(T)
    ≈(reinterpret(ET, collect(a)), reinterpret(ET, collect(b)); kwargs...)
end

function test_result(as::NTuple{N,Any}, bs::NTuple{N,Any}; kwargs...) where {N}
    all(zip(as, bs)) do (a, b)
        test_result(a, b; kwargs...)
    end
end
```
So I took the bodies from these and just pasted them into the test:
```julia
@testset "cov" begin
    s = 100
    @test compare(cov, AT, rand(ET, s))
    @test compare(cov, AT, rand(ET, s, 2))

    # copy of `compare` contents
    f = A->cov(A; dims=2)
    rand_array = rand(ET, s, 2)
    cpu_in = map(x -> isa(x, Base.RefValue) ? x[] : deepcopy(x), rand_array)
    gpu_in = map(x -> isa(x, Base.RefValue) ? x[] : adapt(ArrayAdaptor{AT}(), x), rand_array)
    cpu_out = f(cpu_in)
    gpu_out = f(gpu_in)
    # test_result prints `true` here
    println("cov:", '\t', ET, '\t', AT, '\t', cpu_out == gpu_out, '\t',
            test_result(cpu_out, gpu_out))

    @test compare(A->cov(A; dims=2), AT, rand(ET, s, 2))
    if ET <: Real
        @test compare(cov, AT, rand(ET(1):ET(100), s))
    end
end
```
This returns true.
So now I don't understand why the compare(...) call isn't passing, while the same lines do pass when pasted directly into the test. Another interesting note: the tests all pass when the dummy array size is smaller (like 10).
Also: I am just testing this by throwing it against CI because I couldn't find an Intel GPU...
oneAPI passes the tests now. I don't know why. The best I can figure is that we are somehow no longer triggering https://github.com/JuliaGPU/oneAPI.jl/issues/442 (which seems to be hardware dependent)
Quick note on performance regressions: I ran all the tests on this branch (blue) and plotted them here against master (orange / red). In general, they run at the same speed. I also re-ran the cases where master was faster separately and found that those tests are still generally the same speed.
I can look into this in more detail by automating the process, but I think it might be better to come up with a specific test case where KA is almost certainly slower than the current GPUArrays DSL. It would be a good idea to list all the reasons why KA could be slow in an issue or something so we can tackle them.
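For the targeted test case, I was picturing something as simple as the sketch below, run once on this branch and once on master (BenchmarkTools-based; the array size and operations are arbitrary, and CUDA.jl is just the backend I'd run it on):

```julia
# Minimal hand-rolled comparison: time the same operations on this branch and
# on master, in a fresh environment for each.
using CUDA, BenchmarkTools

x = CUDA.rand(Float32, 2^22)
y = similar(x)

@btime CUDA.@sync $y .= 2f0 .* $x .+ 1f0   # goes through the GPUArrays broadcast machinery
@btime CUDA.@sync map!(abs, $y, $x)        # goes through GPUArrays' map! and the launch heuristic
```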
Ok, it seems like this branch works and is ready for review. There is some overall cleanup left, but I'll do that after.
Oh, I didn't realize this was ready for review. We should get this merged!
About the CUDA.jl branch, https://github.com/leios/CUDA.jl/commit/b085472a546a22ad08982079bf83d89e0f3b2e35, what is the reason this requires a separate CuArrayBackend?
You are right. All the XArrayBackends can be removed. Let me mess around and test locally.
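Just to spell out the direction (a sketch under the assumption that the GPUArrays kernels only need KernelAbstractions.get_backend on the destination array; the actual wiring in CUDA.jl may differ):

```julia
# Sketch of the idea: no dedicated CuArrayBackend type. CUDA.jl only tells KA
# which backend a CuArray lives on, and GPUArrays launches its @kernel code
# through whatever KA.get_backend(dest) returns. (This is the kind of
# definition the back-end package itself would provide.)
import KernelAbstractions as KA
using CUDA

KA.get_backend(::CuArray) = CUDABackend()
```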
The main thing that stalled the PR is that I couldn't figure out the CI on the CUDA side and got swamped with other things.
Now that we're past JuliaCon I should have the time to help out, so feel free to just list issues here.
In parallel, I'll be looking at packaging POCL so that we can hopefully move forwards on an improved CPU back-end for KA.jl too.
I was literally just about to create an issue in KA about that. I'll go ahead and rebase everything for these PRs.
Just rebased (I also had to revert the Enzyme stuff). All tests pass locally on AMDGPU. Could we rerun CI to make sure the errors are consistent against each backend's master / main?
So the main problem is with launch_heuristic and launch_configuration, as well as KernelAbstractions.launch_config. These kind of conflict with each other, and I'm not actually sure how to write the launch_heuristic function in CUDA.jl's gpuarrays.jl.
This version doesn't work because we need an ndrange when computing the config:
```julia
@inline function GPUArrays.launch_heuristic(::CUDABackend, f::F, args::Vararg{Any,N};
                                            elements::Int, elements_per_thread::Int) where {F,N}
    obj = f(CUDABackend())
    ndrange, workgroupsize, iterspace, dynamic = KA.launch_config(obj, nothing, nothing)

    # this might not be the final context, since we may tune the workgroupsize
    ctx = KA.mkcontext(obj, ndrange, iterspace)
    kernel = @cuda launch=false obj.f(ctx, args...)

    # launching many large blocks lowers performance, as observed with broadcast, so cap
    # the block size if we don't have a grid-stride kernel (which would keep the grid small)
    if elements_per_thread > 1
        launch_configuration(kernel.fun)
    else
        launch_configuration(kernel.fun; max_threads=256)
    end
end
```
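One untested idea, reusing only the calls already shown above, would be to hand launch_config the element count as a concrete 1D ndrange and only then build the context and compile:

```julia
# Untested sketch: use `elements` as a 1D ndrange so launch_config has
# something concrete to work with, then build the context and compile the
# kernel as before. This replaces the launch_config/mkcontext/@cuda lines above.
ndrange, workgroupsize, iterspace, dynamic = KA.launch_config(obj, elements, nothing)
ctx = KA.mkcontext(obj, ndrange, iterspace)
kernel = @cuda launch=false obj.f(ctx, args...)
```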
The tests passed earlier because we weren't calling the right launch_heuristic from CUDA.
Something is really wrong with Metal.jl's map! on this PR:
```julia
julia> map!(-, Metal.zeros(Float32, 1), Metal.ones(Float32, 1))
1-element MtlVector{Float32, Private}:
 -1.0

julia> map!(-, Metal.zeros(Float32, 2), Metal.ones(Float32, 2))
2-element MtlVector{Float32, Private}:
 0.0
 0.0

julia> Metal.zeros(Float32, 2) .= .-(Metal.ones(Float32, 2))
2-element MtlVector{Float32, Private}:
 -1.0
 -1.0
```
It doesn't seem launch configuration related because I can see 2 threads being launched here, as expected.
map! goes through launch_heuristic on Metal's side, so maybe I really messed up the logic somewhere? AMDGPU doesn't define launch_heuristic and instead falls back to the default one defined here (in GPUArrays), which seems to be fine:
```julia
julia> map!(-, AMDGPU.zeros(Float32, 1), AMDGPU.ones(Float32, 1))
1-element ROCArray{Float32, 1, AMDGPU.Runtime.Mem.HIPBuffer}:
 -1.0

julia> map!(-, AMDGPU.zeros(Float32, 2), AMDGPU.ones(Float32, 2))
2-element ROCArray{Float32, 1, AMDGPU.Runtime.Mem.HIPBuffer}:
 -1.0
 -1.0

julia> AMDGPU.zeros(Float32, 2) .= .-(AMDGPU.ones(Float32, 2))
2-element ROCArray{Float32, 1, AMDGPU.Runtime.Mem.HIPBuffer}:
 -1.0
 -1.0
```
I might be tired, but those look right to me. I just pushed a commit to Metal that removes launch_heuristic. Could you try running the tests again with that branch? https://github.com/JuliaGPU/Metal.jl/pull/328
> I just pushed a commit to Metal that removes launch_heuristic. Could you try running the tests again with that branch? JuliaGPU/Metal.jl#328
That doesn't help, as the launch configuration seems correct: config = (threads = 2, blocks = 1, elements_per_thread = 1), so the launch heuristic isn't to blame (and to be sure, I made it fall back to the GPUArrays definition, which didn't change a thing).
This looks like a miscompilation in Metal.jl. I'll investigate. As a workaround: removing the for i in 1:nelem seems to help, so maybe we should get rid of the grid-stride processing here already. elements_per_thread is always going to be 1 anyway, and it would simplify the "scary" indexing in the kernel ((J-1)*nelem + j). It would probably be best to add proper grid-stride indexing support to KA.jl before revisiting this.
Actually, found another workaround: add @inbounds to the CartesianIndices constructor.
Just to be clear, there are 2 options:
1. Keep the grid-stride indexing but annotate it: `@inbounds J_c = CartesianIndices(axes(bc))[(J-1)*nelem + j]`
2. Remove the `launch_heuristic` / `elements_per_thread` approach.

You are in favor of 2, with the hope of tackling grid-stride indexing in KA down the road?
Yeah. As I mentioned on Slack, I haven't noticed much performance improvement from grid-stride loops here, and seeing how they uglify the indexing, I'd prefer we just get rid of them now until KA.jl properly supports them.
In the cases where we do need manual CartesianIndex construction, though, it should be annotated with @inbounds, not to avoid the Metal.jl compiler bug but to avoid the unnecessary trap branch.
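Roughly what that would look like in a broadcast-style kernel (a paraphrase of the kernel shape discussed above, not the exact GPUArrays source):

```julia
using KernelAbstractions

# Paraphrased broadcast-style kernel: one output element per work-item (the
# grid-stride loop removed), with the CartesianIndices lookup marked @inbounds
# so the bounds check and its trap branch are elided.
@kernel function broadcast_kernel!(dest, bc)
    j = @index(Global, Linear)
    @inbounds begin
        J = CartesianIndices(axes(bc))[j]
        dest[J] = bc[J]
    end
end
```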
Running out of time to keep debugging this today, so I'll just write everything down.
After removing the launch_heuristic and elements_per_thread approach, I am getting errors with map for certain array sizes, and also in minimum!, maximum! and extrema! for certain shapes. The broadcast-related issue is weird:
```
BoundsError: attempt to access Base.Broadcast.Broadcasted{JLArrays.JLArrayStyle{2}, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}}, typeof(+), Tuple{Base.Broadcast.Extruded{JLArrays.JLDeviceArray{ComplexF64, 2}, Tuple{Bool, Bool}, Tuple{Int64, Int64}}}} at index [1]
```
So it's failing to access a Broadcasted of size (2,2) at index 1. Note that the same lines in the REPL do work:
```julia
map!(+, AMDGPU.zeros(Float32, 10,10), AMDGPU.ones(Float32, 10,1))
```
So the errors only appear when running ] test for broadcasting.
The other issue is with minimum!, maximum! and extrema! for certain array sizes. Again, I am struggling to pinpoint the issue exactly, but it happens when sz = (10, 10, 10) and red = (1, 10, 10). It doesn't break for other array sizes; to be clear, sz = (10, 10, 10) with red = (1, 1, 1) seems to work.
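For reference, a rough standalone reproduction of that failing shape outside the testsuite (shapes follow the sz / red values above; AMDGPU is just the backend I've been testing on):

```julia
# Rough standalone reproduction of the failing reduction shape: reduce the
# first dimension of a (10, 10, 10) array into a (1, 10, 10) destination.
using AMDGPU

A   = AMDGPU.rand(Float32, 10, 10, 10)   # sz  = (10, 10, 10)
dst = AMDGPU.zeros(Float32, 1, 10, 10)   # red = (1, 10, 10)
minimum!(dst, A)
```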
Long story short: there's something wrong with map!. I'm sure it's just a stupid typo that I can't quite pick out right now:
```julia
function Base.map!(f, dest::AnyGPUArray, xs::AbstractArray...)
    # custom broadcast, ignoring the container size mismatches
    # (avoids the reshape + view that our mapreduce impl has to do)
    indices = LinearIndices.((dest, xs...))
    common_length = minimum(length.(indices))
    common_length==0 && return

    bc = Broadcast.instantiate(Broadcast.broadcasted(f, xs...))
    if bc isa Broadcast.Broadcasted
        bc = Broadcast.preprocess(dest, bc)
    end

    # grid-stride kernel
    @kernel function map_kernel(dest, bc)
        j = @index(Global, Linear)
        @inbounds dest[j] = bc[j]
    end

    kernel = map_kernel(get_backend(dest))
    config = KernelAbstractions.launch_config(kernel, common_length, nothing)
    kernel(dest, bc; ndrange = config[1], workgroupsize = config[2])

    if eltype(dest) <: BrokenBroadcast
        throw(ArgumentError("Map operation resulting in $(eltype(eltype(dest))) is not GPU compatible"))
    end

    return dest
end
```
FYI, although I haven't had the time to look into this further, the Metal miscompilation has been fixed.
Oh, great! Tbh, I had to put this on hold for August because our daycare is closed this month and I'm juggling childcare duties. I should be able to pick it back up in September.
I couldn't quite get https://github.com/JuliaGPU/GPUArrays.jl/pull/525/commits/00c8dd4912c5d1c4c4260f0b8baecf71647d52ad to work, so I reverted it to check whether Metal would build.
It seems like CUDA and AMDGPU both pass now (locally), but I'm not sure what's going on with Metal and oneAPI.
The launch_heuristic / launch_configuration situation is not something that can stay, though. I have some time tomorrow and next week, so I can have a look if you want.
https://github.com/JuliaGPU/GPUArrays.jl/pull/559 is looking good (all CI green), so let's close this branch in favor of it. Huge thanks to @leios for doing the grunt work!
Congratulations @leios !