
make extract_gradient[_chunk]! GPU compatible

Open marius311 opened this issue 2 years ago • 25 comments

Fixes this:

ForwardDiff.gradient(sum, cu(rand(10))) # Error: scalar indexing

My ad hoc tests suggest no performance impact on 1.8.3 for result::Vector; however, there is an extra (2 allocations: 80 bytes) for result::Matrix, I think due to a reshape/view interaction.

I'm not sure if there's some other form of the code that works without scalar indexing and without this allocation, but I'm open to suggestions. Another alternative is to write some glue code here or in CUDA.jl that handles CuArray specifically, but that seems more brittle.

I'd vote that this solution is OK (pending other feedback) for the benefit of CUDA compatibility with generic code, but that's just my opinion.
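For context, a hedged sketch of the loop-style extraction pattern that triggers the scalar-indexing error on GPU arrays (the function name and arguments here are illustrative, not the actual ForwardDiff internals):

```julia
# Each `result[i + offset] = ...` below is a scalar write; CUDA.jl disallows
# scalar indexing on CuArrays, which is why the generic loop errors on GPU.
function extract_chunk_loop!(result, partials_vals, index, chunksize)
    offset = index - 1
    for i in 1:chunksize
        result[i + offset] = partials_vals[i]  # scalar indexing: errors on CuArray
    end
    return result
end

r = zeros(6)
extract_chunk_loop!(r, [1.0, 2.0, 3.0], 2, 3)
# r is now [0.0, 1.0, 2.0, 3.0, 0.0, 0.0]
```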

marius311 avatar Dec 20 '22 20:12 marius311

Codecov Report

Base: 86.83% // Head: 89.74% // Increases project coverage by +2.90% :tada:

Coverage data is based on head (9dc9769) compared to base (6a19554). Patch coverage: 100.00% of modified lines in pull request are covered.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #619      +/-   ##
==========================================
+ Coverage   86.83%   89.74%   +2.90%     
==========================================
  Files           9       10       +1     
  Lines         889      965      +76     
==========================================
+ Hits          772      866      +94     
+ Misses        117       99      -18     
Impacted Files Coverage Δ
src/gradient.jl 98.92% <100.00%> (+0.06%) :arrow_up:
src/apiutils.jl 100.00% <0.00%> (ø)
src/ForwardDiff.jl 100.00% <0.00%> (ø)
src/jacobian.jl 99.34% <0.00%> (+0.03%) :arrow_up:
src/derivative.jl 94.87% <0.00%> (+0.13%) :arrow_up:
src/partials.jl 84.34% <0.00%> (+0.13%) :arrow_up:
src/config.jl 87.27% <0.00%> (+0.23%) :arrow_up:
src/dual.jl 82.77% <0.00%> (+3.01%) :arrow_up:
src/prelude.jl 93.93% <0.00%> (+56.43%) :arrow_up:



codecov-commenter avatar Dec 20 '22 20:12 codecov-commenter

Added a barebones set of JLArray tests.

marius311 avatar Dec 20 '22 22:12 marius311

OK, I found the magic code that gives no allocation or speed regressions vs. current master for Arrays of any dimension, and that works with CuArrays. It's this:

map!(view(Base.ReshapedArray(result, (length(result),), ()), index:index+chunksize-1), 1:chunksize) do i
    @inbounds partials(T, dual, i)
end

The trick is indeed, as you suggested @mcabbott, to use a ReshapedArray, which unlike reshape is non-allocating for Arrays (TIL). And CUDA.jl is smart enough to dispatch on view(ReshapedArray(CuArray)) and do its thing.
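A quick self-contained check of the trick on a plain Array (variable names are illustrative): Base.ReshapedArray gives a flat view of a Matrix without the wrapper allocation that reshape incurs, and writes through the view land in the parent.

```julia
# Base.ReshapedArray(parent, dims, mi) is the internal, non-allocating
# counterpart of reshape; the empty tuple is the "multiplicative inverses"
# argument, which is unused for Arrays with fast linear indexing.
A = rand(10, 2)
flat = Base.ReshapedArray(A, (length(A),), ())  # flat view, no copy
fill!(view(flat, 3:7), 0.0)                     # write through the view
# A[3:7] (linear indexing into the Matrix) is now all zeros
```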

Squashed into a single commit. Ready on my end.

marius311 avatar Dec 22 '22 08:12 marius311

I don't want this to get lost after the holidays; could it be looked at again?

marius311 avatar Jan 03 '23 07:01 marius311

adding definitions of copyto! with Tuple to CUDA

That seems like a stretch to me with CUDA, tbh. Although these things aren't specified perfectly, copyto! seems aimed at being an array interface, not really one for tuples. I would expect such a PR not to be accepted in CUDA, but we can try.

In the meantime, this PR has no regressions but enables GPU support, so can it still be merged? We can revisit later if we get the CUDA interface.

marius311 avatar Jan 05 '23 21:01 marius311

Here's an issue with the copyto! option. How do you implement extract_gradient_chunk!(::Type{T}, result::AbstractArray, y::Real, index, chunksize) so that it's inferred and allocation-free for Arrays? The closest I can come up with is:

copyto!(result, index, ntuple(i->0, Val(chunksize)), 1, chunksize)

which is not inferred, although the Julia compiler must work some magic because it's allocation-free. But it's 2X slower than my solution in this PR.

So, in lieu of a solution, it seems that even if we get copyto!(::CuArray, ::Tuple) (which, FWIW, was suggested on Slack to be possible), it's not enough to write the code here generically.

marius311 avatar Jan 24 '23 19:01 marius311

I should also note that one nice thing about the copyto! option is that it would allow the specialized copyto!(::CuArray, ::Tuple) to pass the tuple data into the kernel by reference, as opposed to by value, which is what's currently happening (and which has a 4K memory limit, as pointed out on Slack). I'm not sure that's a real worry except for chunk sizes so large they'd be unrealistic, but I wanted to point it out.

But in any case, I can't figure out how to do it with only copyto! while avoiding regressions, cf. the comment just above.

marius311 avatar Jan 25 '23 20:01 marius311

How do you implement extract_gradient_chunk!(::Type{T}, result::AbstractArray, y::Real, index, chunksize) so that it's inferred and allocation-free for Arrays?

With fill! as done in the definition of extract_gradient! for y::Real, I would assume?

devmotion avatar Jan 25 '23 20:01 devmotion

Thanks, but fill! doesn't have a variant with an offset/chunk size, unless I'm mistaken?

marius311 avatar Jan 25 '23 20:01 marius311

Thanks, but fill! doesn't have a variant with an offset/chunk size, unless I'm mistaken?

No, not that I know of. But couldn't you fill! a view?

I still think it would be nice if we didn't have to use slightly unorthodox implementations just to add implicit support for another package.

To mention the obvious: on Julia >= 1.9, yet another alternative would be to add e.g. GPUArraysCore (I guess that might be sufficient?) as a weak dependency and add a special method for GPU arrays there. The disadvantage would be that it would not fix Julia < 1.9.

devmotion avatar Jan 25 '23 20:01 devmotion

No, not that I know of. But couldn't you fill! a view?

No, that's allocating for multidimensional Arrays, which is the entire issue that led to the solution currently in this PR.

marius311 avatar Jan 25 '23 20:01 marius311

I still think that it would be nice if we wouldn't have to use slightly unorthodox implementations, just to be able to add implicit support for another package.

Yeah, I'm sympathetic to that, also because of the by-value thing. But ultimately, although the code looks a little weird, in the case of Array it's purely relying on the getindex/setindex! interface, and in the case of CuArrays, on the map! interface, which are definitely very orthodox.

marius311 avatar Jan 25 '23 20:01 marius311

the map! interface, which are definitely very orthodox.

Oh, by unorthodox I was rather referring to the fact that the PR currently removes a simple copyto! statement and adds a workaround based on ReshapedArray.

devmotion avatar Jan 25 '23 20:01 devmotion

No, I get that. All I'm saying is that it's unorthodox in its use of ReshapedArray, sure, but in terms of the underlying native code for Arrays, it compiles to nearly identical code as before this PR, since both this and the old for-loop ultimately end up at setindex!(::Array, ..).

marius311 avatar Jan 25 '23 20:01 marius311

No, that's allocating for multidimensional Arrays, which is the entire issue that led to the solution currently in this PR.

We could use fill! + Base.ReshapedArray :smile: That seems to be faster than the current for-loop:

julia> function h2(::Type{T}, result, x, index, chunksize) where {T}
           fill!(view(Base.ReshapedArray(result, (length(result),), ()), index:(index + chunksize - 1)), zero(x))
           return result
       end
h2 (generic function with 1 method)

julia> @btime h2(Float64, $(rand(100, 2)), 0f0, 20, 15);
  2.570 ns (0 allocations: 0 bytes)

julia> function h3(::Type{T}, result, x, index, chunksize) where {T}
           offset = index - 1
           for i in 1:chunksize
               result[i + offset] = zero(x)
           end
           return result
       end
h3 (generic function with 1 method)

julia> @btime h3(Float64, $(rand(100, 2)), 0f0, 20, 15);
  5.338 ns (0 allocations: 0 bytes)

devmotion avatar Jan 25 '23 20:01 devmotion

Sorry, I'm confused: if you're OK with ReshapedArray there, what's the issue with the current PR?

marius311 avatar Jan 25 '23 20:01 marius311

Well, I still think that 1) replacing the copyto! and dispatching to the chunk-size-based implementation and 2) the map! + ReshapedArray combination are both very unnatural workarounds for a very specific package and use case. But of course I wanted to benchmark Base.ReshapedArray for the example. BTW, as a comparison, the code in the PR seems to perform worse than both examples:

julia> function h4(::Type{T}, result, x, index, chunksize) where {T}
           map!(view(Base.ReshapedArray(result, (length(result),), ()), index:(index + chunksize - 1)), 1:chunksize) do _
               return zero(x)
           end
           return result
       end
h4 (generic function with 1 method)

julia> @btime h4(Float64, $(rand(100, 2)), 0f0, 20, 15);
  10.412 ns (0 allocations: 0 bytes)

(I'm not too surprised that it's easier for the compiler to optimize the for loop than the map!.)

devmotion avatar Jan 25 '23 21:01 devmotion

So it seems the PR actually introduces a regression in its current form?

devmotion avatar Jan 25 '23 21:01 devmotion

I had benchmarked some full gradient calls, and there I saw essentially no difference (presumably extracting the gradient is not a dominant part of the computation except for trivial functions). But you're right, it looks like the extract_gradient_chunk! code alone is a little slower for y::Real. But also try the case y::Dual; on my machine it's only about 30% slower (for the exact code in the PR):

# PR
julia> @btime ForwardDiff.extract_gradient_chunk!(Nothing, $(rand(100, 2)), $(ForwardDiff.Dual(rand(16)...)), 20, 15);
  10.052 ns (0 allocations: 0 bytes)

# master
julia> @btime ForwardDiff.extract_gradient_chunk!(Nothing, $(rand(100, 2)), $(ForwardDiff.Dual(rand(16)...)), 20, 15);
  8.530 ns (0 allocations: 0 bytes)

However, copyto! is like 5X slower than that on my machine, so it's not actually an alternative.

So at the moment I guess it boils down to a small speed regression in extract_gradient_chunk!, which should negligibly impact realistic gradient calls, in exchange for having this package work on GPU.

marius311 avatar Jan 25 '23 21:01 marius311

@marius311 is this still being worked on?

vpuri3 avatar Nov 01 '23 20:11 vpuri3

Was this not merged earlier due to a regression on presently working cases? If so, then we can move this implementation to a package extension, ForwardDiffGPUArraysExt, which defines ForwardDiff.extract_gradient_chunk!(::Type{T}, ::AbstractGPUArray, ...)
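To make the suggestion concrete, here is a hedged sketch of the core pattern such an extension could dispatch to (the extension name ForwardDiffGPUArraysExt, the exact method signature, and the helper name below are assumptions from this thread, not actual package code). It is shown on a plain Array so it runs anywhere; on a GPU array, map! would launch a kernel instead of scalar writes:

```julia
# Illustrative only: the map!-over-a-flat-view pattern from this PR, which
# avoids scalar indexing. In the real code path, `get_partial(i)` would be
# `partials(T, dual, i)`.
function extract_gradient_chunk_sketch!(result, get_partial, index, chunksize)
    flat = Base.ReshapedArray(result, (length(result),), ())  # non-allocating
    map!(view(flat, index:index+chunksize-1), 1:chunksize) do i
        get_partial(i)
    end
    return result
end

r = zeros(4, 2)
extract_gradient_chunk_sketch!(r, i -> Float64(i), 3, 4)
# vec(r)[3:6] is now [1.0, 2.0, 3.0, 4.0]
```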

vpuri3 avatar Nov 01 '23 20:11 vpuri3

My summary from what I remember is what I said above:

at the moment I guess it boils down to a small speed regression in extract_gradient_chunk!, which should negligibly impact realistic gradient calls, in exchange for having this package work on GPU

I unfortunately ran out of time to dedicate to responding to the requests here, but anyone is absolutely welcome to finish it off, or to close it and do the extension-package solution.

marius311 avatar Nov 01 '23 21:11 marius311

Would that be a workable solution in your opinion, @devmotion?

vpuri3 avatar Nov 03 '23 20:11 vpuri3

@devmotion

vpuri3 avatar Nov 15 '23 14:11 vpuri3

@mcabbott, what do you think about moving this change to a GPUArrays extension?

vpuri3 avatar Nov 16 '23 18:11 vpuri3