ForwardDiff.jl
make extract_gradient[_chunk]! GPU compatible
Fixes this:
```julia
ForwardDiff.gradient(sum, cu(rand(10)))  # Error: scalar indexing
```
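For context, the error comes from the gradient-extraction code writing elements one at a time, which GPU array types disallow outside kernels. A minimal sketch of the failing pattern (assuming CUDA.jl, with scalar indexing explicitly made an error):

```julia
using CUDA
CUDA.allowscalar(false)  # scalar indexing now throws, as in non-interactive use

result = CUDA.zeros(10)
for i in 1:5
    result[i] = 1.0  # ERROR: Scalar indexing is disallowed
end
```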
My ad hoc tests suggest no performance impact on 1.8.3 for result::Vector; however, there is an extra (2 allocations: 80 bytes) for result::Matrix, I think due to a reshape/view thing.
I'm not sure if there's some other form of the code that works without scalar indexing and without this allocation, but I'm open to suggestions. Another alternative is to write some glue code here or in CUDA.jl that handles CuArray specifically, but that seems more brittle.
I'd vote that this solution is OK (pending other feedback) for the benefit of CUDA compatibility with generic code, but that's just my opinion.
Codecov Report
Base: 86.83% // Head: 89.74% // Increases project coverage by +2.90% :tada:
Coverage data is based on head (9dc9769) compared to base (6a19554). Patch coverage: 100.00% of modified lines in pull request are covered.
Additional details and impacted files
```diff
@@            Coverage Diff             @@
##           master     #619      +/-   ##
==========================================
+ Coverage   86.83%   89.74%   +2.90%
==========================================
  Files           9       10       +1
  Lines         889      965      +76
==========================================
+ Hits          772      866      +94
+ Misses        117       99      -18
```
| Impacted Files | Coverage Δ | |
|---|---|---|
| src/gradient.jl | 98.92% <100.00%> (+0.06%) | :arrow_up: |
| src/apiutils.jl | 100.00% <0.00%> (ø) | |
| src/ForwardDiff.jl | 100.00% <0.00%> (ø) | |
| src/jacobian.jl | 99.34% <0.00%> (+0.03%) | :arrow_up: |
| src/derivative.jl | 94.87% <0.00%> (+0.13%) | :arrow_up: |
| src/partials.jl | 84.34% <0.00%> (+0.13%) | :arrow_up: |
| src/config.jl | 87.27% <0.00%> (+0.23%) | :arrow_up: |
| src/dual.jl | 82.77% <0.00%> (+3.01%) | :arrow_up: |
| src/prelude.jl | 93.93% <0.00%> (+56.43%) | :arrow_up: |
Added a bare-bones set of JLArray tests.
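Roughly along these lines (a sketch, not the exact test file; JLArray from JLArrays.jl is a CPU-backed array type that mimics GPU semantics, so the GPU code path can be exercised in plain CI):

```julia
using Test, ForwardDiff, JLArrays
import GPUArraysCore

# Make scalar indexing an error so the test fails the way it would on a real GPU.
GPUArraysCore.allowscalar(false)

@testset "gradient on GPU-style arrays" begin
    x = JLArray(rand(10))
    g = ForwardDiff.gradient(sum, x)
    @test collect(g) ≈ ones(10)
end
```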
OK, found the magic code which gives no allocation or speed regressions vs. current master for Arrays of any dimension, and which works with CuArrays. It's this:

```julia
map!(view(Base.ReshapedArray(result, (length(result),), ()), index:index+chunksize-1), 1:chunksize) do i
    @inbounds partials(T, dual, i)
end
```

The trick is indeed, as you suggested @mcabbott, to use a ReshapedArray, which unlike reshape is non-allocating for Arrays (TIL). And CUDA.jl is smart enough to dispatch on view(ReshapedArray(CuArray)) and do its thing.
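For anyone else for whom this is news, a quick way to see the difference (a sketch; the helper names are made up, and exact allocation counts may vary across Julia versions):

```julia
using BenchmarkTools

# reshape on an Array builds a new Array object on the heap...
via_reshape(A) = sum(view(reshape(A, :), 1:3))
# ...while the ReshapedArray wrapper is a plain struct the compiler can keep off the heap.
via_wrapper(A) = sum(view(Base.ReshapedArray(A, (length(A),), ()), 1:3))

A = rand(100, 2)
@btime via_reshape($A)   # small allocation from the reshape
@btime via_wrapper($A)   # 0 allocations
```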
Squashed into a single commit. Ready on my end.
I don't want this to get lost after the holidays; could this get looked at again?
> adding definitions of copyto! with Tuple to CUDA

That seems like a stretch with CUDA, to be honest. Although these things aren't specified perfectly, copyto! seems aimed at being an array interface, not really at tuples. I would expect such a PR not to be accepted in CUDA, but we can try.
In the meantime, this PR has no regressions but enables GPU, so can it still be merged? And we can revisit later if we get the CUDA interface?
Here's an issue with the copyto! option. How do you implement extract_gradient_chunk!(::Type{T}, result::AbstractArray, y::Real, index, chunksize) so that it's inferred and allocation-free for Arrays? The closest I can come up with is:

```julia
copyto!(result, index, ntuple(i -> 0, Val(chunksize)), 1, chunksize)
```

which is not inferred, although the Julia compiler must work some magic because it's allocation-free. But it's 2X slower than my solution in this PR.
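For what it's worth, the inference failure comes from Val(chunksize) being constructed from a runtime value, so the tuple length is unknown at compile time; a tiny sketch (the helper g is hypothetical):

```julia
julia> g(chunksize) = ntuple(i -> 0, Val(chunksize));

julia> @code_warntype g(4)  # Val(chunksize) has a runtime-dependent type,
                            # so the NTuple length is not inferred
```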
So in lieu of a solution, it seems that even if we get copyto!(::CuArray, ::Tuple) (which, fwiw, was suggested on Slack might be possible), it's not enough to write the code here generically.
I should also note that one nice thing about the copyto! option is that it would allow the specialized copyto!(::CuArray, ::Tuple) to pass the tuple data into the kernel by reference, as opposed to by value, which is what's currently happening (and which has a 4K memory limit, as pointed out on Slack). I'm not sure that's a real worry except for chunk sizes so large they'd be unrealistic, but I wanted to point it out.
But in any case, I can't figure out how to do it with only copyto! while avoiding regressions, c.f. the comment just above.
> How do you implement extract_gradient_chunk!(::Type{T}, result::AbstractArray, y::Real, index, chunksize) so that it's inferred and allocation-free for Arrays?

With fill!, as done in the definition of extract_gradient! for y::Real, I would assume?
Thanks, fill! doesn't have a variant with an offset / chunk size though, unless I'm mistaken?

> Thanks, fill! doesn't have a variant with an offset / chunk size though, unless I'm mistaken?

No, not that I know of. But couldn't you fill! a view?
I still think that it would be nice if we didn't have to use slightly unorthodox implementations just to be able to add implicit support for another package.
To mention the obvious, on Julia >= 1.9 yet another alternative would be to add e.g. GPUArraysCore (I guess that might be sufficient?) as a weak dependency and add a special method for GPU arrays there. The disadvantage would be that it would not fix Julia < 1.9.
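For reference, a rough sketch of what that weak-dependency route could look like (the module and file names are illustrative, the method body just reuses the map!/view approach from this PR, and the UUID should be checked against the General registry):

```julia
# Project.toml
# [weakdeps]
# GPUArraysCore = "46192b85-c4d5-4398-a991-12ede77f4527"
# [extensions]
# ForwardDiffGPUArraysCoreExt = "GPUArraysCore"

# ext/ForwardDiffGPUArraysCoreExt.jl
module ForwardDiffGPUArraysCoreExt

using ForwardDiff, GPUArraysCore
using ForwardDiff: partials

# GPU-specific method; the generic Array code path stays untouched.
# (vec is fine here: on GPU arrays the reshape wrapper shares storage,
# and the Array-header allocation concern doesn't apply.)
function ForwardDiff.extract_gradient_chunk!(::Type{T}, result::AbstractGPUArray,
                                             dual, index, chunksize) where {T}
    map!(view(vec(result), index:(index + chunksize - 1)), 1:chunksize) do i
        partials(T, dual, i)
    end
    return result
end

end # module
```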
> No, not that I know of. But couldn't you fill! a view?

No, that's allocating for multidimensional Arrays, which is the entire issue that led to the solution currently in this PR.
> I still think that it would be nice if we wouldn't have to use slightly unorthodox implementations, just to be able to add implicit support for another package.

Yeah, I'm sympathetic to that, also because of the by-value thing, but ultimately, although the code looks a little weird, in the case of Array it's purely relying on the get/setindex interface, and in the case of CuArrays, on the map! interface, which are definitely very orthodox.
> the map! interface, which are definitely very orthodox.

Oh, with "unorthodox" I was rather referring to the fact that currently the PR removes a simple copyto! statement and adds a workaround based on ReshapedArray.
No, I get that. All I'm saying is it's unorthodox in terms of the usage of ReshapedArray, sure, but in terms of the underlying native code, for Arrays, it's compiled to nearly identical code as before this PR, since either this or the old for-loop ultimately ends up at setindex!(::Array, ..).
> No, that's allocating for multidimensional Arrays, which is the entire issue that led to the solution currently in this PR.
We could use fill! + Base.ReshapedArray :smile: That seems to be faster than the current for-loop:
```julia
julia> function h2(::Type{T}, result, x, index, chunksize) where {T}
           fill!(view(Base.ReshapedArray(result, (length(result),), ()), index:(index + chunksize - 1)), zero(x))
           return result
       end
h2 (generic function with 1 method)

julia> @btime h2(Float64, $(rand(100, 2)), 0f0, 20, 15);
  2.570 ns (0 allocations: 0 bytes)

julia> function h3(::Type{T}, result, x, index, chunksize) where {T}
           offset = index - 1
           for i in 1:chunksize
               result[i + offset] = zero(x)
           end
           return result
       end
h3 (generic function with 1 method)

julia> @btime h3(Float64, $(rand(100, 2)), 0f0, 20, 15);
  5.338 ns (0 allocations: 0 bytes)
```
Sorry, I'm confused: if you're OK with ReshapedArray there, what's the issue with the current PR?
Well, I still think that 1) replacing the copyto! and dispatching to the chunk-size-based implementation and 2) the map! + ReshapedArray are both very unnatural workarounds for a very specific package and use case. But of course I wanted to benchmark Base.ReshapedArray for the example. BTW, as a comparison, the code in the PR seems to perform worse (than both examples):
```julia
julia> function h4(::Type{T}, result, x, index, chunksize) where {T}
           map!(view(Base.ReshapedArray(result, (length(result),), ()), index:(index + chunksize - 1)), 1:chunksize) do _
               return zero(x)
           end
           return result
       end
h4 (generic function with 1 method)

julia> @btime h4(Float64, $(rand(100, 2)), 0f0, 20, 15);
  10.412 ns (0 allocations: 0 bytes)
```
(I'm not too surprised that it's easier for the compiler to optimize the for loop than the map!.)
So it seems the PR actually introduces a regression in its current form?
I had benchmarked some full gradient calls, and there I saw essentially no difference (presumably extracting the gradient is not a dominant part of the computation except for trivial functions). But you're right, it looks like the extract_gradient_chunk! code alone is a little slower for y::Real; but also try the case y::Dual, where on my machine it's only like 30% slower (using the exact code in the PR):
```julia
# PR
julia> @btime ForwardDiff.extract_gradient_chunk!(Nothing, $(rand(100, 2)), $(ForwardDiff.Dual(rand(16)...)), 20, 15);
  10.052 ns (0 allocations: 0 bytes)

# master
julia> @btime ForwardDiff.extract_gradient_chunk!(Nothing, $(rand(100, 2)), $(ForwardDiff.Dual(rand(16)...)), 20, 15);
  8.530 ns (0 allocations: 0 bytes)
```
However, copyto! is like 5X slower than that on my machine, so that's not actually an alternative.
So at the moment I guess it boils down to a small speed regression in extract_gradient_chunk!, which should negligibly impact realistic gradient calls, in exchange for having this package work on GPU.
@marius311 is this still being worked on?
Was this not merged earlier due to a regression on presently working cases? If so, then we could move this implementation to an extension package, ForwardDiffGPUArraysExt, which defines ForwardDiff.extract_gradient_chunk!(::Type{T}, ::AbstractGPUArray, ...).
My summary from what I remember is what I said above:

> at the moment I guess it boils down to a small speed regression in extract_gradient_chunk which should negligibly impact realistic gradient calls in exchange for having this package work on GPU

I unfortunately ran out of time to dedicate to responding to the requests here, but anyone is absolutely welcome to finish it off, or to close it and do the extension package solution.
Would that be a workable solution in your opinion, @devmotion?
@mcabbott what do you think about moving this change to a GPUArrays ext?