fix: regression in non-fast scalar indexing support
fixes #759
- [x] Jacobian support for GPU Arrays has been restored
- [x] ForwardDiff.gradient now supports GPU Arrays
cc @ChrisRackauckas @devmotion
Codecov Report
:white_check_mark: All modified and coverable lines are covered by tests.
:white_check_mark: Project coverage is 90.05%. Comparing base (463e830) to head (5b8ffab).
Additional details and impacted files
```diff
@@            Coverage Diff             @@
##           master     #760      +/-   ##
==========================================
+ Coverage   89.79%   90.05%   +0.25%     
==========================================
  Files          11       12       +1     
  Lines        1039     1066      +27     
==========================================
+ Hits          933      960      +27     
  Misses        106      106              
```
:umbrella: View full report in Codecov by Sentry.
In https://github.com/JuliaDiff/ForwardDiff.jl/pull/472, the seed! (etc.) functions were written in a generic (non-type-specific) way that should have supported GPU arrays. This PR adds further specializations of seed! for some specific types by using extensions. But why does the previous approach no longer work? It seemed better, since it supports non-fast scalar indexing even for arrays that are not GPU arrays.
Has it been properly explored whether the existing functions can be written in an alternative way that supports both fast and non-fast scalar-indexed arrays with the same generic code (which would avoid any new extensions)?
Edit: I had missed that https://github.com/JuliaDiff/ForwardDiff.jl/pull/739 reverted some of #472.
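For concreteness, this is the contrast I mean (a toy example assuming CUDA.jl, not ForwardDiff's actual seed! code): per-index iteration needs scalar setindex!, which is disallowed on GPU arrays, while the same update written as a broadcast over an explicit index set goes through.

```julia
using CUDA
CUDA.allowscalar(false)

x = CUDA.zeros(Float32, 4)

# iteration over indices (the style master reverted to in #739):
# for i in 1:2
#     x[i] = 1f0   # ERROR: scalar indexing is disallowed
# end

# broadcast over a device-resident index set (the #472 style):
idx = cu([1, 2])   # allocates, but never touches individual elements from the host
x[idx] .= 1f0
```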
Yes, on the master branch seeding is (again) performed without broadcasting. Depending on the structural array type, the set of indices is not readily available in an allocation-free, broadcastable form (e.g. the set of upper-triangular indices for UpperTriangular), and hence I reverted the code back to iterating over indices - without considering that it would break GPU compatibility.
If we want to avoid these allocations (and the broadcasting overhead) for non-GPU arrays, I don't immediately see how this could be solved by a single generic implementation. Possibly the amount of code duplication could be reduced by introducing a helper function or branch that switches between broadcasting and iterating based on the type of the input array (presumably defaulting to iteration?), but even then it would be necessary to add an extension that ensures that GPU arrays use broadcasting.
Alternatively, we could default to broadcasting (with the additional overhead of collecting the indices) and, as an additional optimization, only use iteration for a handful of selected base array types such as Array, UpperTriangular{<:Any,<:Matrix}, etc. This wouldn't require an extension, but with such optimizations we would still need both an implementation with broadcasting and one without.
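Roughly the shape of the helper I have in mind, with made-up names (use_broadcast_seeding and seed_indices! are placeholders, not existing ForwardDiff code, and the real seeding loop also has to construct the Dual/Partials values):

```julia
use_broadcast_seeding(::Type{<:AbstractArray}) = false   # default: iterate over indices

function seed_indices!(dest::AbstractArray, idxs, vals)
    if use_broadcast_seeding(typeof(dest))
        # materialize the (possibly lazy) index set so it is broadcastable; this
        # allocates, and on a GPU the collected indices would also have to live on
        # the device (e.g. via Adapt.jl)
        view(dest, collect(idxs)) .= vals
    else
        # allocation-free path for ordinary CPU arrays
        for (i, v) in zip(idxs, vals)
            dest[i] = v
        end
    end
    return dest
end

# in a package extension, GPU arrays would opt into the broadcasting branch:
# use_broadcast_seeding(::Type{<:GPUArraysCore.AbstractGPUArray}) = true
```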
What are your thoughts @KristofferC?
bump on this
Testing this patch with Lux.jl, it still causes regressions in cases where we have a wrapper over a CuArray.
> only use iteration for a handful of selected base array types such as Array

This seems like a good solution that avoids regressions on use-cases that were supported prior to #739. There's also ArrayInterface.fast_scalar_indexing, but I'm not sure how the maintainers feel about taking a dependency on ArrayInterface.
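For reference, that trait is just a query on the array type, so the seeding code could branch on it directly instead of defining its own trait plus extension (sketch only; whether it's worth the dependency is the open question):

```julia
using ArrayInterface

ArrayInterface.fast_scalar_indexing(Vector{Float64})   # true (the default for AbstractArray)
# with CUDA loaded, ArrayInterface's GPU support makes this return false:
# ArrayInterface.fast_scalar_indexing(typeof(CUDA.zeros(Float32, 3)))
```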
bump on this
Can you also add tests of https://github.com/JuliaDiff/ForwardDiff.jl/pull/739 and https://github.com/JuliaDiff/ForwardDiff.jl/pull/743 for these new branches?
The GPUArray backends don't support broadcasting with wrapped arrays nicely, so those tests will mostly fail.
```julia
julia> using CUDA, LinearAlgebra

julia> x = LowerTriangular(cu(rand(Float32, 4, 4)))
4×4 LowerTriangular{Float32, CuArray{Float32, 2, CUDA.DeviceMemory}}:
 0.960887   ⋅         ⋅          ⋅
 0.316333  0.612238   ⋅          ⋅
 0.236091  0.209854  0.0883058   ⋅
 0.370694  0.732681  0.0111619  0.270063

julia> x[diagind(x)] .= 10
```
And they won't have unassigned elements:
```julia
julia> CuArray{Float32}(undef, 2, 3)
2×3 CuArray{Float32, 2, CUDA.DeviceMemory}:
 -5.81535f-36  1.25147f7  -2.50125f-11
 -1.98624      1.84662     1.95155

julia> JLArray{Float32}(undef, 3, 4)
3×4 JLArray{Float32, 2}:
 6.771f-42  6.771f-42  9.42f-43  0.0
 6.771f-42  6.771f-42  0.0       0.0
 6.771f-42  6.771f-42  0.0       4.5657f-41
```
@ChrisRackauckas and @devmotion we're currently a bit in version hell between Lux and Bijectors with regard to ForwardDiff (https://github.com/LuxDL/Lux.jl/issues/1530). Is there any way this can be merged?
Not in its current form - it seems the PR also has merge conflicts.