Scalar indexing error when computing Jacobian with a `CUDA.CuArray`
Consider the following minimal reproducible example:

```julia
using CUDA, ForwardDiff

x = cu(ones(3))
f(x) = x.^2
ForwardDiff.jacobian(f, x)
```
Result on ForwardDiff 1.0.0:

```julia
julia> ForwardDiff.jacobian(f, x)
3×3 CuArray{Float32, 2, CUDA.DeviceMemory}:
 2.0  0.0  0.0
 0.0  2.0  0.0
 0.0  0.0  2.0
```

which is obviously correct.
On ForwardDiff 1.0.1:

```
ERROR: Scalar indexing is disallowed.
Invocation of getindex resulted in scalar indexing of a GPU array.
This is typically caused by calling an iterating implementation of a method.
Such implementations *do not* execute on the GPU, but very slowly on the CPU,
and therefore should be avoided.

If you want to allow scalar iteration, use `allowscalar` or `@allowscalar`
to enable scalar iteration globally or for the operations in question.
```
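As a stopgap (not a fix), the error message's own suggestion applies: wrapping the call in `CUDA.@allowscalar` permits the scalar iteration for that one expression. Note that this runs the offending element-wise work on the CPU, so it can be very slow for large arrays:

```julia
using CUDA, ForwardDiff

x = cu(ones(3))
f(x) = x.^2

# Temporary workaround: allow scalar indexing for this call only.
# The iterating code path then runs (slowly) on the CPU instead of erroring.
J = CUDA.@allowscalar ForwardDiff.jacobian(f, x)
```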
Issue introduced in: https://github.com/JuliaDiff/ForwardDiff.jl/pull/739
Unfortunate, but it is hard not to accidentally break GPU support as long as it is completely untested: currently the package does not run a single test on a GPU, and I am not even sure the docs make any official claim about GPU compatibility.
For the issue here, two approaches seem possible:
1) use a non-iterating fallback, or
2) add overloads for GPU arrays.
I agree with your point that GPU support is not guaranteed anywhere, so the lack of testing is a natural consequence of that. Still, I would be happy to try and help.
But how?
- Since the iterating functions were introduced for matrices with special structure (diagonal etc.), maybe a non-iterating fallback is the logical implementation here, essentially reverting to the previous behavior for non-diagonal matrices?
- Are GPU tests desirable? If yes, I could start adding a few basic tests in a PR.
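To illustrate what a non-iterating fallback could look like: instead of writing the partials into the result matrix with per-element `getindex`/`setindex!` (which triggers scalar indexing on a `CuArray`), the partials can be extracted with a broadcast over whole columns. A rough sketch, assuming `ydual` is the dual-valued output vector; the function name `extract_jacobian_broadcast!` is illustrative and not ForwardDiff's actual internal API:

```julia
using ForwardDiff

# Hypothetical non-iterating extraction: fill column i of the Jacobian
# from the i-th partial of each element of `ydual`, via broadcasting.
# No scalar getindex/setindex! on individual elements, so this pattern
# also works for GPU arrays.
function extract_jacobian_broadcast!(result::AbstractMatrix,
                                     ydual::AbstractVector, n::Integer)
    for i in 1:n
        # `ForwardDiff.partials.(ydual, i)` broadcasts over the array
        # as a whole; the column assignment is likewise a broadcast.
        result[:, i] .= ForwardDiff.partials.(ydual, i)
    end
    return result
end
```

The outer loop here runs over the `n` input dimensions on the host, which is cheap; only the per-element work is expressed as array operations.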
Referencing https://github.com/JuliaDiff/ForwardDiff.jl/pull/634 here as well (performance impact of broadcasting vs. loops on Arrays).