getindex for CuArray to fix repeated indices error
I spent quite a while working out where my gradients were wrong, until eventually tracking it down to the GPU repeated index issue that is mentioned in multiple places:
- https://github.com/FluxML/Zygote.jl/issues/600
- https://github.com/FluxML/Zygote.jl/issues/821
- https://github.com/JuliaGPU/CUDA.jl/issues/89
- https://github.com/JuliaGPU/CUDA.jl/issues/1019
- https://github.com/JuliaLang/julia/pull/31407
Based on a solution suggested at https://discourse.julialang.org/t/increment-elements-of-array-by-index/49694, I added a hack to get the correct gradients, which can be found in this PR. It uses scatter! from NNlib to accumulate values over repeated indices. It is considerably slower than the current version, which on the GPU gives silently incorrect and even stochastic gradients, but it is faster than moving the array off the GPU to do this step.
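In essence the hack swaps the broadcasted assignment in the indexing pullback for an accumulating scatter. A minimal sketch of the idea (not the exact PR code, and assuming the CuArray scatter! methods from NNlibCUDA are loaded):

using NNlib, NNlibCUDA, CUDA

dx = CUDA.zeros(Float32, 3)      # gradient buffer for the indexed array
dy = cu(Float32[2, 2, 2])        # incoming gradient for the three indexed values
inds = cu([1, 1, 2])             # note the repeated index 1
NNlib.scatter!(+, dx, dy, inds)  # dx == [4, 2, 0]: duplicates are summed, not overwritten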
This isn't currently suitable for merging since it adds dependencies and isn't tested beyond the simple case. It might be useful as a starting point for a discussion on how to get this important issue fixed, though, and also might be useful as a hack to others.
I think we should be correct and avoid silent bugs before aiming at being fast. Maybe taking a fast path after a "has duplicates" check can help performance a bit?
Since Zygote doesn't depend on CUDA and NNlib I'm not sure where we should put this. Here but inside a @require? In NNlib/NNlibCUDA?
> Maybe taking a fast path after a "has duplicates" check can help performance a bit?
Yes, I would imagine so.
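Something like the following, perhaps (a hypothetical helper, purely to illustrate the fast path; in practice the indices may need moving off the GPU for the allunique check):

# only pay for scatter! when an index actually repeats
function accum_dy!(dx, dy, inds)
    if allunique(inds)
        view(dx, inds) .+= dy            # no duplicates, so plain broadcast is safe
    else
        NNlib.scatter!(+, dx, dy, inds)  # duplicates present, accumulate them properly
    end
    return dx
end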
> I'm not sure where we should put this
There is already a CUDA require block in lib/broadcast.jl, though that still leaves the NNlib dependency. Another option is to rewrite a simple version of scatter here.
Is this something ReverseDiff could benefit from as well? If so, we could consider adding it as an rrule with RuleConfig{>:HasReverseMode} to NNlibCUDA.
I don't think this is something that can be solved well with an adjoint at this point. The forward pass value would be incorrect by this point already. Specifically, note https://github.com/JuliaGPU/CUDA.jl/issues/89 which describes why this happens. Best fixed at the CUDA.jl level based on how it is implemented.
> The forward pass value would be incorrect by this point already.
Not in general, since assignment broadcasting is used in the backward pass whether or not it is used in the forward pass. For example on the current release:
using Zygote, CUDA
mul2(x) = 2x
function f(a)
    v = view(a, [1, 1, 2])
    sum(mul2.(v))
end
a = [1.0f0, 1.0f0, 1.0f0]
f(a) # Correct, 6.0f0
f(cu(a)) # Correct, 6.0f0
gradient(f, a) # Correct, (Float32[4.0, 2.0, 0.0],)
gradient(f, cu(a)) # Incorrect, (Float32[2.0, 2.0, 0.0],)
The gradient is correct with the change in this PR.
> Best fixed at the CUDA.jl level based on how it is implemented.
The problem is that this seems difficult to fix at the CUDA level based on what @maleadt says at https://github.com/JuliaGPU/CUDA.jl/issues/89.
Can we have some benchmarks here?
So the main concern wrt. dependencies here is not CUDA (because there's already a Requires block for it in Zygote) but NNlib(CUDA). I think one way to make this more palatable would be to extract scatter!(+, dx, dy, inds) into a kernel and create a wrapper accumulation function that dispatches to this kernel when passed CuArrays. The default fallback could reuse the code in https://github.com/FluxML/Zygote.jl/blob/v0.6.33/src/lib/array.jl#L38-L39.
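i.e. something shaped like this, where accum_index! and scatter_add! are hypothetical names rather than actual Zygote/NNlibCUDA functions:

# Generic fallback: the existing sequential view-broadcast accumulation,
# which handles repeated indices correctly on the CPU.
function accum_index!(dx::AbstractArray, dy, inds...)
    dxv = view(dx, inds...)
    dxv .= dxv .+ dy
    return dx
end

# Method picked up when CUDA is loaded: hand off to a dedicated scatter-add
# kernel so repeated indices are summed rather than raced over.
accum_index!(dx::CuArray, dy, inds...) = scatter_add!(dx, dy, inds...)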
I added the check for unique indices. Here are some benchmarks on an NVIDIA RTX A6000 with 1000 indices:
using Zygote, CUDA, BenchmarkTools, Random
using Zygote: ∇getindex
arr = cu(rand(Float32, 1000))
inds_norep = shuffle(1:1000)
inds_norep_cu = cu(inds_norep)
inds_rep = rand(1:1000, 1000)
inds_rep_cu = cu(inds_rep)
yb = cu(rand(Float32, 1000))
# On master
@btime $(∇getindex(arr, (inds_norep ,)))($(yb)) # 11.474 μs (35 allocations: 2.28 KiB)
@btime $(∇getindex(arr, (inds_norep_cu,)))($(yb)) # 15.839 μs (31 allocations: 10.06 KiB)
@btime $(∇getindex(arr, (inds_rep ,)))($(yb)) # 11.380 μs (35 allocations: 2.28 KiB)
@btime $(∇getindex(arr, (inds_rep_cu ,)))($(yb)) # 16.069 μs (31 allocations: 10.06 KiB)
# With this change
@btime $(∇getindex(arr, (inds_norep ,)))($(yb)) # 23.712 μs (53 allocations: 59.36 KiB)
@btime $(∇getindex(arr, (inds_norep_cu,)))($(yb)) # 35.649 μs (50 allocations: 67.16 KiB)
@btime $(∇getindex(arr, (inds_rep ,)))($(yb)) # 3.626 ms (26013 allocations: 1.29 MiB)
@btime $(∇getindex(arr, (inds_rep_cu ,)))($(yb)) # 3.663 ms (26014 allocations: 1.29 MiB)
So it's a 2x slowdown for the previously correct case of no repeated indices due to the uniqueness check, and a 300x slowdown for the previously incorrect case of repeated indices due to the use of NNlib.scatter!.
What should we do here? I think we should take the performance hit for the correctness.
> So the main concern wrt. dependencies here is not CUDA (because there's already a Requires block for it in Zygote) but NNlib(CUDA). I think one way to make this more palatable would be to extract scatter!(+, dx, dy, inds) into a kernel and create a wrapper accumulation function that dispatches to this kernel when passed CuArrays. The default fallback could reuse the code in https://github.com/FluxML/Zygote.jl/blob/v0.6.33/src/lib/array.jl#L38-L39.
Hi, I don't quite understand this suggestion.
I meant that we could copy the scatter kernel definition from NNlibCUDA and use it as an internal function. Pretty messy, but a couple of duplicated methods isn't the end of the world.
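For illustration, a stripped-down version of such an internal kernel (1-D integer indices only, atomic adds to handle duplicates) might look roughly like:

using CUDA

function scatter_add_kernel!(dst, src, inds)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(inds)
        CUDA.@atomic dst[inds[i]] += src[i]  # atomic, so repeated indices accumulate
    end
    return nothing
end

function scatter_add!(dst::CuArray, src::CuArray, inds::CuArray)
    threads = 256
    blocks = cld(length(inds), threads)
    @cuda threads=threads blocks=blocks scatter_add_kernel!(dst, src, inds)
    return dst
end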
Yes, we could do that; it's not ideal, but it seems like the best alternative.
@jgreener64 do you want to implement this plan?
Looking at this again, another option when the indices are not unique is to do the broadcast on the CPU and then move the result back to the GPU. I added that in the last commit. Compared to the above, it benchmarks as:
@btime $(∇getindex(arr, (inds_norep ,)))($(yb)) # 24.253 μs (57 allocations: 59.64 KiB)
@btime $(∇getindex(arr, (inds_norep_cu,)))($(yb)) # 37.904 μs (53 allocations: 67.39 KiB)
@btime $(∇getindex(arr, (inds_rep ,)))($(yb)) # 12.634 μs (21 allocations: 19.92 KiB)
@btime $(∇getindex(arr, (inds_rep_cu ,)))($(yb)) # 20.333 μs (22 allocations: 19.94 KiB)
i.e. it is fast, though I imagine it increases GPU memory requirements. It may be a good way to go here due to its simplicity.
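The non-unique branch is essentially doing something like this (paraphrased rather than the exact commit):

# accumulate on the CPU, where the sequential broadcast sums repeated indices
# correctly, then move the result back to the GPU
function accum_via_cpu(dx::CuArray, dy, inds)
    dx_cpu, dy_cpu = Array(dx), Array(dy)
    dxv = view(dx_cpu, Array(inds))  # Array(inds) also handles GPU index vectors
    dxv .= dxv .+ dy_cpu
    return CuArray(dx_cpu)
end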
Do these times need a CUDA.@sync? It seems a little crazy that copying to an Array could be quicker, but if it really is then that's simple. And if so, it should probably skip the check? allunique is surprisingly slow, as it has to make a Dict I think:
julia> @btime allunique($(shuffle!(collect(1:1000))));
min 11.958 μs, mean 20.876 μs (17 allocations, 49.14 KiB)
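(By a CUDA.@sync I mean timing something like the line below, so that the GPU work actually finishes inside the measured region.)

@btime CUDA.@sync $(∇getindex(arr, (inds_rep_cu,)))($(yb))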
I'm not fully sure, but given this benchmark I think it is quicker to do it on the CPU and copy across. Skipping the uniqueness check and always doing this, which I have just added to the branch, gives:
@btime $(∇getindex(arr, (inds_norep ,)))($(yb)) # 11.843 μs (11 allocations: 16.30 KiB)
@btime $(∇getindex(arr, (inds_norep_cu,)))($(yb)) # 19.797 μs (12 allocations: 16.31 KiB)
@btime $(∇getindex(arr, (inds_rep ,)))($(yb)) # 12.079 μs (11 allocations: 16.30 KiB)
@btime $(∇getindex(arr, (inds_rep_cu ,)))($(yb)) # 19.589 μs (12 allocations: 16.31 KiB)
This is very similar to the results on master above, but I'm unsure of the wider impact of changing every indexing operation on a GPU array as it may be slower or memory-heavy in other contexts.
Are there downstream benchmark tests that can be run (e.g. for Flux) to assess this?
Could you package up some of the test cases you're using and put them in https://github.com/FluxML/Zygote.jl/blob/master/test/cuda.jl?
https://github.com/JuliaDiff/ChainRules.jl/pull/655 implements the same copy-to-CPU idea as 98e87c3. Comments (or benchmarking) would be welcome.
Adding test cases here would also be a great idea, as they will be run on a real GPU via buildkite.
I added a simple test case that is fixed by this PR.
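It is roughly along the lines of the example earlier in the thread, i.e. comparing the CPU and GPU gradients for a view with a repeated index (paraphrasing rather than the exact test):

using Test, Zygote, CUDA

a = Float32[1, 2, 3]
g(x) = sum(2 .* view(x, [1, 1, 2]))                       # index 1 is repeated
@test collect(gradient(g, cu(a))[1]) ≈ gradient(g, a)[1]  # both should be [4, 2, 0]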
Yes it probably makes sense to deal with this on the ChainRules.jl side long term.
Is there a way to benchmark performance on various downstream tasks such as Flux? I'm a little worried about the consequences for GPU memory if a CPU-GPU conversion happens with every indexing step, though in my tests above it seemed okay. The approach of allowing integers/ranges in https://github.com/JuliaDiff/ChainRules.jl/pull/655 without doing a uniqueness check seems sensible.
I think this was fixed in https://github.com/JuliaDiff/ChainRules.jl/pull/655.