
Upsampling GPU Kernel for Flux

avik-pal opened this issue on Feb 24 '19 · 5 comments

Current Issues:

  • [ ] The accumulation on line 47 of upsample.jl is not safe without atomic addition; it produces incorrect and inconsistent answers in the backward pass (see the CPU sketch below).
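
Not the kernel itself, just a CPU sketch (my own illustration, not code from this PR) of why a plain += scatter is unsafe: several output elements map to the same input element in the backward pass, so concurrent increments can overwrite each other and drop contributions.

    using Base.Threads

    x = zeros(Float64, 4)
    y = ones(Float64, 4_000)
    @threads for i in eachindex(y)
        x[mod1(i, 4)] += y[i]   # racy read-modify-write, analogous to line 47 without atomics
    end
    @show sum(x)                # with more than one thread this usually comes out below 4000: lost updates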

@maleadt suggested on Slack using CUDAatomics for now, but I could not get it working. My concerns with that:

  1. It needs a CuDeviceArray, and I am not entirely sure how to pass a CuArray to it (see the sketch after this list).
  2. Should we be adding a dependency on a package that is not recommended?
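
For context, here is a minimal sketch of how a CuArray normally reaches device code as a CuDeviceArray: as far as I understand, the @cuda macro performs the conversion on launch. The kernel and names below are only illustrative, not code from this PR, and whether CUDAatomics can hook into the same mechanism is exactly what I am unsure about.

    using CuArrays, CUDAnative

    # Illustrative kernel: `a` arrives inside the kernel as a CuDeviceArray.
    function add_one_kernel(a)
        i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
        if i <= length(a)
            @inbounds a[i] += 1f0
        end
        return nothing
    end

    x = CuArray(zeros(Float32, 16))      # host-side handle: a CuArray
    @cuda threads=16 add_one_kernel(x)   # @cuda converts the CuArray to a CuDeviceArray on launch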

Stuff to handle before it is ready for merging:

  • [ ] Open the corresponding pull requests in Flux.jl and NNlib.jl to test the complete pipeline end to end
  • [ ] Add tests to verify the correctness of the code

I am not entirely sure this kernel belongs at this location (inside CUDNN); I can move it elsewhere if needed.

cc @MikeInnes

avik-pal · Feb 24 '19

Okay, so I had a crack at this today.

Firstly, CUDAnative itself now supports atomics, so we can make a direct call like:

    @inbounds CUDAnative.atomic_add!(pointer(x, x_idx), y[y_idx])

For the sake of interest, the call to pointer (I don't know where it is actually defined, I guess CUDAnative) probably only does something like:

    ptr = convert(CuArrays.CuPtr{T}, CuArrays.buffer(x, x_idx))
    CUDAnative.DevicePtr{T,AS.Global}(ptr)

However, the ∇upsample code itself has two bugs: firstly, after the call to similar we need to zero out the tensor, or we end up adding to whatever garbage values it was initialized with; secondly, the height/width need to be recalculated from dx, or we get shifted indices (unless I fundamentally misunderstood how the gradients will propagate).

tl;dr:

    function ∇upsample_kernel(state, y, x, height, width, channels, batch, stride, scale)
        i = @linearidx y state

        # Decompose the linear index into (h, w, c, b) coordinates of the upsampled output y.
        # height/width are the dimensions of the downsampled x, so height * stride[1] etc.
        # are the corresponding dimensions of y.
        y_idx = i
        y_h = (i - 1) % (height * stride[1]) + 1
        i = (i - 1) ÷ (height * stride[1])
        y_w = i % (width * stride[2]) + 1
        i = i ÷ (width * stride[2])
        y_c = i % channels + 1
        i = i ÷ channels
        y_b = i % batch + 1

        # Map the output coordinates back to the input element they were upsampled from.
        x_h = (y_h - 1) ÷ stride[1] + 1
        x_w = (y_w - 1) ÷ stride[2] + 1
        x_c = y_c
        x_idx = (y_b - 1) * width * height * channels + (x_c - 1) * width * height + (x_w - 1) * height + x_h

        # Several threads hit the same x_idx, so the accumulation has to be atomic.
        @inbounds CUDAnative.atomic_add!(pointer(x, x_idx), y[y_idx])

        return nothing
    end

    function ∇upsample(dy::CuArray, stride, scale = 1)
        (height, width, channels, batch) = size(dy)
        @assert height % stride[1] == 0   # dy must tile evenly back onto dx
        @assert width % stride[2] == 0
        dx = similar(dy, (height ÷ stride[1], width ÷ stride[2], channels, batch))
        dx .= 0.0                                        # similar() leaves dx uninitialized, so zero it first
        (height, width, channels, batch) = size(dx)      # recompute dims for the downsampled dx
        gpu_call(∇upsample_kernel, dy, (dy, dx, height, width, channels, batch, stride, scale))
        return dx
    end

Maybe the call to CUDAnative can be made smarter; that's simply how I got it working locally. Maybe @maleadt can give some advice. I guess the correct usage would be @atomic? I did see a recent bug, https://github.com/JuliaGPU/CUDAnative.jl/issues/421, so I didn't test it for now.

One problem I can imagine with the current code: if scale, for example, is not of the same element type T as the CuArray, then the call to CUDAnative will fail (whether that should be caught beforehand or made impossible by changing the function signature of upsample, I don't know).
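
One possible way to rule that out is to tighten the signature and convert scale up front. This is only a sketch of the idea: upsample_typed is a hypothetical name, and it assumes the forward upsample takes the same (x, stride, scale) arguments as ∇upsample above.

    # Hypothetical wrapper: force `scale` to the array's element type before launching,
    # so the device-side arithmetic always sees matching types.
    function upsample_typed(x::CuArray{T}, stride, scale = one(T)) where {T}
        return upsample(x, stride, convert(T, scale))
    end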

As for tests, I haven't written any yet. I manually verified which x_idx/y_idx are generated for a simple (2,2,1,1) -> (4,4,1,1) upsampling case, and empirically, in the model where I use it (with all dims > 1), it seems to work (compared to the alternative of abusing a ConvTranspose as an upsampling layer).
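
For the record, this is roughly the kind of test I have in mind (a sketch only, using the ∇upsample from above with a stride of (2, 2)): every 2x2 block of the output gradient should sum back into one input cell.

    using CuArrays, Test

    dy = CuArray(ones(Float32, 4, 4, 1, 1))
    dx = ∇upsample(dy, (2, 2))

    @test size(dx) == (2, 2, 1, 1)
    @test all(Array(dx) .== 4f0)   # each input cell collects 2*2 gradient contributions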

However, I am happy to contribute some test cases soon as well. Let's get this stale PR moving :)

dbadrian · Sep 02 '19

Great! Please do open a PR and also take a look at @atomic from CUDAnative.

vchuravy · Sep 02 '19

Okay, will do so soon (currently semi-homeless, so it's a tad difficult to get things done these days).

Also, I just took a look at the code in https://github.com/FluxML/NNlib.jl/pull/95 by pshask, which seems to work as well, including on CuArrays. A quick test didn't show big speed differences, but I need to do some proper benchmarking; based on the results, one of the two solutions should be picked. His version works on both CuArrays and plain Arrays, though, which is neat and clean.
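
Something like the following is what I have in mind for the benchmarking. This is only a sketch: ∇upsample_nnlib is a hypothetical name standing in for the NNlib.jl#95 implementation, and CuArrays.@sync is used so the GPU work is actually included in the timing.

    using CuArrays, BenchmarkTools

    dy = CuArray(rand(Float32, 128, 128, 64, 16))

    @btime CuArrays.@sync ∇upsample($dy, (2, 2))          # kernel from this PR
    @btime CuArrays.@sync ∇upsample_nnlib($dy, (2, 2))    # hypothetical: NNlib.jl#95 version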

dbadrian · Sep 02 '19

> Maybe the call to CUDAnative can be made smarter; that's simply how I got it working locally. Maybe @maleadt can give some advice. I guess the correct usage would be @atomic? I did see a recent bug, JuliaGPU/CUDAnative.jl#421, so I didn't test it for now.

To clarify, @atomic disables bounds checking, so that's not an issue in your case. It's just syntactic sugar though, so it's perfectly fine to just call atomic_add! explicitly.
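
For reference, the two spellings side by side in a trivial kernel (just an illustration, not code from this PR; x and y arrive inside the kernel as device arrays):

    using CUDAnative

    function scatter_add_kernel(x, y)
        i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
        if i <= length(y)
            CUDAnative.atomic_add!(pointer(x, 1), y[i])   # explicit intrinsic call
            # CUDAnative.@atomic x[1] += y[i]             # the sugared form of the same operation
        end
        return nothing
    end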

maleadt · Sep 03 '19

Thanks for working on this! A fully functional upsampling layer will be a very important addition to Flux.

sdewaele · Jan 30 '20