ReverseDiff.jl

Using ReverseDiff as a Knet backend instead of AutoGrad

Open ilkerkesen opened this issue 7 years ago • 6 comments

Hi,

I ported Knet's MNIST example. You can see the changes I made by comparing the current reversediff branch with the revision from two commits earlier.

Speed Issues

Although I am taking advantage of ReverseDiff's tape compilation feature, ReverseDiff is currently considerably slower than AutoGrad on this MLP example. Here are the results:

60.575495 seconds (1.21 M allocations: 2.609 GB, 0.44% gc time) (ReverseDiff.jl - compiled)
32.504943 seconds (5.49 M allocations: 6.813 GB, 1.42% gc time) (AutoGrad.jl)

Capabilities

  • ReverseDiff does not have the ability to work with KnetArrays.
  • In Knet, we use the ReLU activation (actually max(0,x)), but ReverseDiff is currently not able to take the derivative of this operation.
  • In AutoGrad, we have a loss function whose first input parameter is a weights bundle. It can be an array, a tuple, a dictionary, or a combination of these structures. This is convenient, because we can use the same loss function for different networks (e.g. a 1-hidden-layer MLP and a 2-hidden-layer MLP both use the same loss function). Unlike AutoGrad, in ReverseDiff we need to pass all parameters to the loss function.
  • I think indexing does not work for ReverseDiff. In neural networks, we take heavy advantage of indexing, which brings both speed and memory improvements (the old method we were using was matrix multiplication with one-hot vectors). This is what I'm talking about:
julia> using AutoGrad

julia> using ReverseDiff

julia> using ReverseDiff: gradient

julia> f(x,y,i) = sumabs2(x[i]-y)
f (generic function with 1 method)

julia> gradient(f, (rand(3,4),1,1))
ERROR: MethodError: objects of type Int64 are not callable
 in Type at /mnt/kufs/scratch/ikesen16/.julia/somon/v0.5/ReverseDiff/src/api/Config.jl:46 [inlined]
 in Type at /mnt/kufs/scratch/ikesen16/.julia/somon/v0.5/ReverseDiff/src/api/Config.jl:37 [inlined] (repeats 2 times)
 in gradient(::Function, ::Tuple{Array{Float64,2},Int64,Int64}) at /mnt/kufs/scratch/ikesen16/.julia/somon/v0.5/ReverseDiff/src/api/gradients.jl:22

julia> gf = grad(f)
(::gradfun) (generic function with 1 method)

julia> gf(rand(3,4),1,1)
3×4 Array{Float64,2}:
 -0.916041  0.0  0.0  0.0
  0.0       0.0  0.0  0.0
  0.0       0.0  0.0  0.0
  • Optional arguments are not supported by ReverseDiff.
  • In the softmax operation, we have a safer version that prevents float overflow by taking advantage of the maximum operation (see the sketch after this list). However, ReverseDiff does not support the maximum/minimum functions.
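
For reference, the safer softmax is just the usual max-shift trick; a minimal sketch (logp_stable is only an illustrative name, not Knet's actual function):

function logp_stable(x)
    # subtract the columnwise maximum before exponentiating so exp cannot
    # overflow; the maximum call here is the operation ReverseDiff rejects for us
    xmax = maximum(x, 1)
    shifted = x .- xmax
    shifted .- log.(sum(exp.(shifted), 1))
end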

ilkerkesen avatar Jun 23 '17 16:06 ilkerkesen

Maybe it's best to split these problems into separate issues, along with the expected/actual behavior.

prcastro avatar Jul 06 '17 13:07 prcastro

I finally had time to look at this; sorry for the late response. There's a very large amount of extra, un-Julian (e.g. unnecessarily type-unstable) code in your example. If we're only comparing ReverseDiff and AutoGrad, then we don't need any of the code besides the loss function (since that's the only place these packages are involved). Here's a much more palatable benchmark:

using BenchmarkTools
import AutoGrad
import ReverseDiff

#########
# Setup #
#########

function loss(w, b, x, ygold)
    ypred = tanh.(w*x .+ b)
    ynorm = ypred .- log.(sum(exp.(ypred), 1))
    -(sum(ygold .* ynorm)) / size(ygold, 2)
end

const w, b, x, y = 0.1 * rand(10,28^2), zeros(10), rand(28^2), zeros(10);
const input = (w, b, x, y);
const output = map(copy, input);

################
# benchmarking #
################

const agrad_loss∇ = AutoGrad.grad((ws, x, ygold) -> loss(ws[1], ws[2], x, ygold))

@btime agrad_loss∇($((w, b)), $x, $y)

const rdiff_loss∇ = ReverseDiff.compile(ReverseDiff.GradientTape(loss, input))

@btime ReverseDiff.gradient!($output, $rdiff_loss∇, $input)

A list of the changes I've made from your original benchmark:

  • used an actual BenchmarkTools harness
  • used dummy initial values and removed all the unnecessary data mangling code (since it has nothing to do with gradient performance)
  • minimized it to a single-layer evaluation. Note that ReverseDiff handles the additional layers the same way as AutoGrad, but benchmarking more than one layer for the sake of comparing gradient evaluation is pointless when a) gradient performance just scales linearly with the number of layers and b) this benchmark used a fixed number of layers anyway.
  • doing the above has the nice side effect of fixing your type-unstable layer evaluation loop
  • fixed deprecation warnings in the loss function

The output on my machine for ReverseDiff is:

julia> @btime ReverseDiff.gradient!($output, $rdiff_loss∇, $input)
  46.903 μs (4 allocations: 352 bytes)

I originally could not get AutoGrad to work on Julia v0.6 (maybe I messed something up?), but the latest AutoGrad master works now:

julia> @btime agrad_loss∇($((w, b)), $x, $y)
  452.553 μs (816 allocations: 98.89 KiB)

Onto addressing your points:

ReverseDiff does not have the ability to work with KnetArrays.

ReverseDiff attempts to support arbitrary A<:AbstractArray, and the KnetArray type is not an AbstractArray. Knet gets away with this by not caring about AD-unaware, non-generic Julia code (which is reasonable). In contrast, ReverseDiff tries to be able to differentiate as much code as possible, even if it's not perfectly type generic, as long as the code works with a reasonable set of standard Julia types.

In Knet, we use the ReLU activation (actually max(0,x)), but ReverseDiff is currently not able to take the derivative of this operation.

This doesn't make sense. ReverseDiff should easily be able to do this via forward-mode AD. Maybe you found a bug - can you show me an example?
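
A minimal reproduction along these lines would be ideal (just a sketch, I haven't tried to reproduce your exact setup):

using ReverseDiff

# ReLU written as a broadcasted max(0, x), as described above
relu(x) = max.(0, x)

# I'd expect this to go through ReverseDiff's broadcast machinery without issue;
# if it errors for you, please paste the exact error
ReverseDiff.gradient(x -> sum(relu(x)), randn(5))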

In AutoGrad, we have a loss function whose first input parameter is a weights bundle. It can be an array, a tuple, a dictionary, or a combination of these structures. This is convenient, because we can use the same loss function for different networks (e.g. a 1-hidden-layer MLP and a 2-hidden-layer MLP both use the same loss function). Unlike AutoGrad, in ReverseDiff we need to pass all parameters to the loss function.

ReverseDiff tries to expose a simple, general API rather than a "magical", use-case specific one. The idea is that it's easier to build the latter kind of API on top of the former than it is to do the reverse. For example, AutoGrad is focused on ML, so it's just munging whatever container type it sees for differentiable state in a way ML folks are used to. ReverseDiff doesn't assume an ML use case, but somebody could easily make a container munging layer on top of ReverseDiff for ML purposes.
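
As a rough illustration (a sketch only, not an API that ReverseDiff ships; grad_dict is a made-up name), such a layer could lower a Dict of parameter arrays onto the tuple API that ReverseDiff already accepts:

using ReverseDiff

# sketch: differentiate a loss that takes a Dict of parameter arrays by
# round-tripping through the tuple form ReverseDiff understands
function grad_dict(f, d::Dict)
    ks = collect(keys(d))
    vals = tuple([d[k] for k in ks]...)
    gs = ReverseDiff.gradient((vs...) -> f(Dict(zip(ks, vs))), vals)
    Dict(zip(ks, gs))
end

Something like grad_dict(d -> loss(d[:w], d[:b], x, y), Dict(:w => w, :b => b)) would then hand back a Dict of gradients keyed the same way.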

Additionally, it's already generally accepted in the Julia AD world that nondifferentiated parameters get passed via closures. This is actually far cleaner than having AD APIs support additional parameters. The only pitfall here for reverse mode is that you can do better static optimization if you have placeholder parameters; however, this isn't an arena in which AutoGrad can compete, since it doesn't do any static optimization anyway.
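
Concretely, the closure pattern looks like this (a toy loss just for illustration, with made-up names):

using ReverseDiff

# toy parameters and data
W, X, Y = rand(3, 4), rand(4), rand(3)

# X and Y are captured by the closure and are not differentiated;
# only W is an input on the tape
ReverseDiff.gradient(w -> sum(abs2.(w*X .- Y)), W)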

I think indexing does not work for ReverseDiff.

This is false; you're just not using ReverseDiff's API correctly. Here's an example that mirrors what you're doing with AutoGrad:

julia> gradient(x -> f(x, 1, 1), rand(3, 4))
3×4 Array{Float64,2}:
 -1.09042  0.0  0.0  0.0
  0.0      0.0  0.0  0.0
  0.0      0.0  0.0  0.0

Note that for functions containing performance-intensive scalar indexing, ReverseDiff will generally outperform AutoGrad, since ReverseDiff does some clever persistence tricks rather than naively recording getindex operations to the tape.
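
For instance, a loss built out of scalar getindex calls like the sketch below still records and differentiates fine (gather_loss is just an illustrative name for the access pattern):

using ReverseDiff

# sum the squares of a strided subset of entries via scalar indexing, the
# kind of pattern one would otherwise emulate with one-hot matvecs
gather_loss(w) = sum(abs2(w[j]) for j in 1:3:length(w))

ReverseDiff.gradient(gather_loss, rand(3, 4))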

Optional arguments are not supported by ReverseDiff.

This is also false - you even use ReverseDiff with optional arguments in your original example! Maybe you meant something other than "optional arguments"?

In the softmax operation, we have a safer version that prevents float overflow by taking advantage of the maximum operation. However, ReverseDiff does not support the maximum/minimum functions.

Once again, this doesn't make sense - AFAICT from your description, ReverseDiff should be able to handle this. It'd be great to see an example so I can debug it.


Finally, I should note that I'm not working much on the current ReverseDiff version. All my efforts are going towards Cassette, which is a new prototype of a native Julia execution tracer. Once it's done, both ReverseDiff and ForwardDiff will be totally rebuilt on top of it. If ReverseDiff doesn't meet your needs right now, we might be better off waiting until Cassette is released than spending time enhancing ReverseDiff as it is.

jrevels avatar Jul 06 '17 22:07 jrevels

@ilkerkesen @denizyuret Any response to my above comment? I'm specifically interested in seeing code for the softmax/ReLU problems that were reported. I want to make sure some new work I'm doing for the Julia 1.0 timeframe will be usable w.r.t. Knet (keeping https://github.com/denizyuret/Knet.jl/issues/144 in mind).

jrevels avatar Aug 07 '17 21:08 jrevels

I am a bit late here, but I've run the benchmark. It works for me on AutoGrad's master branch and Julia 0.6.

This is what I get:

@btime agrad_loss∇($(w, b), $x, $y)
432.291 μs (748 allocations: 97.39 KiB)
@btime ReverseDiff.gradient!($output, $rdiff_loss∇, $input)
41.052 μs (4 allocations: 352 bytes)

I also tested increasing the array sizes from 10 to 1000:

@btime agrad_loss∇($(w, b), $x, $y)
2.811 ms (748 allocations: 6.15 MiB)
@btime ReverseDiff.gradient!($output, $rdiff_loss∇, $input)
9.314 ms (4 allocations: 15.91 KiB)

garibarba avatar Aug 31 '17 10:08 garibarba

AutoGrad errored for me, but switching to the master branch fixed the problems. I have about the same results:

julia> @btime agrad_loss∇($((w, b)), $x, $y)
  466.866 μs (816 allocations: 98.89 KiB)
([0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0; … ; 0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0], [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
julia> @btime ReverseDiff.gradient!($output, $rdiff_loss∇, $input)
  52.445 μs (4 allocations: 352 bytes)
([0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0; … ; 0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0], [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0  …  0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], [2.30259, 2.30259, 2.30259, 2.30259, 2.30259, 2.30259, 2.30259, 2.30259, 2.30259, 2.30259])

But AutoGrad only found the gradient with respect to (w, b), while ReverseDiff found it with respect to w, b, x, and y. Using AutoGrad for all of them slowed it down to a minimum/median of 525/539 microseconds, while using ReverseDiff for only w and b took a minimum/median of 47.5/47.7 microseconds.
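
For anyone curious, restricting the tape to w and b is just a matter of closing over x and y; roughly (a sketch reusing loss, w, b, x, y from the benchmark earlier in the thread):

# close over x and y so that only w and b are inputs on the compiled tape
const rdiff_wb∇ = ReverseDiff.compile(ReverseDiff.GradientTape((w, b) -> loss(w, b, x, y), (w, b)))
const output_wb = (similar(w), similar(b))

@btime ReverseDiff.gradient!($output_wb, $rdiff_wb∇, $((w, b)))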

In the example on ReverseDiff's readme, ReverseDiff was 20x faster than AutoGrad when I tried it a few days ago on another computer.

I am excited for the Cassette-based overhaul (jrevels' youtube video from JuliaCon 2017 is great), especially because of ambiguity errors of the sort:

julia> const rdiff∇f = ReverseDiff.compile(ReverseDiff.GradientTape(f, randn(80)))
ERROR: MethodError: *(::RowVector{ReverseDiff.TrackedReal{Float64,Float64,ReverseDiff.TrackedArray{Float64,Float64,1,Array{Float64,1},Array{Float64,1}}},ReverseDiff.TrackedArray{Float64,Float64,1,Array{Float64,1},Array{Float64,1}}}, ::ReverseDiff.TrackedArray{Float64,Float64,1,Array{Float64,1},Array{Float64,1}}) is ambiguous. Candidates:
  *(x::AbstractArray{T,2} where T, y::ReverseDiff.TrackedArray{V,D,N,VA,DA} where DA where VA where N) where {V, D} in ReverseDiff at /home/chris/.julia/v0.6/ReverseDiff/src/derivatives/linalg/arithmetic.jl:193
  *(x::AbstractArray, y::ReverseDiff.TrackedArray{V,D,N,VA,DA} where DA where VA where N) where {V, D} in ReverseDiff at /home/chris/.julia/v0.6/ReverseDiff/src/derivatives/linalg/arithmetic.jl:193
  *(rowvec::RowVector{T,V} where V<:(AbstractArray{T,1} where T), vec::AbstractArray{T,1}) where T<:Real in Base.LinAlg at linalg/rowvector.jl:170
Possible fix, define
  *(::RowVector{ReverseDiff.TrackedReal{V,D,ReverseDiff.TrackedArray{V,D,1,VA,DA}},V} where V<:(AbstractArray{T,1} where T), ::ReverseDiff.TrackedArray{V,D,1,VA,DA})

But I'll try to work around this in the meantime.

chriselrod avatar Sep 10 '17 02:09 chriselrod

Bump: any news or decisions regarding the future of this?

DoktorMike avatar May 13 '18 07:05 DoktorMike