StochasticOptimization.jl

Automatic Differentiation tools

Open · tbreloff opened this issue 9 years ago · 10 comments

I think it would be smart to use the great tools already created, if they fit our needs or can be improved to fit them. Here's a summary of what I hope to see in JuliaML, and my first impressions of the options.

I want:

  • simple macros/methods to take a complex function and produce a type-stable and fast gradient calculation of all free/learnable parameters
  • simple definition of new "primitives": losses, penalties, transformations, etc for inclusion in AD
  • clean underlying internals (computation graph) so that we can:
    • parallelize
    • plot
    • evolve the structure dynamically

AutoGrad.jl (@denizyuret)

I get the impression that this is very flexible and feature-rich, but the speed worries me. It seems that the function g in g = grad(f) performs the forward pass, graph construction, and backward pass all in one call. Clearly this is not optimal if the graph structure is unchanged from one iteration to the next. If there's a clean way to separate graph construction from the backward gradient calculation, then this package might be really nice. I don't know enough about it to comment on the flexibility of the internal representation, though.
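
For context, the usage pattern I mean looks roughly like this. It's a minimal sketch of AutoGrad.jl's grad API as I understand it; the toy least-squares loss and the data are made up for illustration:

```julia
using AutoGrad

# Toy least-squares loss; `w` (the first argument) is what gets differentiated.
loss(w, x, y) = sum((x * w .- y).^2)

# grad(f) returns a new function that computes dloss/dw.
∇loss = grad(loss)

x = randn(10, 3)          # 10 samples, 3 features
y = randn(10)
w = randn(3)

# Every call re-records the tape (forward pass + graph construction) and then
# runs the backward pass -- the per-iteration overhead described above.
g = ∇loss(w, x, y)        # same shape as w
```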

ForwardDiff (@jrevels @lbenet)

Doesn't seem practical for anything beyond toy problems. Unless I'm missing something?

ReverseDiffSource

My gut is that it will be possible to evolve, manipulate, and plot the underlying graph, as it's explicitly built (as opposed to using traces). I worry about the type stability... I think I remember reading about looping through Vector{Any} objects in hot loops.
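
As a generic illustration of why that worries me (this is not a claim about ReverseDiffSource's actual internals, just a sketch of what iterating a Vector{Any} costs in general):

```julia
# Summing through an untyped container: every `+` is dynamically dispatched and
# every element is boxed, which is exactly what hurts inside a hot loop.
nodes_any = Any[rand() for _ in 1:10^6]
nodes_f64 = Float64[x for x in nodes_any]

function sum_loop(v)
    s = 0.0
    for x in v
        s += x
    end
    return s
end

using BenchmarkTools   # assuming the package is available
@btime sum_loop($nodes_any)   # typically far slower...
@btime sum_loop($nodes_f64)   # ...than the concretely typed version
```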

ReverseDiffOverload

Last commit was 2 years ago

ReverseDiffSparse (@mlubin)

I don't know much about this yet


Please add your thoughts and opinions.

tbreloff · Aug 29 '16 13:08

I should note that this issue might be more appropriate in ObjectiveFunctions.jl, but we can have the discussion here

tbreloff · Aug 29 '16 13:08

JuMP (via ReverseDiffSparse) is probably the fastest and most reliable reverse-mode implementation in Julia at this point. It's also the only AD tool I'm aware of that does sparse Hessians. The downside is that you have to rewrite your functions into JuMP syntax; it's your decision whether that tradeoff is worth it.
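
To make the tradeoff concrete, rewriting a function into JuMP syntax and pulling out a gradient looked roughly like this with the JuMP/MathProgBase API of that era (a rough sketch; the Rosenbrock objective is just a stand-in example):

```julia
using JuMP, MathProgBase

# The function has to be expressed through JuMP's macros rather than as plain Julia code.
m = Model()
@variable(m, x[1:2])
@NLobjective(m, Min, (1 - x[1])^2 + 100 * (x[2] - x[1]^2)^2)

# Gradients then come back through the NLP evaluator interface.
d = JuMP.NLPEvaluator(m)
MathProgBase.initialize(d, [:Grad])

xval = [0.5, 0.5]
g = zeros(2)
MathProgBase.eval_grad_f(d, g, xval)   # fills g with the gradient at xval
```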

mlubin · Aug 29 '16 14:08

Thanks for the comment @mlubin. Do you think that's true for all types of functions, dense and sparse? What would you say are the biggest limitations of the package (aside from syntax)?

tbreloff · Aug 29 '16 14:08

Functions with dense derivatives work okay as well. If you're just asking for the gradient then you'll get it back at a small constant times the cost of evaluating the function, as reverse mode promises. @jeff-regier recently did some tests on a pretty substantial code base and found that JuMP's reverse mode was about 10x slower than manually coded derivatives.

mlubin · Aug 29 '16 14:08

TaylorSeries.jl can compute Jacobians and Hessians as well, though that is not its primary purpose; still, I wouldn't classify it (or ForwardDiff.jl) as being unable to do anything beyond toy models. TaylorSeries.jl is intended to compute higher-order derivatives of functions of one or many independent variables through recursion formulas, e.g., to perform accurate integrations of ODEs.

From the description of what you would like a package to do, it is probably not ready: the API is rather primitive, though the performance is good.

lbenet · Aug 29 '16 15:08

There is also this: https://github.com/jrevels/ReverseDiffPrototype.jl

datnamer · Aug 29 '16 16:08

Since @datnamer brought it up: yeah, I'm working on a reverse-mode AD package for native Julia code. It's pretty feature-complete as it is, and I'd say the performance is really promising, but there is still a good bit of testing/documentation/optimization to do before it's release-ready.

[Forward-mode AD] doesn't seem practical for anything beyond toy problems. Unless I'm missing something?

In general, forward-mode is faster than reverse-mode for differentiating functions whose domains contain fewer dimensions than their range, and is faster in practice for a lot of elementary scalar functions. It's also useful in cases where reverse-mode would require more memory than is feasible. Furthermore, it can be leveraged as a component of reverse-mode implementations (both JuMP and ReverseDiffPrototype perform intermediary forward-mode AD during their forward pass).
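
As a small illustration of where forward mode is perfectly practical (a minimal sketch using ForwardDiff's gradient/jacobian entry points; the target functions are made up):

```julia
using ForwardDiff

# Low-dimensional domain: forward mode only needs a handful of dual-number sweeps.
f(x) = sum(sin, x) + prod(tan, x) * sum(sqrt, x)

x = rand(5)
g = ForwardDiff.gradient(f, x)    # ∇f(x), length 5

# A map from R^5 to R^5: domain and range have equal dimension, so forward
# mode is a reasonable choice for the Jacobian here too.
h(x) = x .* exp.(-x)
J = ForwardDiff.jacobian(h, x)    # 5x5
```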

It seems that the function g in g = grad(f) actually contains a forward pass, and graph construction, and a backward pass all in one. Clearly this is not optimal if the graph structure is unchanged in each iteration.

This is a necessity if you need to be able to differentiate arbitrary code, in which the graph structure could depend on input values. With https://github.com/jrevels/ReverseDiffPrototype.jl, I plan on eventually allowing users to easily specify that their functions are data-independent so that the graph can be cached (and, further down the line, unrolled and compiled to the GPU).

jrevels · Aug 29 '16 17:08

also relevant: https://github.com/dfdx/Espresso.jl

Evizero · Aug 29 '16 17:08

Thanks Christof. Espresso is really similar to what I was building in Transformations. I have a lot to review!!

tbreloff · Aug 29 '16 17:08

@jrevels "Toy problems" sounds worse than I intended. I mean more that I'd like a solution that can be scaled to complex deep learning models with many free parameters. One would presumably have a complex model where the learnable parameters are either SubArrays of a large parameter vector, or the main parameter vector is a CatView. I want super fast ways of populating a gradient vector that corresponds to the partial derivatives wrt each param, which could then be used by some solver/learner to update.

For many problems, the gradient formula doesn't change from one iteration to the next. If this is as fast as possible, I'll probably be happy. It seems as though your prototype and Espresso are attempting to tackle this, which is great news.

If it helps, this is the sort of design that I have in my head. Method/macro names can be changed... I'm just trying to illustrate concepts:

  • Each op/method/transformation has an input dimension I and an output dimension O. These are fixed relative to each other, and are parameters of the generated type.
  • The @op macro creates a new type and constructor(s) which build instances of that type. So an Affine(10, 5) actually constructs an Affine{1,1}(10, 5) where:
    size(x) == (10,)
    size(w) == (5, 10)
    size(b) == (5,)

The idea is that the code for the forward and backward passes depends on the types/dimensions... not on the actual sizes involved. This implies we could build a "computation graph blueprint" for both the forward and backward passes, and cache/reuse it for anything with the same dimensions (it's the same code whether the matrix is 5x10 or 10x10). One could chain together blueprints to make more complex blueprints, and many/most of the dimension parameters could be inferred/set by incoming and outgoing nodes.
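
To make that concrete, here's a hypothetical sketch of the kind of type the @op macro might generate for Affine (none of this exists yet; the hand-written gradients just show where a cached backward blueprint would plug in):

```julia
# Hypothetical output of `@op Affine`: the input/output *dimensions* (ranks) are
# type parameters, while the concrete sizes live in the fields.
struct Affine{I,O}
    w::Matrix{Float64}   # (nout, nin)
    b::Vector{Float64}   # (nout,)
end

Affine(nin::Int, nout::Int) = Affine{1,1}(randn(nout, nin), zeros(nout))

# Forward pass: the code depends only on the ranks {1,1}, not on nin/nout, so the
# same blueprint is reused whether the matrix is 5x10 or 10x10.
forward(t::Affine{1,1}, x::AbstractVector) = t.w * x .+ t.b

# Backward pass for this op; ∇y is the gradient flowing in from downstream.
function backward(t::Affine{1,1}, x::AbstractVector, ∇y::AbstractVector)
    ∇w = ∇y * x'       # gradient w.r.t. the weight matrix
    ∇b = copy(∇y)      # gradient w.r.t. the bias
    ∇x = t.w' * ∇y     # gradient passed to the upstream node
    return ∇w, ∇b, ∇x
end
```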

This design seems nice because you could construct new graphs and subgraphs, add to or delete from existing graphs, or swap out sections of graphs, all without recomputing the gradient function or recompiling anything.

I get the feeling that you can only accomplish this design by working with raw AST to generate specialized types and methods. It seems Espresso might have some of the needed toolset... I'll need to review the other options in more detail to have a better opinion.

Thanks for all the comments, and the great packages. I'd love to hear thoughts on this design (and please be brutal as long as you keep it constructive ;) If you want to see some of my initial experiments, check out this

tbreloff · Aug 29 '16 19:08