
RFC: cleaning up gradient and jacobian

Open ExpandingMan opened this issue 1 year ago • 12 comments

The code for gradient and jacobian is a bit messy right now. This PR is an attempt to move toward a clear, consistent interface.

Here are some questions about this interface that I'm looking to address here:

Should this package maintain these functions in the first place?

I can see good arguments for both positions. On the one hand, there's a fair amount of specialized code in here, so offloading it to another package is a little dubious. On the other hand, DifferentiationInterface.jl already wraps some lower-level methods, and maybe it can achieve this more generically.

What should the return type be?

In the current code this seems a bit confused: sometimes it returns arrays, other times tuples. My approach here is to return appropriately typed and shaped arrays when the input is an array, and to leave the result as a tuple otherwise. (NOTE: I realize that, as of writing, I have deleted some of the special-case handling in jacobian; I was looking for feedback on what the intended output was before I attempt to fold those cases back in.)
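
To make the convention concrete, a couple of illustrative calls under the proposed behavior (not a spec, and subject to the open questions below):

    gradient(Forward, x -> sum(2 .* x), ones(2))      # Array input  => Vector output
    gradient(Forward, x -> x[1] * x[2], (1.0, 2.0))   # Tuple input  => stays a Tuple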

Functions for value and gradient?

It would be nice to have these. Unlike other autodiff packages, Enzyme already has an interface for this in terms of the NoNeed types. If the desired approach for this package is to have separate functions such as value_and_gradient, a simple internal interface could be made to generalize what's here. The alternative would be to have users explicitly pass, e.g., DuplicatedNoNeed.
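
For reference, a minimal sketch of what a value_and_gradient could look like on top of the existing autodiff entry point, assuming a scalar-returning function and an array input (the WithPrimal modes already return the primal alongside the derivatives; the function name here is hypothetical):

    using Enzyme

    function value_and_gradient_sketch(f, x::AbstractArray)
        dx = zero(x)
        # ReverseWithPrimal returns (derivatives, primal); the gradient is
        # accumulated into the shadow dx
        _, val = autodiff(ReverseWithPrimal, f, Active, Duplicated(x, dx))
        return val, dx
    end

    value_and_gradient_sketch(x -> sum(abs2, x), [1.0, 2.0, 3.0])  # (14.0, [2.0, 4.0, 6.0])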

Anyway, this PR is obviously not in a merge-ready state, but I was hoping to get some guidance from maintainers on what they see this looking like and whether they'd embrace a comprehensive cleanup of this, before I put much work into it. Thanks!

ExpandingMan avatar Aug 15 '24 22:08 ExpandingMan

Running the tests should demonstrate that the extra Jacobian handling is needed for proper handling of multi-dimensional inputs with multi-dimensional outputs.

wsmoses avatar Aug 16 '24 01:08 wsmoses

I've worked on this a bit more. By generalizing the behavior that already had unit tests, I am taking the following approach:

For a function

$$y_{ij} = f(x_{kl})$$

the Jacobian is defined as the array

$$J_{ijkl} = \partial_{kl} \, y_{ij}$$

That is, the Jacobian inherits its shape from both the output and the input: the leading indices follow the output and the trailing indices follow the input. In my code this generalizes to more indices in the obvious way. This index ordering is imposed to maintain consistency with existing behavior and for the sake of efficient concatenation of column-major arrays.
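
As a concrete instance of that rule (assuming this PR's convention; illustrative, not a doctest):

    using Enzyme

    f(x) = ones(4, 5) .* sum(x)   # maps a 2×3 input to a 4×5 output
    x = ones(2, 3)
    J = jacobian(Forward, f, x)
    # size(J) == (4, 5, 2, 3), with J[i, j, k, l] = ∂y[i, j] / ∂x[k, l]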

The main place where this differs from the existing implementation is in the handling of higher-rank outputs such as matrices. I don't see these cases in the current unit tests, so it's not clear to me whether this is the desired behavior, though it seems to me the most straightforward generalization.

The Reverse functions are trickier, so I've yet to do them.

As you may have guessed if you looked carefully, my principal motivation for doing this was that I needed special handling of static arrays, which now seems fixed.

Please let me know if you have any interest in this. If you are against any changes here, my alternative for dealing with StaticArrays would probably be dedicated definitions of jacobian there.

ExpandingMan avatar Aug 16 '24 18:08 ExpandingMan

Just to clarify, I of course realize this is not in a remotely mergeable state; I was partially exploring whether you are receptive to reworking these functions. I will do more work on it in the next couple of days unless you indicate you are not interested.

I by now have a pretty clear idea of what I'd like to happen when the output is either a scalar or an array type; generalizing this to more generic outputs is much harder. Note that, in an unrelated project, I have generalized differentiation to always work the same way regardless of the dimensions, thus unifying derivative, gradient, and jacobian, a sample of which you can see here. However, that's only really robust if you assume the output is either a scalar or an array, so I'm not proposing to do that here. I will have to work carefully through the non-array cases.

ExpandingMan avatar Aug 20 '24 22:08 ExpandingMan

Can you split this PR into a non-functional refactor change, and then a functionality change?

wsmoses avatar Aug 24 '24 17:08 wsmoses

Sure, but I still have a little bit of a ways to go here before this is really in a merge-able state. I'll let you know when I feel it's ready for a full review.

I'm basically trying to make the external API changes as minimal as possible while still making the output types consistent, which should give the best case for merging this. Obviously I've started to take some liberties with the internal organization here but much of that can be easily changed (i.e. function names, files).

As of now, the most significant real change here is that gradient returns arrays when appropriate, whereas before it mostly returned tuples (except in Reverse mode, which I think already mostly had the new behavior). This would probably technically be breaking (although the output types as they stand are rather confused, so it doesn't feel super breaking :shrug: )

ExpandingMan avatar Aug 24 '24 17:08 ExpandingMan

I think I've now achieved most of what I wanted for this PR, but I still have to go thoroughly through all the tests and add some more. (I also have to restore hvp, which I think is simple...)

There are a few major issues that going through this has raised:

  • The interface for returning primal values is quite confused because of the discrepancy between forward and reverse mode. You expressed openness to changing this so I figure that'll be addressed comprehensively in some future PR.
  • Lack of reverse-mode autodiff methods/cases seems like a problem. I attempted to replace the reverse-mode jacobian with existing autodiff methods, but that opened up a rabbit hole: some of my attempts errored out, and others returned results I couldn't make sense of. I think the fact that dedicated code is needed for jacobian indicates that more work is needed on autodiff, since users are liable to want to implement the pieces that go into computing a Jacobian themselves. It's not clear to me exactly what would need to be added. I did some poking around on my own, but I'm also not sure which parts of what you've done in jacobian are necessary for efficiency, so I didn't try very hard.
  • I appeal to everyone to reconsider whether gradient and jacobian are required to work in general cases, such as those that do not involve AbstractArray at all. A lot of what goes on in those functions has to do with assembling and re-shaping the output, and it seems futile to decide how that should be done for objects we've never seen before, which have completely unknown interfaces and may or may not have their own concepts analogous to shape and duality. What's currently here might kind of work sometimes, but in my opinion it's not in a remotely maintainable state (before or after this PR) and will surely come back to bite everyone if not removed.

I'll post back when tests are all passing and I'm more confident that this is something that I think ought to be merged. At that point we can address code organization issues, assuming you are interested, because again, I've begun to take some liberties with organization which I expect will be varying degrees of controversial.

ExpandingMan avatar Aug 26 '24 22:08 ExpandingMan

Reverse-mode gradient should work for arbitrary types without any issue. Though I do agree that anything where we need to build a one-hot seed (e.g. jacobian or forward-mode gradient) is intentionally limited in what types we can successfully handle through the syntactic sugar.
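
To illustrate the one-hot seeding in question, here is a hypothetical sketch of a forward-mode Jacobian for vector inputs (not Enzyme's actual internals):

    using Enzyme

    # hypothetical helper: an array of zeros with a 1 at (linear) index i
    onehot(x::AbstractArray, i) = setindex!(zero(x), one(eltype(x)), i)

    function jacobian_sketch(f, x::AbstractVector)
        # one forward pass per input index; this is exactly the step that
        # requires constructing one-hot tangents for the input type
        cols = [autodiff(Forward, f, Duplicated(x, onehot(x, i)))[1] for i in eachindex(x)]
        return reduce(hcat, cols)   # columns indexed by input, rows by output
    end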

wsmoses avatar Aug 26 '24 23:08 wsmoses

Ok, here is a little summary of test cases before and after this PR:

    x = 1.0
    g1 = gradient(Forward, x -> 2*x, x)  # PR(Number); main(Number)
    g2 = gradient(Reverse, x -> 2*x, x)  # PR(Number); main(Number)
    j1 = jacobian(Forward, x -> 2*x, x)  # PR(Number); main(Number)
    j2 = jacobian(Reverse, x -> 2*x, x)  # PR(Number); main(error)
    g3 = gradient(Forward, x -> [x, 2*x], x)  # PR(Vector); main(Vector)
    #g4 = gradient(Reverse, x -> [x, 2*x], x) # PR(error in autodiff); main(error in autodiff)
    j3 = jacobian(Forward, x -> [x, 2*x], x)  # PR(Vector); main(Vector)
    j4 = jacobian(Reverse, x -> [x, 2*x], x)  # PR(Vector); main(error)

    x = ones(2)
    g5 = gradient(Forward, x -> sum(2*x), x)  # PR(Vector); main(Tuple)
    g6 = gradient(Reverse, x -> sum(2*x), x)  # PR(Vector); main(Vector)
    g7 = gradient(Forward, x -> 2*x, x)  # PR(Matrix); main(Tuple{Vector,Vector})
    #g8 = gradient(Reverse, x -> 2*x, x)  # PR(error in autodiff); main(error in autodiff)
    j5 = jacobian(Forward, x -> sum(2*x), x)  # PR(Vector); main(Vector)
    j6 = jacobian(Reverse, x -> sum(2*x), x)  # PR(Vector); main(error)
    j7 = jacobian(Forward, x -> 2*x, x)  # PR(Matrix); main(Matrix)
    j8 = jacobian(Reverse, x -> 2*x, x)  # PR(Matrix); main(error)
    
    j9 = jacobian(Forward, x -> x .* x', x)  # PR(rank 3); main(rank 3)
    j10 = jacobian(Reverse, x -> x .* x', x)  # PR(rank 3); main(error)

The only big difference (at least for relatively low rank arrays) is that gradient now returns a Vector instead of a Tuple where appropriate. This is technically breaking but I think it's fair to call it a bug since the behavior was not even consistent between Forward and Reverse. A bunch of other cases were fixed, some of them were trivial.

I realize I have quite a lot of changes here for something that will probably be somewhat of an uphill battle to justify, seeing as the actual results are so similar, so let me try to go through them.

Function Reorganization

The main idea here is that gradient and jacobian are really just re-arranging the output of some lower-level autodiff methods. I have created the derivative function, which does whatever gradient and jacobian do before the output gets re-arranged. It's pretty trivial, but still, in my opinion, worthy of a function. The name was arbitrary and not necessarily intended to indicate that I believe it should be part of a public API.
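
A rough sketch of that layering (the names derivative and gradient_output_forward come from this PR, but the signatures here are illustrative, reusing the hypothetical onehot helper from the sketch above):

    # `derivative`: raw autodiff calls, one per input direction, with nothing
    # assembled or reshaped yet
    derivative(f, x::AbstractArray) =
        [autodiff(Forward, f, Duplicated(x, onehot(x, i)))[1] for i in eachindex(x)]

    # `gradient`/`jacobian` then merely re-arrange what `derivative` returns
    gradient_forward(f, x) = gradient_output_forward(derivative(f, x), x)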

For forward mode, gradient and jacobian are now mere aliases. This follows from your requirement that these should work for arbitrary inputs; in that case I don't see why they should be separate. The gradient_output_forward and gradient_output_reverse functions are what re-arrange the output of derivative into an appropriate form. They are decidedly NOT intended to be in any way public, but they may be quite useful in extensions (and are used in the StaticArrays extension).
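
For array outputs, the re-arranging step could look something like this (illustrative; the real gradient_output_forward also has to handle scalar and non-array outputs):

    function gradient_output_forward(derivs, x::AbstractArray)
        # stack the per-direction results, then restore the input's shape on
        # the trailing axes, giving the output-shape-first layout from above
        reshape(reduce(hcat, vec.(derivs)), size(first(derivs))..., size(x)...)
    end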

I had intended reverse mode to be completely analogous, but the lack of comparable autodiff methods made that quite difficult. As I've said, I think at the very least there should be better documentation there; it's still not clear to me why a dedicated gradient or jacobian would be needed for reverse mode beyond the appropriate autodiff methods simply being unavailable. What's here should be able to serve as a skeleton for a treatment analogous to forward mode if and when that becomes possible.

Files

In case it's not abundantly apparent from the diff, I merely moved all the autodiff stuff into autodiff.jl and all the gradient and jacobian stuff into diffinterface.jl. That name was chosen arbitrarily; I didn't mean to imply any more than a passing resemblance to DifferentiationInterface.jl.

Tests

I've added tests so that, between what's there and what's new, all my examples above should be covered.

As of writing all tests pass for me except doctests (because of some trivial changes), so I have slightly more to do.

I haven't looked at old Julia versions yet, and to be honest I'm kind of hoping 1.11 will officially land before this gets merged so we won't have to care, because 1.10 will be the LTS.

StaticArrays

Basically the thing that got me into this. For Forward mode, gradient and jacobian should now work just fine without any allocations. Before this PR there were either spurious allocations or failures. Reverse mode is still broken, but that's a much more difficult fix.
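
A hedged example of the behavior this targets (whether it is truly allocation-free depends on the Julia and Enzyme versions):

    using Enzyme, StaticArrays

    f(x) = SVector(x[1]^2, x[1] * x[2])
    x = SVector(1.0, 2.0)
    J = jacobian(Forward, f, x)   # with the fix: correct result, no spurious heap allocations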

Sorry, I know I got a little carried away here, but I really think the re-organization of the functions will make life easier going forward. I ultimately did not do anything that I thought was all that radical or controversial, so I'm hoping this effort won't seem completely off base. Obviously a lot of this stuff, like function names, is very easy to change. That's my pitch for it, thanks!

ExpandingMan avatar Aug 27 '24 23:08 ExpandingMan

I just came across this PR, which will probably come in handy for DifferentiationInterface! Didn't look at the code itself but I wanted to ask: would you mind throwing in an in-place Jacobian while you're at it?

gdalle avatar Aug 29 '24 15:08 gdalle

To make this easier to review, can this be split up into three parts?

  1. Moving code into files [without changing contents]
  2. Changing implementations (without breaking)
  3. ABI breaking stuff (e.g. forward mode tuple).

3 is going to have to wait until the next breaking release to merge, since we do have folks who expect the current behavior, and it would be wise to discuss with everyone to confirm.

1 and 2 should be able to go in quickly right now, but splitting them up will let me actually review the content of the change; otherwise it's unclear what's happening when code is both moved and modified.

wsmoses avatar Aug 29 '24 15:08 wsmoses

To make this easier to review, can this be split up into three parts?

I will try my best. 2 and 3 might be a little tricky, but since the only blatantly breaking thing is the tuple, I should be able to manage it.

In retrospect I shouldn't have moved the files, I can definitely appreciate that it makes the diffs impossible to look at.

I just came across this PR, which will probably come in handy for DifferentiationInterface! Didn't look at the code itself but I wanted to ask: would you mind throwing in an in-place Jacobian while you're at it?

For forward mode, I think probably yes, but for reverse mode it'll be difficult, and I don't want to attempt it as I am only gradually becoming familiar with the internals. Some of that, I think, hinges on whether we are likely to get more autodiff methods... @wsmoses, can you comment on whether you agree that more low-level autodiff methods ought to replace the internals of reverse-mode gradient and jacobian as they currently stand?
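
For the forward-mode case, a hypothetical in-place signature could be as simple as the following (illustrative only, reusing the one-hot sketch from earlier; nothing like this exists in the PR):

    function jacobian!(J::AbstractMatrix, f, x::AbstractVector)
        for k in eachindex(x)
            # overwrite column k with the directional derivative along e_k
            J[:, k] .= autodiff(Forward, f, Duplicated(x, onehot(x, k)))[1]
        end
        return J
    end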

ExpandingMan avatar Aug 29 '24 15:08 ExpandingMan

It depends on context, which I've been waiting to review until after the non-RFC change is made.

It definitely would be great to use autodiff where we can to build the sugar cases. That said, using Enzyme's lower-level methods (which are used to build autodiff) is totally fine for Enzyme itself to do, especially since it may provide more tuning/options than we expose in autodiff for ease of use.

wsmoses avatar Aug 29 '24 15:08 wsmoses

I believe this has since been superseded by API fixes on main and a separate follow-up PR to reorganize files [if you'd be so kind as to open that in a different PR].

wsmoses avatar Sep 18 '24 18:09 wsmoses

Yeah in my opinion there's still a lot that could be done, but certainly at this point it would make far more sense to do it in a different PR, thanks.

ExpandingMan avatar Sep 18 '24 18:09 ExpandingMan