DifferentiationInterface.jl

Add the GTPSA.jl backend

Open mattsignorelli opened this issue 1 year ago • 17 comments

#316

mattsignorelli avatar Jun 24 '24 16:06 mattsignorelli

@mattsignorelli thank you for the contribution! I fixed a few things and defined `AutoGTPSA` in the main package for now: types should not be defined in package extensions because that makes them hard to import. The main part missing is the `pushforward` operator. If you don't have a dedicated mechanism for it, you can always deduce it from `derivative`, `gradient` or `jacobian`, depending on the function type (array/scalar -> array/scalar). It will be slow, but it will work.
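For illustration, here is a minimal self-contained sketch of that fallback idea; `deriv`, `jac` and `naive_pushforward` are hypothetical stand-ins (built here on naive finite differences), not DI or GTPSA functions:

```julia
# Illustrative stand-ins for full-derivative operators (naive finite differences):
deriv(f, x::Number; h=1e-8) = (f(x + h) - f(x)) / h
jac(f, x::AbstractVector; h=1e-8) =
    reduce(hcat, [(f(x .+ h .* (1:length(x) .== j)) .- f(x)) ./ h for j in 1:length(x)])

# The deduced pushforward, dispatching on the input type (slow but correct):
naive_pushforward(f, x::Number, dx::Number) = deriv(f, x) * dx                # scalar chain rule
naive_pushforward(f, x::AbstractVector, dx::AbstractVector) = jac(f, x) * dx  # J * dx
```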

gdalle avatar Jun 25 '24 06:06 gdalle

Codecov Report

Attention: Patch coverage is 0% with 416 lines in your changes missing coverage. Please review.

Project coverage is 54.75%. Comparing base (d51fc0a) to head (77e49d6).

| Files with missing lines | Patch % | Lines |
|--------------------------|---------|-------|
| ...ace/ext/DifferentiationInterfaceGTPSAExt/onearg.jl | 0.00% | 287 Missing :warning: |
| ...ace/ext/DifferentiationInterfaceGTPSAExt/twoarg.jl | 0.00% | 119 Missing :warning: |
| ...face/ext/DifferentiationInterfaceGTPSAExt/utils.jl | 0.00% | 9 Missing :warning: |
| ...erfaceGTPSAExt/DifferentiationInterfaceGTPSAExt.jl | 0.00% | 1 Missing :warning: |

:exclamation: There is a different number of reports uploaded between BASE (d51fc0a) and HEAD (77e49d6). Click for more details.

HEAD has 38 fewer uploads than BASE.

| Flag | BASE (d51fc0a) | HEAD (77e49d6) |
|------|----------------|----------------|
| DI   | 41             | 11             |
| DIT  | 10             | 2              |
Additional details and impacted files

```diff
@@             Coverage Diff             @@
##             main     #329       +/-   ##
===========================================
- Coverage   98.51%   54.75%   -43.76%     
===========================================
  Files         108      103        -5     
  Lines        4297     4597      +300     
===========================================
- Hits         4233     2517     -1716     
- Misses         64     2080     +2016     
```
| Flag | Coverage | Δ |
|------|----------|---|
| DI   | 52.44% <0.00%> | (-46.14%) :arrow_down: |
| DIT  | 60.01% <ø>     | (-38.37%) :arrow_down: |

Flags with carried forward coverage won't be shown. Click here to find out more.

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

codecov-commenter avatar Jun 25 '24 06:06 codecov-commenter

Sorry for the delay on this; I've had my hands tied with various other projects. Two weeks ago I pushed major updates, fixes, and finishing touches to GTPSA.jl, which is now in its first stable release. I'll get back to this shortly.

mattsignorelli avatar Aug 06 '24 15:08 mattsignorelli

The structure of the tests has changed a little since the last time you checked, can you merge main into this PR?

gdalle avatar Aug 07 '24 06:08 gdalle

OK, everything looks good. I had to release a new GTPSA version (v1.1) to allow its `jacobian` to accept `AbstractArray`, and also to define the broadcastable trait. The GTPSA backend now passes all `test_differentiation` tests. I commented out the coverage tests in the workflow YAML because they were erroring due to a token I don't have set up on my own fork.

mattsignorelli avatar Aug 08 '24 01:08 mattsignorelli

Sorry, accidentally hit the close button instead of submit... time for bed.

mattsignorelli avatar Aug 08 '24 01:08 mattsignorelli

I submitted a PR to ADTypes here: https://github.com/SciML/ADTypes.jl/pull/78

mattsignorelli avatar Aug 08 '24 13:08 mattsignorelli

Let me just take the time to review here; I haven't yet looked at your code. If all is well on the DI front, we can add `AutoGTPSA` to ADTypes! Thank you so much for your contribution.

gdalle avatar Aug 08 '24 13:08 gdalle

@mattsignorelli before I review, I updated the tests so that we now have full code coverage information. Can you take a look and see if you are able to increase the coverage?

One thing you need to know is that in-place operators like `pushforward!(f, dy, backend, x, dx)` are not tested on scenarios where the target `dy` is not mutable. So the `if y isa Number` branch in this operator is not needed.

Another thing you may want to exploit to reduce code duplication is the redundancy between operators: `derivative` is just `pushforward` with `dx = 1`, and `gradient` is just `pullback` with `dy = 1`. They all fall back onto one another inside DI, so it is theoretically sufficient to implement just `value_and_pushforward`. If custom implementations of other operators yield a speed boost, then let's keep them; but if all you do in `derivative` is essentially call `pushforward`, you can get rid of it (for example).
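As a toy illustration of those identities (not DI code), with a hand-written pushforward and pullback for a concrete function:

```julia
# f : R^n -> R, with hand-written pushforward/pullback for illustration.
f(x) = sum(abs2, x)
toy_pushforward(x, dx) = 2 * sum(x .* dx)   # J(x) * dx (a scalar here)
toy_pullback(x, dy) = 2 .* x .* dy          # J(x)' * dy

# gradient is just pullback with dy = 1:
toy_gradient(x) = toy_pullback(x, 1.0)

# derivative (scalar input) is just pushforward with dx = 1:
g(t) = t^2
toy_scalar_pushforward(t, dt) = 2t * dt
toy_derivative(t) = toy_scalar_pushforward(t, 1.0)
```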

gdalle avatar Aug 09 '24 13:08 gdalle

Thanks for looking this over!

For second order, because GTPSA does not allow nesting, separate operators must be written. For the first-order operators `gradient` and `jacobian`, there is a large slowdown and a significantly higher number of allocations when using only `pushforward`: to construct the Jacobian or gradient, several pushforwards with different seeds have to be propagated through the function. The TPS type is mutable, so it is much slower to repeatedly evaluate the function for different seeds than to evaluate it once, treating the TPSs as variables, and then extract the first-order derivatives. I have removed `derivative`, however, since using `pushforward` for it is equivalent.

I have also removed unreachable branches (where `y` is not mutable) from the in-place operators, following the code coverage report, and simplified the HVP code (since the best GTPSA can really do is construct the entire Hessian and multiply by the vector).

The rest of the uncovered code is because the user can optionally specify a custom `Descriptor`, which lives in the `AutoGTPSA` type. This allows users to choose custom truncation orders for different variables (e.g. calculate the Hessian of $f(x,y)$ but assume no $x^2$ terms in the resulting truncated power series, so $\partial^2 f/\partial x^2$ is zero). This can significantly improve speed at the cost of lower accuracy. The feature is only applicable in highly specific use cases and I don't expect it to be used very often, but it should be available since GTPSA provides it. As for code coverage in this case: the number of input variables and the truncation order(s) have to be specified in the `Descriptor` for each tested function, and I'm not sure how to do this for every test case?
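As a rough sketch of such a custom `Descriptor` (assuming GTPSA's documented per-variable-order constructor and the `descriptor` field of `AutoGTPSA`; both names are taken from the respective docs, so treat this as unverified):

```julia
using GTPSA

# x truncated at order 1, y at order 2, max order 2 overall: the x^2 monomial
# (and hence the d²f/dx² coefficient) is dropped from the resulting power series.
d = Descriptor([1, 2], 2)
backend = AutoGTPSA(; descriptor=d)
```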

mattsignorelli avatar Aug 09 '24 19:08 mattsignorelli

Sorry for taking so long to get back to you @mattsignorelli! I have brought your PR up to speed with the rest of the package, and taken into account the new version of ADTypes, which includes `AutoGTPSA`. The last thing I'd like to do before merging is clean up the code a little bit by removing duplication in favor of clear auxiliary functions. For example, my last commit cuts a lot of code from `pushforward` thanks to:

```julia
function initialize!(xt::TPS, x::Number, dx::Number)
    xt[0] = x   # constant (zeroth-order) part
    xt[1] = dx  # first-order coefficient of the first variable
    return xt
end

function initialize!(xt::AbstractArray{<:TPS}, x::AbstractArray, dx::AbstractArray)
    # seed each TPS element from the corresponding value and tangent
    for i in eachindex(xt, x, dx)
        initialize!(xt[i], x[i], dx[i])
    end
    return xt
end
```

Can you maybe create similar functions to shorten the other operators? Just like TPS initialization, extraction from TPS into numbers and arrays could also be streamlined.
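For instance, a hypothetical extraction counterpart to `initialize!` (illustrative only, not code from the PR) might look like:

```julia
using GTPSA

# Pull the value and first-order coefficient back out of a TPS. Note that
# yt[1] reads the coefficient of the *first* variable, with the same
# single-active-variable caveat discussed just below.
extract(yt::TPS) = (yt[0], yt[1])

function extract!(y::AbstractArray, dy::AbstractArray, yt::AbstractArray{<:TPS})
    for i in eachindex(y, dy, yt)
        y[i], dy[i] = extract(yt[i])
    end
    return y, dy
end
```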

gdalle avatar Sep 18 '24 08:09 gdalle

Thanks for your work on this!

```julia
function initialize!(xt::TPS, x::Number, dx::Number)
    xt[0] = x
    xt[1] = dx
    return xt
end
```

I think we need to be careful here. This would only apply to single-variable functions: `xt[1] = dx` basically sets the dual part of the first variable to `dx`. However, if there are two variables, then to set the dual part of the second variable you would need to do `xt[2] = dx`. E.g.:

```julia
julia> d = Descriptor(2, 1); # two variables to first order

julia> f = TPS(use=d);

julia> f[0] = 1; f[1] = 2; f[2] = 3;

julia> print(f) # corresponds to f(dx1, dx2) = 1 + 2dx1 + 3dx2
TPS64:
 Coefficient                Order   Exponent
  1.0000000000000000e+00      0      0   0
  2.0000000000000000e+00      1      1   0
  3.0000000000000000e+00      1      0   1
```

Also, because the TPS type is mutable, we really only want to evaluate the function once. I'll make some changes to the updates, try to simplify and remove code duplication using utility functions, and push for review.

mattsignorelli avatar Sep 18 '24 13:09 mattsignorelli

I think I understand what Tangents are now. Just to check: to take the pushforward of a 3-variable function with seed `[dx1, dx2, dx3]`, your Tangents object will be `Tangents{1, Vector{Float64}}(([dx1, dx2, dx3]))`? And so Tangents is there to store multiple seeds?

mattsignorelli avatar Sep 18 '24 14:09 mattsignorelli

> Tangents is there to store multiple seeds?

The expression "multiple seeds" is a bit unclear. For instance, for a function f(x::AbstractVector), DI considers dx::AbstractVector to be "a single seed". It used to be possible to call pushforward(f, backend, x, dx) but that is no longer the case. From v0.6 you will need to use the wrapper pushforward(f, backend, x, Tangents(dx)), and recover a Tangents{1} as the result. On the other hand, if you have several seeds dxa, dxb, dxc (perhaps clearer than numbers?) inside Tangents(dxa, dxb, dxc), then your batch size is 3 and you will get a Tangents{3} as the result. Sorry if this is not well documented, you're the first outside person to ever contribute an extension so I didn't plan for that case ^^ But if there are unclear aspects I should definitely improve the docs.

gdalle avatar Sep 18 '24 14:09 gdalle

> I think we need to be careful here. This would only apply to single-variable functions

That's the setting considered in DI. Even with all the latest changes, there will only ever be one active variable `x`, either in `f(x)` or in `f!(y, x)`. Of course, that variable can itself be an array.

gdalle avatar Sep 18 '24 14:09 gdalle

Ah OK, I think I understand better. I'm admittedly not super familiar with the conventions used by most AD packages in Julia, only with GTPSA's, so I'm trying to connect the dots here. Let me explain how it works in GTPSA, and then maybe you can help me see how to fit it into DI:

Basically, GTPSA calculates truncated power series in the variables, assuming the variables are small. For example, to calculate the Jacobian of a function $F: \mathbb{R}^n \to \mathbb{R}^m$, you would define a `Descriptor(n, 1)` for $n$ variables to order 1, and then use `x = vars(d)`, which gives you a vector that in the usual AD language is actually $[\Delta x_1, \Delta x_2, \ldots, \Delta x_n]$ ("variables" in GTPSA are small, so this might be better understood as `dx = vars(d)`). Then, to calculate the multivariable Taylor expansion of $F$ around some point $\vec{a}$, you would just do `y = F(a .+ x)`. Here `y` is the multivariable Taylor expansion, and you can extract the Jacobian from its first-order monomial coefficients.

Alternatively, one could do this one variable at a time: calculate $F([a_1+\Delta x_1, a_2, \ldots, a_n])$, then $F([a_1, a_2+\Delta x_1, \ldots, a_n])$, etc., and extract the first-order part of the resulting TPS for each component of each calculation. However, this will be slower than propagating all of $F(\vec{a}+[\Delta x_1, \Delta x_2, \ldots, \Delta x_n])$ at once, because TPSs are mutable and we only want to evaluate the function once.
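A runnable sketch of the first approach, using `vars` and GTPSA's `jacobian` as described in this thread (exact signatures per the GTPSA docs):

```julia
using GTPSA

F(x) = [x[1]^2 + x[2], sin(x[1])]
d  = Descriptor(2, 1)     # n = 2 variables, truncated at order 1
dx = vars(d)              # the "small" variables [Δx1, Δx2]
a  = [0.5, 1.0]
y  = F(a .+ dx)           # truncated power series of F around a
J  = GTPSA.jacobian(y)    # read the Jacobian off the first-order coefficients
```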

mattsignorelli avatar Sep 18 '24 17:09 mattsignorelli

That makes sense, thank you. But then I think the `initialize!` function I wrote earlier is fine? At least it passes the tests ^^ Also, this makes me wonder whether we could compute HVPs by considering both `x` and `dx` as variables somehow. With higher-level differentiation utilities like those in GTPSA, there must be something better than computing the full Hessian first.

gdalle avatar Sep 19 '24 07:09 gdalle

OK, sorry for the big delay; I have been completely consumed by my other project, but I'm ready to get back to this now.

> Also, this makes me wonder whether we could compute HVPs by considering both `x` and `dx` as variables somehow. With higher-level differentiation utilities like those in GTPSA, there must be something better than computing the full Hessian first.

That's a great idea. I'll look into this and see if I can come up with anything.

I'll take a look at the latest code and dev guide to reorient myself, and be in touch soon.

mattsignorelli avatar Dec 18 '24 13:12 mattsignorelli

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 97.72%. Comparing base (59dc865) to head (3473422). Report is 1 commit behind head on main.

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##             main     #329      +/-   ##
==========================================
+ Coverage   97.57%   97.72%   +0.15%     
==========================================
  Files         115      119       +4     
  Lines        5646     6028     +382     
==========================================
+ Hits         5509     5891     +382     
  Misses        137      137              
```
| Flag | Coverage | Δ |
|------|----------|---|
| DI   | 98.81% <100.00%> | (+0.12%) :arrow_up: |
| DIT  | 95.39% <ø>       | (ø) |

Flags with carried forward coverage won't be shown. Click here to find out more.

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

codecov[bot] avatar Jan 06 '25 13:01 codecov[bot]

@gdalle This should be ready soon. For HVP, I'm having difficulty understanding the tangents. For example, one of the tests gives a 6x6 Hessian matrix, but the tangent is a scalar number. Shouldn't the tangent be a vector?

mattsignorelli avatar Jan 07 '25 19:01 mattsignorelli

OK, other than `hvp`, we are pretty much done; all other tests pass. To achieve full code coverage, I'm adding the kwarg `unsafe_inbounds` to GTPSA's getter functions, so that I can run tests with `AutoGTPSA{Descriptor}` and just let the Descriptor have many variables to second order (before, I was only running tests with `AutoGTPSA{Nothing}`). Once this is released I will bump the version.

I'm having a bit of trouble with `hvp`. It's basically the same code that worked a few months ago, but now it's showing disagreements; any help or clarification on my comment above would be much appreciated :) On that note, I looked into whether there is a faster way to calculate the HVP without materializing the full Hessian, and unfortunately, because TPSs cannot be nested, there isn't really any option other than materializing the full Hessian. There are some other ways I found, but none of them are faster.
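A sketch of that full-Hessian approach (assuming GTPSA's `hessian` extraction for a scalar TPS, per the GTPSA docs; treat the exact call as unverified):

```julia
using GTPSA

f(x) = x[1]^2 * x[2] + x[2]^3
d  = Descriptor(2, 2)     # 2 variables, truncated at order 2
dx = vars(d)
a  = [1.0, 2.0]
yt = f(a .+ dx)           # second-order power series of f around a
H  = GTPSA.hessian(yt)    # materialize the full 2x2 Hessian
v  = [0.3, -0.7]
hv = H * v                # Hessian-vector product
```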

mattsignorelli avatar Jan 07 '25 21:01 mattsignorelli

Actually, I realized I made a mistake about which tests were being run when I made my previous comment. I found the problem.

mattsignorelli avatar Jan 08 '25 14:01 mattsignorelli

This is ready for merge now

mattsignorelli avatar Jan 08 '25 16:01 mattsignorelli

thanks, I'll take a look!

gdalle avatar Jan 11 '25 21:01 gdalle

Hi, any updates on this?

mattsignorelli avatar Jan 23 '25 17:01 mattsignorelli

Note that you can commit the changes in a batch instead of one by one, by going to the "Files changed" tab

gdalle avatar Jan 23 '25 18:01 gdalle

OK thank you for your thorough review!

The only remaining points to check are:

  1. Whether the documentation of "variables" is sufficient
  2. Whether `gradient` can take `Any`
  3. Clarification on the `Context` question

mattsignorelli avatar Jan 23 '25 20:01 mattsignorelli

Documentation of "variables" is sufficient

Yeah, I think the link to the GTPSA docs is enough. Btw, I was wondering what the point of `include_params=true` is?

> Whether `gradient` can take `Any`

For reverse-mode backends, yes: `gradient` applies to an arbitrary struct. For forward-mode backends, we can only guarantee its behavior on `AbstractArray` (that's what ForwardDiff offers). So typing `x::AbstractArray` is fine.

> Clarification on the `Context` question

Yeah, I think you should replace `Context` with `Constant` everywhere.

gdalle avatar Jan 23 '25 20:01 gdalle

> Yeah, I think the link to the GTPSA docs is enough. Btw, I was wondering what the point of `include_params=true` is?

GTPSA also allows one to specify differentials for "parameters", which can be truncated at a different order than the "variables". The `include_params` kwarg just ensures that the result includes the partial derivatives for those differentials.

The distinction between "variables" and "parameters" exists because you might have, say, a 3rd-order map representing the transport of a particle beam through a machine ("variables" are deviations from the reference particle orbit), and then want to optimize certain higher-order partial derivatives of that map with respect to "parameters" kept to just first order.
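A sketch assuming GTPSA's documented `Descriptor(nv, mo, np, po)` constructor (names per the GTPSA docs):

```julia
using GTPSA

# 2 "variables" up to order 3, plus 1 "parameter" truncated at first order:
# partial derivatives w.r.t. the parameter are then only available to order 1.
d = Descriptor(2, 3, 1, 1)   # nv, mo, np, po
```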

> For reverse-mode backends, yes: `gradient` applies to an arbitrary struct. For forward-mode backends, we can only guarantee its behavior on `AbstractArray` (that's what ForwardDiff offers). So typing `x::AbstractArray` is fine.

Ah, I understand now. I'll remove these lines for 100% coverage then

> Yeah, I think you should replace `Context` with `Constant` everywhere.

OK!

mattsignorelli avatar Jan 23 '25 20:01 mattsignorelli

This should be ready to go

mattsignorelli avatar Jan 24 '25 15:01 mattsignorelli