DifferentiationInterface.jl
Add the GTPSA.jl backend
#316
@mattsignorelli thank you for the contribution!
I fixed a few things and defined AutoGTPSA in the main package for now. Types should not be defined in package extensions because it makes them hard to import.
The main part missing is the pushforward operator. If you don't have a dedicated mechanism for that, you can always deduce it from derivative, gradient or jacobian depending on the function type (array/scalar -> array/scalar). It will be slow but it will work.
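For reference, such a fallback could look roughly like the following sketch (names and signatures are illustrative, not DI's actual internals; my_jacobian stands in for whatever the backend already provides):

# Illustrative sketch only: deduce a pushforward (JVP) from a Jacobian
# for an array-to-array function.
function fallback_pushforward(f, x::AbstractArray, dx::AbstractArray)
    J = my_jacobian(f, x)   # full Jacobian of f at x (backend-provided)
    return J * vec(dx)      # Jacobian-vector product = pushforward along dx
end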
Codecov Report
Attention: Patch coverage is 0% with 416 lines in your changes missing coverage. Please review.
Project coverage is 54.75%. Comparing base (d51fc0a) to head (77e49d6).
:exclamation: There is a different number of reports uploaded between BASE (d51fc0a) and HEAD (77e49d6). Click for more details.
HEAD has 38 uploads less than BASE
| Flag | BASE (d51fc0a) | HEAD (77e49d6) |
|---|---|---|
| DI | 41 | 11 |
| DIT | 10 | 2 |
Additional details and impacted files
@@ Coverage Diff @@
## main #329 +/- ##
===========================================
- Coverage 98.51% 54.75% -43.76%
===========================================
Files 108 103 -5
Lines 4297 4597 +300
===========================================
- Hits 4233 2517 -1716
- Misses 64 2080 +2016
| Flag | Coverage Δ | |
|---|---|---|
| DI | 52.44% <0.00%> (-46.14%) | :arrow_down: |
| DIT | 60.01% <ø> (-38.37%) | :arrow_down: |
Flags with carried forward coverage won't be shown. Click here to find out more.
Sorry for the delay on this, I've had my hands tied with several different projects. Two weeks ago I pushed major updates, fixes, and finishing touches to GTPSA.jl, which is now in its first stable release. I'll get back to this shortly.
The structure of the tests has changed a little since the last time you checked, can you merge main into this PR?
OK, everything looks good. I had to release a new GTPSA (v1.1) to allow its jacobian to accept AbstractArray, and also to define the broadcastable trait. But Back/GTPSA now passes all test_differentiation tests. I commented out the coverage tests in the yml because they were erroring due to a Codecov token I don't have set up on my own fork.
Sorry accidentally hit the close button instead of submit... time for bed
I submitted a PR to ADTypes here https://github.com/SciML/ADTypes.jl/pull/78
Let me just take the time to review here, I haven't yet looked at your code. If all is well on the DI front, we can add AutoGTPSA to ADTypes!
Thank you so much for your contribution
@mattsignorelli before I review, I updated the tests so that we now have full code coverage information. Can you take a look and see if you are able to increase the coverage?
One thing you need to know is that in-place operators like pushforward!(f, dy, backend, x, dx) are not tested on scenarios for which the target dy is not mutable. So the branch if y isa Number in this operator is not needed.
Another thing you may want to exploit to reduce code duplication is the redundancy between operators: derivative is just pushforward with dx = 1, gradient is just pullback with dy = 1. They all fall back onto one another inside DI, so that it is theoretically sufficient to just implement value_and_pushforward. If custom implementations of other operators yield a speed boost, then let's keep them, but if all you do in derivative is essentially call pushforward, you can get rid of it (for example).
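To illustrate the idea (a hedged sketch of the fallback principle described above, not the literal DI source):

# Sketch only: derivative of f at a scalar x is the pushforward seeded with dx = 1,
# and the gradient of a scalar-valued f is the pullback seeded with dy = 1.
derivative_via_pushforward(f, backend, x::Number) = pushforward(f, backend, x, one(x))
gradient_via_pullback(f, backend, x::AbstractArray) = pullback(f, backend, x, 1.0)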
Thanks for looking this over!
For second order, because GTPSA does not allow nesting, separate operators must be written. For the first-order operators gradient and jacobian, there is a large slowdown and a significantly higher number of allocations when using only pushforward. This is because, in order to construct the jacobian or gradient, several pushforwards with different seeds have to be propagated through the function. The TPS type is mutable, so it is much slower to repeatedly evaluate the function for different seeds than it is to evaluate it once, treating the TPSs as variables, and then extract the first-order derivatives. I have removed derivative, however, since using pushforward for that is equivalent.
I have also removed inaccessible branches (where y is not mutable) from the in-place operators, following the code coverage report, and simplified the HVP code (since the best GTPSA can really do is construct the entire hessian and multiply the vector).
The rest of the uncovered code is because the user can optionally specify a custom Descriptor, which lives in the AutoGTPSA type. This allows users to choose custom truncation orders for different variables (e.g. calculate the hessian of f(x,y) but assume there are no $x^2$ terms in the resulting truncated power series, so $\partial^2 f/\partial x^2$ is zero). This can significantly improve speed at the cost of lower accuracy. The feature is only applicable in highly specific use cases and I don't expect it to be used very often, but it should be available since GTPSA provides it. As for code coverage in this case: the number of input variables and truncation order(s) have to be specified in the Descriptor for each tested function, and I'm not sure how to do this for every test case?
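For illustration, a custom Descriptor with per-variable truncation orders might look roughly like this (sketch only; double-check the exact Descriptor constructor and how the AutoGTPSA type stores it):

# Sketch only: per-variable truncation orders via a custom Descriptor.
# Here variable 1 is truncated at order 1 and variable 2 at order 2, so any
# term involving x^2 (and hence the corresponding second derivative) is dropped.
d = Descriptor([1, 2], 2)  # individual variable orders [1, 2], maximum order 2
backend = AutoGTPSA(d)     # assuming the Descriptor is stored inside the AutoGTPSA type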
Sorry for taking so long to get back to you @mattsignorelli! I have brought your PR up to speed with the rest of the package, and taken into account the new version of ADTypes which includes AutoGTPSA.
The last thing I'd like to do before merging is clean up the code a little bit, by removing duplication in favor of clear auxiliary functions. For example, my last commit cuts a lot of code from pushforward thanks to
function initialize!(xt::TPS, x::Number, dx::Number)
    xt[0] = x   # zeroth-order part: the evaluation point
    xt[1] = dx  # first-order part: the seed for the (single) variable
    return xt
end

function initialize!(xt::AbstractArray{<:TPS}, x::AbstractArray, dx::AbstractArray)
    for i in eachindex(xt, x, dx)
        initialize!(xt[i], x[i], dx[i])
    end
    return xt
end
Can you maybe create similar functions to shorten the other operators? Just like TPS initialization, extraction from TPS into numbers and arrays could also be streamlined.
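For instance (purely hypothetical names, just to sketch the idea), the extraction side could mirror initialize!:

# Hypothetical counterpart to initialize!, sketching streamlined extraction:
# pull the first-order coefficient of variable i out of each TPS result.
function extract!(dy::AbstractArray, yt::AbstractArray{<:TPS}, i::Integer)
    for j in eachindex(dy, yt)
        dy[j] = yt[j][i]   # i-th first-order monomial coefficient of the j-th output
    end
    return dy
end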
Thanks for your work on this!
function initialize!(xt::TPS, x::Number, dx::Number)
    xt[0] = x
    xt[1] = dx
    return xt
end
I think we need to be careful here. This would only apply for single-variable functions: xt[1] = dx basically sets the dual part of the first variable to dx. However, if there are two variables, then to set the dual part of the second variable, you would need to do xt[2] = dx. E.g.
julia> d = Descriptor(2,1); # two variables to first order
julia> f = TPS(use=d);
julia> f[0] = 1; f[1] = 2; f[2] = 3;
julia> print(f) # corresponds to f(dx1,dx2) = 1 + dx1 + dx2
TPS64:
Coefficient Order Exponent
1.0000000000000000e+00 0 0 0
2.0000000000000000e+00 1 1 0
3.0000000000000000e+00 1 0 1
Also, because the TPS type is mutable, we really only want to evaluate the function once. I'll make some changes, try to simplify and remove code duplication using utility functions, and push for review.
I think I understand what Tangents are now. Just to check: to take the pushforward of a 3-variable function with seed [dx1, dx2, dx3], your Tangents object will be Tangents{1, Vector{Float64}}(([dx1, dx2, dx3]))? And so Tangents is there to store multiple seeds?
Tangents is there to store multiple seeds?
The expression "multiple seeds" is a bit unclear. For instance, for a function f(x::AbstractVector), DI considers dx::AbstractVector to be "a single seed". It used to be possible to call pushforward(f, backend, x, dx) but that is no longer the case.
From v0.6 you will need to use the wrapper pushforward(f, backend, x, Tangents(dx)), and recover a Tangents{1} as the result. On the other hand, if you have several seeds dxa, dxb, dxc (perhaps clearer than numbers?) inside Tangents(dxa, dxb, dxc), then your batch size is 3 and you will get a Tangents{3} as the result.
Sorry if this is not well documented, you're the first outside person to ever contribute an extension so I didn't plan for that case ^^ But if there are unclear aspects I should definitely improve the docs.
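Concretely, the convention described above looks something like this (sketch only; f, x and the seeds are made up, and backend stands for any DI backend object):

# Sketch of the v0.6 calling convention described above.
f(x) = [x[1]^2, x[1] * x[2]]
x = [1.0, 2.0]
dxa = [1.0, 0.0]
dxb = [0.0, 1.0]

ty1 = pushforward(f, backend, x, Tangents(dxa))       # one seed  -> Tangents{1} result
ty2 = pushforward(f, backend, x, Tangents(dxa, dxb))  # two seeds -> batch size 2, Tangents{2} result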
I think we need to be careful here. This would only apply for single-variable functions
That's the setting considered in DI. Even with all the latest changes, there will only ever be one active variable x, either in f(x) or in f!(y, x). Of course that variable can itself be an array.
Ah ok, I think I understand better. I'm admittedly not super familiar with the usual conventions used by most AD packages in Julia, only that in GTPSA, so I'm trying to connect the dots here. Let me explain how it works in GTPSA, and then maybe you can help better see how to fit it into DI:
Basically, GTPSA calculates truncated power series in the variables, assuming the variables are small. So for example, to calculate the Jacobian of a function $F: \mathbb{R}^n \to \mathbb{R}^m$, you would define a Descriptor(n, 1) for $n$ variables to order 1, and then use x = vars(d), which gives you a vector that in the usual AD language is actually $[\Delta x_1, \Delta x_2, \ldots, \Delta x_n]$ ("variables" in GTPSA are small, so this might be better understood as dx = vars(d)). Then, to calculate the multivariable Taylor expansion of the function $F$ around some point $\vec{a}$, you would just do y = F(a.+x). y here is now the multivariable Taylor expansion. You can then extract the Jacobian from the monomial coefficients of the Taylor expansion.
Alternatively, one could do this one variable at a time, e.g. calculate $F([a_1+\Delta x_1, a_2, \ldots, a_n])$, then $F([a_1, a_2+\Delta x_1, \ldots, a_n])$, etc., and extract the first-order part of each component of the TPS result for each calculation. However, this will be slower than propagating all of $F(\vec{a}+[\Delta x_1, \Delta x_2, \ldots, \Delta x_n])$ at once, because TPSs are mutable and we only want to evaluate the function once.
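To make that concrete, the one-evaluation approach sketched above would look roughly like this (F is an arbitrary example function; the GTPSA calls follow the description above):

# Rough sketch of the "propagate all variables at once" approach described above.
using GTPSA

F(x) = [x[1]^2 + x[2], sin(x[1]) * x[2]]   # example function, n = 2, m = 2
a = [1.0, 2.0]                              # expansion point

d = Descriptor(2, 1)          # 2 variables, truncated at first order
dx = vars(d)                  # [Δx1, Δx2] as TPSs
y = F(a .+ dx)                # evaluate F once on the Taylor expansion around a
J = GTPSA.jacobian(y)         # read the Jacobian off the first-order coefficients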
That makes sense, thank you. But then I think the initialize! function I wrote earlier is fine? At least it passes the tests ^^
Also, this makes me wonder if we could compute HVPs by considering both x and dx as variables somehow. With higher-level differentiation utilities like those in GTPSA, there must be something better to do than computing the full Hessian first.
OK sorry for the big delay, I have been completely consumed by my other project but I'm ready to get back to this now.
Also, this makes me wonder if we could compute HVPs by considering both x and dx as variables somehow. With higher-level differentiation utilities like those in GTPSA, there must be something better to do than computing the full Hessian first.
That's a great idea. I'll look into this and see if I can come up with anything.
I'll take a look at the latest code and dev guide to reorient myself, and be in touch soon.
Codecov Report
All modified and coverable lines are covered by tests :white_check_mark:
Project coverage is 97.72%. Comparing base (59dc865) to head (3473422). Report is 1 commit behind head on main.
Additional details and impacted files
@@ Coverage Diff @@
## main #329 +/- ##
==========================================
+ Coverage 97.57% 97.72% +0.15%
==========================================
Files 115 119 +4
Lines 5646 6028 +382
==========================================
+ Hits 5509 5891 +382
Misses 137 137
| Flag | Coverage Δ | |
|---|---|---|
| DI | 98.81% <100.00%> (+0.12%) | :arrow_up: |
| DIT | 95.39% <ø> (ø) | |
Flags with carried forward coverage won't be shown. Click here to find out more.
@gdalle This should be ready soon. For HVP, I'm having difficulty understanding the tangents. For example, one of the tests gives a 6x6 Hessian matrix but the tangent is a scalar number. Shouldn't the tangent be a vector?
OK, other than hvp, we are pretty much done; all other tests pass. In order to achieve full code coverage, I'm adding the kwarg unsafe_inbounds to GTPSA's getter functions, so that I can run tests with AutoGTPSA{Descriptor} and just let the Descriptor have many variables to second order (before, I was only running tests with AutoGTPSA{Nothing}). Once this is released I will bump the version.
I'm having a bit of trouble with hvp. It's basically the same code that worked a few months ago, but now it's showing disagreements. Any help or clarification on my comment above would be much appreciated :) On that note, I looked into whether there is a faster way to calculate the hvp without materializing the full hessian, and unfortunately, because TPSs cannot be nested, there isn't really any option other than materializing the full hessian. There are some other ways I found, but none of them are faster.
Actually, I realized I made a mistake about which tests were being run when I made my previous comment. I found the problem.
This is ready for merge now
thanks, I'll take a look!
Hi any updates on this?
Note that you can commit the changes in a batch instead of one by one, by going to the "Files changed" tab
OK thank you for your thorough review!
The only remaining points to check are:
- Documentation of "variables" is sufficient
- That gradient can take Any?
- Clarification on the Context question
Documentation of "variables" is sufficient
Yeah, I think the link to the GTPSA docs is enough. Btw, I was wondering what the point of include_params=true is?
That gradient can take Any?
For reverse-mode backends, yes, gradient applies to an arbitrary struct. For forward-mode backends, we can only guarantee its behavior on AbstractArray (that's what ForwardDiff offers). So typing x::AbstractArray is fine.
Clarification on the Context question
Yeah I think you should replace Context with Constant everywhere
Yeah, I think the link to the GTPSA docs is enough. Btw, I was wondering what the point of include_params=true is?
GTPSA also allows one to specify differentials for "parameters", and those can be truncated at a different order than for the "variables". The include_params kwarg just ensures that the result includes the partial derivatives for those differentials.
The distinction between "variables" and "parameters" exists because you might have, say, a 3rd-order map representing the transport of a particle beam through a machine ("variables" are deviations from the reference particle orbit), and then want to optimize certain higher-order partial derivatives of this map with respect to "parameters" truncated at just first order.
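For concreteness, a sketch of what that distinction looks like on the GTPSA side (constructor form assumed; check the GTPSA docs):

# Sketch: assuming GTPSA's Descriptor(nv, mo, np, po) constructor,
# 2 "variables" truncated at 3rd order and 1 "parameter" truncated at 1st order,
# so partial derivatives involving the parameter only appear to first order.
d = Descriptor(2, 3, 1, 1)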
For reverse-mode backends, yes, gradient applies to an arbitrary struct. For forward-mode backends, we can only guarantee its behavior on AbstractArray (that's what ForwardDiff offers). So typing x::AbstractArray is fine.
Ah, I understand now. I'll remove these lines for 100% coverage then
Yeah I think you should replace Context with Constant everywhere
OK!
This should be ready to go