DifferentiationInterface.jl

Second order for vector to vector functions

Open gdalle opened this issue 1 year ago • 7 comments

At the moment I don't know if there is wide user interest, but feel free to contribute here.

See initial discussion with @timholy in https://github.com/JuliaDiff/AbstractDifferentiation.jl/pull/134

gdalle avatar Apr 24 '24 09:04 gdalle

xref https://github.com/JuliaDiff/ForwardDiff.jl/issues/61 (and quite possibly others, but I haven't searched)

timholy avatar Apr 24 '24 11:04 timholy

As far as naming goes, perhaps this could still be called Hessian. It's a little inconsistent that we give the gradient of a vector-valued function a special name, "Jacobian," but that's basically a historical artifact. The right way to think about it is that the Jacobian is not a matrix: it's a 2-tensor that is both contravariant (for the "vector-valued" portion) and covariant (for the gradient portion). There's really no confusion between these, except that in Julia we handle this by adding an extra dimension and we don't use any encoding to specify which "dimension" is covariant and which is contravariant.

The second order derivative of a vector-valued function is, of course, a rank-3 tensor with one contravariant and two covariant "dimensions." Again there doesn't have to be any actual ambiguity about these things, if we encoded the meaning of the dimensions. Heck, one could even define Hessian-(contravariant)vector products just fine with no ambiguity, because there are only two candidates for that (the two covariant "dimensions") and they are symmetric.
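To make the tensor picture concrete, here is a rough sketch with plain nested ForwardDiff calls (the function f and the names H and Hv are purely illustrative, not anything in DI):

using ForwardDiff

f(x) = [sum(abs2, x), prod(x)]  # f: ℝ³ → ℝ²
x = [1.0, 2.0, 3.0]
m, n = length(f(x)), length(x)

# rank-3 "vector Hessian" H[i, j, k] = ∂²fᵢ/∂xⱼ∂xₖ:
# differentiate the flattened Jacobian once more, then reshape to (m, n, n)
H = reshape(ForwardDiff.jacobian(z -> vec(ForwardDiff.jacobian(f, z)), x), m, n, n)

# Hessian-(contravariant)vector product: contract one of the two symmetric covariant dimensions with v
v = [0.1, 0.2, 0.3]
Hv = [sum(H[i, j, k] * v[k] for k in 1:n) for i in 1:m, j in 1:n]  # size (m, n)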

timholy avatar May 10 '24 12:05 timholy

> we give the gradient of a vector-valued function a special name, "Jacobian," but that's basically a historical artifact. The right way to think about it is that the Jacobian is not a matrix: it's a 2-tensor that is both contravariant (for the "vector-valued" portion) and covariant (for the gradient portion).

To add to what you are saying, Jacobians are the matrix representation of two linear maps: pushforwards and (transposed) pullbacks, which are at the heart of forward- and reverse-mode AD (and DI). For $f: \mathcal{M} \rightarrow \mathcal{N}$, pushforwards map tangent spaces on $\mathcal{M}$ onto the corresponding tangent spaces on $\mathcal{N}$. Pullbacks map cotangent spaces on $\mathcal{N}$ onto the corresponding cotangent spaces on $\mathcal{M}$.

I'm guessing the popularity of Jacobians comes from frameworks like JAX calling their pushforwards Jacobian-vector products and pullbacks vector-Jacobian products, even though they are implemented as matrix-free linear maps. To some degree, this is just a question of semantics.
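As a purely illustrative sketch of the two matrix-free linear maps (the jvp/vjp helpers below are hypothetical names, not DI's or JAX's API, and the pullback is computed with forward mode just to keep the example short):

using ForwardDiff, LinearAlgebra

f(x) = [x[1] * x[2], sin(x[2])]

# pushforward / JVP: tangent on the input space ↦ tangent on the output space
jvp(f, x, dx) = ForwardDiff.derivative(t -> f(x .+ t .* dx), 0.0)

# pullback / VJP: cotangent on the output space ↦ cotangent on the input space
vjp(f, x, dy) = ForwardDiff.gradient(z -> dot(dy, f(z)), x)

x = [1.0, 2.0]
jvp(f, x, [1.0, 0.0])  # first column of the Jacobian of f at x
vjp(f, x, [1.0, 0.0])  # first row of the Jacobian of f at x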

adrhill avatar May 10 '24 12:05 adrhill

@gdalle, are you now satisfied that this isn't "weird"? I could resubmit https://github.com/JuliaDiff/AbstractDifferentiation.jl/pull/134 here.

timholy avatar May 25 '24 09:05 timholy

I understand the concept, but I've been thinking more about your specific application to optimization with a Lagrangian:

$$L: (x, \lambda) \in \mathbb{R}^n \times \mathbb{R}^m \longmapsto f(x) + \lambda^\top c(x)$$

I'm still unsure which option is more wasteful:

  • computing the joint scalar Hessian $\nabla_{(x,\lambda)}^2 L \in \mathbb{R}^{(n+m) \times (n+m)}$ and keeping only the upper left corner $\nabla_x^2 L \in \mathbb{R}^{n \times n}$
  • computing the vector Hessian of $\phi: x \longmapsto (f(x), c_1(x), \dots, c_m(x))$, let's call it $\overset{\text{vec}}{\nabla^2} \phi \in \mathbb{R}^{(1+m) \times n \times n}$, and then summing along the first dimension
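For reference, here is a rough sketch of both options with nested ForwardDiff calls (purely illustrative, no performance claim; note that to recover the Lagrangian Hessian the first dimension is contracted with $(1, \lambda)$ rather than plainly summed):

using ForwardDiff, LinearAlgebra

f(x) = sum(abs2, x)
c(x) = diff(x) .^ 3
L(x, λ) = f(x) + dot(λ, c(x))

n, m = 4, 3
x, λ = rand(n), rand(m)

# Option 1: joint Hessian in (x, λ), keep only the upper-left n × n block
H_joint = ForwardDiff.hessian(v -> L(v[1:n], v[n+1:end]), vcat(x, λ))
Hxx = H_joint[1:n, 1:n]

# Option 2: vector Hessian of φ(x) = (f(x), c₁(x), …, cₘ(x)), contracted with (1, λ)
φ(x) = vcat(f(x), c(x))
H_vec = reshape(ForwardDiff.jacobian(z -> vec(ForwardDiff.jacobian(φ, z)), x), 1 + m, n, n)
Hxx_bis = dropdims(sum(vcat(1.0, λ) .* H_vec; dims=1); dims=1)

Hxx ≈ Hxx_bis  # both recover ∇²ₓL at (x, λ)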

Perhaps @amontoison has a clue; we've been talking about this too.

gdalle avatar May 25 '24 09:05 gdalle

Either way, if you want the vector Hessian, you can get it very easily as follows:

using DifferentiationInterface
import ForwardDiff

function value_jacobian_and_vector_hessian(f, backend, x)
    y = f(x)
    # differentiate the Jacobian of f once more to get all second derivatives
    jac(z) = jacobian(f, backend, z)
    J, H = value_and_jacobian(jac, backend, x)
    return y, J, reshape(H, length(y), length(x), length(x))
end
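For example, with a toy function of my own choosing (`AutoForwardDiff()` is the usual ADTypes backend object re-exported by DI):

f(x) = [sum(abs2, x), prod(x)]
x = [1.0, 2.0, 3.0]
y, J, H = value_jacobian_and_vector_hessian(f, AutoForwardDiff(), x)
size(J), size(H)  # ((2, 3), (2, 3, 3))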

I'm reluctant to add it to DifferentiationInterface because:

  1. It's very easy to code
  2. It's very expensive to test and document. Our test suite is the true added value of the package, and adding one more operator would require at least 500 LOCs off the top of my head.
  3. No backend natively supports this, so the implementation above is essentially optimal...
  4. ... up to preparation. That is where we could possibly improve performance. Unfortunately, preparation for second-order operators is tricky, and I haven't figured it out fully yet.

Since you've already been bitten by the last item in #252, you can look at the other issues discussing it:

  • #86

gdalle avatar May 25 '24 09:05 gdalle

> computing the joint scalar Hessian ... and keeping only the upper left corner

This would be viable, but you probably wouldn't want to throw it away: Newton's method for constrained optimization amounts to a first-order expansion of $\nabla_x L$ in $\Delta x$ and $\Delta \lambda$ around $(x, \lambda)$. So the $x, \lambda$ terms are useful. The $\lambda, \lambda$ ones are not, because they are zero (the Lagrangian is linear in $\lambda$).
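Concretely, the Newton step on the stationarity conditions $\nabla_x L = 0$, $c(x) = 0$ (equality constraints only) solves the standard KKT system, where the $(x, \lambda)$ cross-block is just the constraint Jacobian $\nabla c(x) \in \mathbb{R}^{m \times n}$ and the $(\lambda, \lambda)$ block vanishes:

$$\begin{pmatrix} \nabla_x^2 L & \nabla c(x)^\top \\ \nabla c(x) & 0 \end{pmatrix} \begin{pmatrix} \Delta x \\ \Delta \lambda \end{pmatrix} = - \begin{pmatrix} \nabla_x L(x, \lambda) \\ c(x) \end{pmatrix}$$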

That said, there are cases where you might still want the individual Hessians (your second proposal). For unconstrained optimization, trust-region methods are very popular, and they essentially ask "was the quadratic expansion fairly accurate?" You could imagine asking that about each individual constraint as well, in which case you might want to have them disentangled from $L$. But that's a bit more speculative.

Anyway, it's unfortunate that 2nd order is so hard. Presumably Julia is not the only ecosystem grappling with this?

timholy avatar May 25 '24 09:05 timholy

As of today's main branch, DI allows you to keep some of the arguments constant in a differentiation operator. You can thus prepare a Lagrangian Hessian with some multiplier $\lambda_0$, and then compute it with other values $\lambda \neq \lambda_0$, which I think was your main goal here.

If this is satisfactory for constrained optimization, I don't think I'm gonna implement vector-Hessians in the near future, so I'll close this for now.

using DifferentiationInterface, LinearAlgebra
using Enzyme: Enzyme

obj(x) = sum(abs2, x)
cons(x) = diff(x) .^ 3
lag(x, λ) = obj(x) + dot(λ, cons(x))

x, λ = float.(1:4), float.(1:3);
# prepare with a dummy multiplier, passed as a Constant (non-differentiated) argument
prep = prepare_hessian(lag, AutoEnzyme(), x, Constant(rand(size(λ)...)));
# reuse the preparation with the actual multiplier λ ≠ λ₀
hessian(lag, prep, AutoEnzyme(), x, Constant(λ))

gdalle avatar Sep 24 '24 09:09 gdalle

Sounds very exciting, looking forward to trying it!

timholy avatar Sep 24 '24 11:09 timholy

Very exciting, thanks for the great work @gdalle!

Vaibhavdixit02 avatar Sep 24 '24 11:09 Vaibhavdixit02