
Optim.retract differs from Manifolds.retract, maybe?

Open dehann opened this issue 3 years ago • 40 comments

Manifolds and Optim walk into a bar asking for Spheres. Manifolds says the norm is to project! https://github.com/JuliaManifolds/Manifolds.jl/blob/4c6cb43b9fce3ca4b506a00a6783aed9fc06a10f/src/manifolds/Sphere.jl#L388

Optim goes: whoa dude, retract! https://github.com/JuliaNLSolvers/Optim.jl/blob/e439de4c997a727f3f724ae76da54b9cc08456b2/src/Manifolds.jl#L69

thunk....


We (@Affie and I) are trying to figure out how to combine Optim with Manifolds, and after many hours we are starting to think that Optim is actually always projecting. We now think Optim perhaps needs to change to a more general retract about a point p given a vector X. The most difficult thing for us right now is understanding everyone's terminology, since all the references to retract keep saying "projecting".

Should Optim.retract! calls not rather always accept a manifold point p at which the tangent space is taken? See for example ManifoldsBase.retract(M,p,X), where I understand M as the Manifold type, p as a point on M, and X as a vector in the tangent space of M at p (e.g. X is a Lie algebra element).

Manopt.jl also states this as retr at point x_k of the vector s*delta.

PS, see in-place ManifoldsBase.retract!(M,q,p,X)
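
For concreteness, here is the Manifolds.jl signature we mean (a minimal sketch on the sphere; the numeric values are just illustrative):

```julia
using Manifolds

M = Sphere(2)           # the unit sphere S^2 embedded in R^3
p = [1.0, 0.0, 0.0]     # a point on M
X = [0.0, 0.5, 0.0]     # a tangent vector at p (orthogonal to p)

q = retract(M, p, X)    # a new point on M (defaults to the exponential map)
retract!(M, q, p, X)    # in-place variant: the result is written into q
```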


Optim docs add more confusion, saying:

Implementing new manifolds is as simple as adding methods project_tangent!(M::YourManifold,x) and retract!(M::YourManifold,g,x).

Yet the Optim.jl functions in code have the opposite parameter signatures: https://github.com/JuliaNLSolvers/Optim.jl/blob/e439de4c997a727f3f724ae76da54b9cc08456b2/src/Manifolds.jl#L1-L3

Not sure how to read g (gradient / tangent?) and x (an ambient vector, if this is an embedding?).


Also:

  • https://github.com/JuliaNLSolvers/Optim.jl/issues/448#issuecomment-331989428
  • xref JuliaRobotics/IncrementalInference.jl#1242

cc @kellertuer

dehann avatar May 13 '21 16:05 dehann

also cc @david-m-rosen

dehann avatar May 13 '21 16:05 dehann

Wow, you tackle all the points :)

The retraction topic might be a little difficult, especially since there is not the retraction; you can phrase many operations on manifolds as retractions.

Let's start at the beginning: What is a retraction? See for example http://sma.epfl.ch/~nboumal/book/index.html Definition 3.40: A retraction is a map from the tangent bundle to the manifold such that the restriction to one tangent space, retr_p, yields p for the zero tangent vector and its differential at zero is the identity. This can be seen as a first-order approximation to the exponential map.

The idea is that the exponential map might be too expensive. And on the sphere this is exactly what one does with projection: Walk into the embedding (p + X) and project back.

So first, in Manifolds.jl we implement retract(M,p,X,type) (and the in-place retract!(M,q,p,X,type)) similar to exp (and exp!), just that the last parameter additionally specifies which retraction you want to use. And you can see the retraction by projection on the Sphere for example here https://github.com/JuliaManifolds/Manifolds.jl/blob/4c6cb43b9fce3ca4b506a00a6783aed9fc06a10f/src/manifolds/Sphere.jl#L423.
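
To make that dispatch on the retraction type concrete (a small sketch; on the sphere both methods are available):

```julia
using Manifolds

M = Sphere(2)
p = [1.0, 0.0, 0.0]
X = [0.0, 0.5, 0.0]                                 # a tangent vector at p

q_exp  = retract(M, p, X, ExponentialRetraction())  # follow the great circle
q_proj = retract(M, p, X, ProjectionRetraction())   # (p + X) / ||p + X||
```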

So – what does Optim do? They always have a + in their algorithms anyway, so they surround this with manifold ideas (not completely mathematically rigorous, but + and then projection keeps you on the manifold, if there is a projection available). Since they have the + anyway, they – with a slight misuse of notation – have retract just project (again: the plus is already done in the algorithms themselves). So yes, they accept arbitrary points in the embedding, but the idea is that these come from some operation p+X. This does not work for all manifolds; if you want to do manifolds and optimization more rigorously, consider Manopt.jl.

For CG in Manopt.jl, keep in mind that

  • x is a point on the manifold
  • s is a stepsize (scalar)
  • δ is some descent direction. And you can also do exp(M, x, s*δ) (there is actually the ExponentialRetraction to do exactly this).

Summary (i.e. the tl;dr)

As phrased here https://github.com/JuliaManifolds/Manifolds.jl/issues/35#issuecomment-569449334 – "Manopt.jl is there for more serious manifoldization." With all my bias (as developer of Manopt.jl) and phrased carefully: If you have implemented something using Optim and want to quickly check a manifold idea (with one of their manifolds) – do it. If you want to do serious optimization on manifolds (similar to Manopt in Matlab or pymanopt in Python) – and be able to use any manifold from here – please use Manopt.jl. If you are missing an algorithm, let me know.

kellertuer avatar May 13 '21 17:05 kellertuer

Great to see this taken up! I wanted to do it at some point but was waiting on @pkofod 's fabled rewrite :-)

Should Optim.retract! calls not rather always accept a manifold point p at which the tangent space is taken?

I made this choice because it was simpler to modify the existing code in this way. Optim's retract(x+d) is Manifolds' retract(x, d), and Optim's retract(x) is Manifolds' project(x). This is motivated by the fact that optimization algorithms are not very sensitive to the choice of retraction, and that Optim only deals with manifolds as constraints (i.e. embedded manifolds). As far as I know there's no standard terminology on these points.

Implementing new manifolds is as simple as adding methods project_tangent!(M::YourManifold,x) and retract!(M::YourManifold,g,x).

That's a bug, my bad! Should be the opposite signature of course.
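
So, with the corrected order, defining a new manifold for Optim looks roughly like this (a sketch, assuming the signatures from src/Manifolds.jl linked above and Optim's abstract Manifold type; MyCircle is a made-up illustration, not part of Optim):

```julia
using Optim, LinearAlgebra

struct MyCircle <: Optim.Manifold end  # the unit circle in R^2, as an illustration

# Optim's "retract": project an ambient point x back onto the manifold, in place.
function Optim.retract!(M::MyCircle, x)
    x ./= norm(x)
    return x
end

# Project an ambient vector g onto the tangent space at the point x, in place.
function Optim.project_tangent!(M::MyCircle, g, x)
    g .-= dot(g, x) .* x
    return g
end
```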

antoine-levitt avatar May 13 '21 17:05 antoine-levitt

This is motivated by the fact that optimization algorithms are not very sensitive to the choice of retraction

Uh, they are. I just spent 2 hours today with a student on some numerical instabilities, until we exchanged the retraction for one a little less good at approximating exp but far better in stability, and the algorithm ran smoothly (there will be a Trust Region with approximate Hessian SR1 updates soon-ish).

kellertuer avatar May 13 '21 17:05 kellertuer

Do you mean that you have two mathematically well-defined retractions and one performs repeatedly and noticeably better than the other? If so I'd be interested in that. Are you sure you don't mean that one implementation of a retraction was numerically unstable and you switched to a numerically more stable one?

antoine-levitt avatar May 13 '21 17:05 antoine-levitt

I actually mean two retractions, both well-defined and such. So for example on the Sphere, you have

  • exp (using sine and cosine following great arcs) – yes that is a retraction
  • summation and projection

For large steps (or if you are fine stopping early) both are equally fine, but below steps of size 1e-7 exp is too unstable, so one should use the ProjectionRetraction.

The same holds even more if you take into account vector transports and their interplay with retractions (i.e. preferably take a vector transport by differentiated retraction of the same retraction for the best experience).

kellertuer avatar May 13 '21 17:05 kellertuer

That's not contradictory with what I said: mathematically (in infinite precision) both are equally fine, but of course you have to take care to implement them properly (stably). I'm pretty sure you can implement the exponential retraction stably – you probably need tricks like trigonometric identities to transform cos(x)-1 so it can be computed stably for small x.

antoine-levitt avatar May 13 '21 17:05 antoine-levitt

Ie: you have to distinguish the mathematical function and the algorithm implementing it. For me "retraction" only refers to the former

antoine-levitt avatar May 13 '21 17:05 antoine-levitt

Mathematically you should always use the exponential map, since a retraction only approximates (up to first order) the correct way to "step in direction X" on a manifold. Retractions (as far as I am aware, introduced in the Absil, Mahony, Sepulchre book) were only introduced for numerical algorithms and implementations, for example when no closed form is known for the solution of the ODE defining the exponential map.

So mathematically, with retractions you introduce an additional error (comparable to not walking along straight lines). For sure, since those are first-order approximations, for "small steps" you are "fine enough".

Concerning the concrete example: I think we already have a quite stable form, I am not aware of a numerically more stable one at the moment.

kellertuer avatar May 13 '21 17:05 kellertuer

I don't understand at all why you would say that. There's nothing sacred or "correct" about the exponential map. It's just a canonical (associated with a specific riemannian metric) way of moving around, but so what? For a given manifold, there may be several ways of moving around, of transporting tangent vectors, several connections, etc, and that's perfectly fine. Sure, Levi-Civita is nice, but it's not the only one.

When you do optimization you already approximate the structure of the objective function (using gradient and/or hessian information). Saying that you should do the exponential map is a bit like insisting that you should follow exactly the ODE x' = -nabla f(x) instead of doing gradient descent (because gradient descent introduces an "additional error" compared to the gradient flow). You don't care about following particular trajectories, you care about getting to the minimum at minimal cost. All convergence theorems I know are insensitive to the choice of a retraction (for the deep reason that the second-order geometry of the manifold does not influence the second-order properties of the objective function near a critical point), and I've never seen any practical difference between different retractions.

antoine-levitt avatar May 13 '21 18:05 antoine-levitt

Mathematically, if you are close enough to the minimiser, a retraction works as well as the exponential map, sure. Maybe my statement was too strong there.

kellertuer avatar May 13 '21 18:05 kellertuer

My point is that we should not think of the exponential map as being the goal and retractions as being approximations of that goal; we should think of the exponential map as one retraction among many. I'm not an expert in differential geometry, but to me the point of the Levi-Civita connection is not that it's the "best connection", it's just a canonical one (that you can assign in a unique way to any riemannian manifold). Just because a choice is canonical does not mean it's not still a choice.

antoine-levitt avatar May 13 '21 18:05 antoine-levitt

Levi-Civita, well, it is the only torsion-free connection that preserves the metric. In the sense of properties it is the best connection. So it is not just a choice, you gain a lot from that.

The properties with which retractions are introduced are such that they approximate the exponential map. It's a tradeoff: there are retractions where you can state you stay very close to exp, and there are several that are easier, faster, or more stable to evaluate – hence there might be good reasons to take them, and as you said, locally that's fine for convergence. I am not saying that exp is always the goal, for sure not, but it is the operation a retraction “mimics”.

kellertuer avatar May 13 '21 18:05 kellertuer

The property of being torsion free is somewhat arbitrary. The only reason we care about torsion free is because of the uniqueness of the Levi-Civita connection: it's not that it's such a fundamental property, it's just that it is canonical.

it is the operation a retraction “mimics”.

Again, I don't see it that way. The goal is to have a retraction; the exponential map just happens to provide you one. Looks like I won't be able to convince you, but that's OK, as long as we both agree that in practice it doesn't matter much which one you choose.

antoine-levitt avatar May 13 '21 18:05 antoine-levitt

In practice it does matter which one you choose, depending on numerical stability (that is, both a) are you clever enough to implement it stably, and b) does a stable / closed-form implementation exist), and I agree that it is sometimes beneficial to stick to a retraction for these reasons. And in that sense it also matters (see above exp not being stable for example). For small steps, in theory it does not matter, that's right (e.g. for convergence).

If possible, I prefer exp, and I see I can't convince you of that. For the algorithms that we have in mind this is a philosophical point, I think, so it is also completely fine to see this differently.

kellertuer avatar May 13 '21 19:05 kellertuer

see above exp not being stable for example

Again, numerical stability is a property of the algorithm implementing the mathematical function, not of the mathematical function itself. What is a property of the mathematical function is conditioning, and exp is well-conditioned. Some implementations are (technically: backward) stable, some are not. I'm pretty sure you just used a formula like exp(t) = cos(t) p + sin(t) d, which is going to be bad when t becomes comparable to sqrt(eps) ≈ 1e-8, but that's not the fault of exp, it's the fault of the implementation. You can e.g. do exp(t) = p + sin(t) d - 2 sin(t/2)^2 p, which computes the same mathematical function but does not have trouble for small t. edit: hmm, actually maybe it does. y = p + sin(t) d - 2 sin(t/2)^2 p; x = y / norm(y); is maybe better? To be checked, but anyway the point remains.
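
Written out, the two algebraically equivalent forms under discussion (a sketch; p a unit-norm point, d a unit tangent at p):

```julia
using LinearAlgebra

# Direct form: fine in exact arithmetic, potentially lossy for tiny t.
exp_naive(p, d, t) = cos(t) .* p .+ sin(t) .* d

# Rewritten via cos(t) - 1 = -2 sin(t/2)^2, with the renormalization
# suggested in the edit above.
function exp_rewritten(p, d, t)
    y = p .+ sin(t) .* d .- (2 * sin(t / 2)^2) .* p
    return y ./ norm(y)
end
```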

antoine-levitt avatar May 13 '21 19:05 antoine-levitt

I think we especially have a different understanding of "practically". In your theoretical practicality, exp is well-conditioned, sure. In my practical practicality, we use https://github.com/JuliaManifolds/Manifolds.jl/blob/4c6cb43b9fce3ca4b506a00a6783aed9fc06a10f/src/manifolds/Sphere.jl#L183-L187, but that uses (sin(t)/t) d, and maybe we should exchange that; you're right.

That does not mean that every exponential map can be computed (with a different algorithm) arbitrarily exactly. See for example the logarithmic map on Stiefel, for which no closed-form solution is known so far.

So I am not sure which theoretical practicality you are referring to here. In theory, if you are close (small steps) exp/retr do the same in principle. In practice:

  1. If X is the direction you want to go into, exp_p(X) gets you there, retr_p(X) just approximately
  2. numerical stabilities
  3. computational effort

are all points to be taken into account together. Retr loses on the first, but might win on 2 and 3.

kellertuer avatar May 13 '21 19:05 kellertuer

Actually now that I look at it that implementation looks fine to me (as long as X is orthogonal to p to machine precision), what's wrong with it numerically?

antoine-levitt avatar May 13 '21 19:05 antoine-levitt

Hi wow, lots of info -- thanks! This already helps a lot. Think it's a case of diverse but minimal documentation coming from many different needs/sources.

I'm busy with some documentation examples for Manifolds.jl and will try to capture some of this there to help. From my side, I'm trying to make it easier for new users to ramp up quickly, and cross-checking against the literature. I'm finding even the textbooks have differing opinions about names or identifying operations.


Perhaps I could add, as a big fan and user of all the mentioned packages: What I like most about Manifolds.jl is its serious effort on getting the general abstractions right. Dealing with data types is a bit hard the first time round though (can be fixed via more tutorial docs, which is where I'm working now). What I like about Optim.jl is its diverse user base and history, including Flux (I already have a big dependency on Optim stretching back years); the difficult bit is a deeper assumption about '+' from linear algebra vs. generalized on-manifold increments. What I like about Manopt.jl is its well-abstracted on-manifold update rules, but support for high-dimensional systems might not be as well developed yet (just a younger package, so all good). Over at IncrementalInference.jl we will likely support all of the above and NLsolve too. Hence my particular care about naming conventions and operations, trying to avoid type piracy, etc.

dehann avatar May 13 '21 20:05 dehann

Actually now that I look at it that implementation looks fine to me (as long as X is orthogonal to p to machine precision), what's wrong with it numerically?

I will first thoroughly investigate why it happens and then propose a fix – as soon as I find time.

kellertuer avatar May 14 '21 05:05 kellertuer

Hi wow, lots of info -- thanks! This already helps a lot. Think it's a case of diverse but minimal documentation coming from many different needs/sources.

I will try to extend our documentation for sure :)

Perhaps I could add, as a big fan and user of all the mentioned packages: What I like most about Manifolds.jl is its serious effort on getting the general abstractions right.

Thanks for this nice feedback!

Dealing with data types is a bit hard the first time round though (can be fixed via more tutorial docs, which is where I'm working now).

For now in Manifolds.jl data types are loosely typed, and we mostly assume to have arrays (but want to allow for static arrays, too, for example), so we only type data if it is necessary for distinction. Let's see how we can also document this better.

What I like about Manopt.jl is its well-abstracted on-manifold update rules, but support for high-dimensional systems might not be as well developed yet (just a younger package, so all good).

Let me know when I can help somewhere; high-dimensional systems should be possible using product manifolds, for example.

kellertuer avatar May 14 '21 05:05 kellertuer

Thanks for all the feedback and discussion.

I have 2 questions:

  1. Do I understand it correctly then that Optim.jl's retract!(M, x) is equivalent to Manifolds.jl's project(M, p), where x and p are points in the ambient space of manifold M projected onto M?

  2. I'm still not following Optim.jl's project_tangent(M,g,x): is g (gradient?) a vector in the tangent space that is mapped back (by projection) to the manifold at the point x on the manifold? That would make it equivalent to Manifolds.jl's retract(M, p, X).

Affie avatar May 14 '21 16:05 Affie

  1. I think so yes
  2. g is a vector in the ambient space that gets projected to the tangent space at x. Nothing to do with retractions.

For instance, gradient descent is x <- retract(M, x - alpha * project_tangent(M, g, x)) where g is the gradient of the objective function at x. Hope that helps.
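
In code, one such step looks like this (a minimal sketch using Optim's built-in Sphere; the gradient and step size are placeholders):

```julia
using Optim, LinearAlgebra

M = Optim.Sphere()
x = normalize(randn(3))          # current iterate, a point on the unit sphere
g = randn(3)                     # ambient gradient of the objective at x
α = 0.1                          # step size

Optim.project_tangent!(M, g, x)  # project g onto the tangent space at x
x .-= α .* g                     # ambient step x - α g
Optim.retract!(M, x)             # project the result back onto the sphere
```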

antoine-levitt avatar May 14 '21 16:05 antoine-levitt

Concerning 1) I think so, too.

kellertuer avatar May 14 '21 18:05 kellertuer

Thanks, so for 2 it stays in the tangent space. Then it looks like the relevant Manifolds.jl function is:

project(M::Manifold, p, X)

Project ambient space representation of a vector X to a tangent vector at point p on the Manifold M.
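
For example (a small sketch):

```julia
using Manifolds

M = Sphere(2)
p = [1.0, 0.0, 0.0]
v = [0.4, 0.2, 0.1]       # an arbitrary ambient vector

X = project(M, p, v)      # v minus its component along p: [0.0, 0.2, 0.1]
```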

Affie avatar May 14 '21 20:05 Affie

Ah, it's somewhat unfortunate that Manifolds uses the same name (project) for two different operations. Wouldn't it make more sense to call project(M, p) retract(M, p)?

antoine-levitt avatar May 15 '21 05:05 antoine-levitt

We use the same name (project) for two things that are projections. Let me elaborate and explain the origin of the functions.

Originally we had project_tangent(M, p, X) (mutating variant project_tangent!(M, Y, p, X)), which takes a (tangent) vector X in the embedding and projects it to the tangent space at p, and we had project_point(M, p) (mutating variant project_point!(M, q, p)), which takes a point p from the embedding and projects it onto the manifold.

Since both are actually projections, we simplified them all to be called project. Yes, if you write them all down, two similar signatures occur that do something different:

  • project(M, p, X) projects X onto the tangent space at p
  • project!(M, q, p) projects p onto M, computed in-place in q.

But – they are all projections, and p, q are completely different objects (points) than X (a vector).

The retraction retract(M, p, X) (or its mutating variant retract!(M, q, p, X)), on the other hand, takes a point p and a tangent vector X from the tangent space at p and follows a special curve emanating from p in direction X, so it ends up at a point q on the manifold that is roughly ||X||_p away. If you choose retract(M, p, X, ::ExponentialRetraction), the result is exactly d(p,q) = ||X||_p away.

Mathematically there is no retraction that would work without said direction. retract(M,p) is just nothing I have seen anywhere that would mathematically make sense, because a retraction always works at a point p with a tangent vector X. Even more, since there is not the retraction, a retraction also always requires specifying which one you mean. In Manifolds it defaults to using the exponential map if you do not provide the type, but there are quite some retractions on certain manifolds; the one we discuss here is the ProjectionRetraction(), for which we have retract(M, p, X, ::ProjectionRetraction) = project(M, p+X) (at least if you have a + in the embedding; it might be more complex in other embeddings).

Or, to answer 2) in short: No. project_tangent(M, p, X) (which it was called in Manifolds in the beginning) and retract(M, p, X) are not the same: the first projects an (ambient) X to give you a tangent vector at p, the second expects X to be a tangent vector and computes a point on the manifold.
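
A quick check of the identity stated above on the sphere (a sketch; equality holds up to floating point):

```julia
using Manifolds

M = Sphere(2)
p = [1.0, 0.0, 0.0]
X = project(M, p, [0.0, 0.2, 0.3])  # make sure X is tangent at p

q1 = retract(M, p, X, ProjectionRetraction())
q2 = project(M, p + X)
q1 ≈ q2                             # true
```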

kellertuer avatar May 15 '21 06:05 kellertuer

To me projecting a point on the manifold and retractions are more alike than projecting a point on the manifold and projecting a vector on the tangent space. But naming things is always tricky!

antoine-levitt avatar May 15 '21 11:05 antoine-levitt

But there is an addition within the retraction before you project?

The projection itself is definitely not a retraction, since a retraction is a map from the tangent space to the manifold – and not a map from the embedding to the manifold. So they are – mathematically – something completely different.

For the two projections – they are, mathematically, projections; for example, projecting twice is the same as projecting once. They are very similar.

kellertuer avatar May 15 '21 11:05 kellertuer

Oh and even more, the retraction by QR decomposition or by SVD or these (see Stiefel for example) are also very very far from projections.

kellertuer avatar May 15 '21 11:05 kellertuer