
A combined CostGrad structure

Open kellertuer opened this issue 3 years ago • 5 comments

Following some feedback from the talk yesterday:

It would be nice to have a function that can reuse parts from the computation of the cost for the gradient and vice versa.

In Manopt (Matlab) such a method is available ( https://www.manopt.org/tutorial.html#costdescription ); in pymanopt there is an open issue for it ( https://github.com/pymanopt/pymanopt/issues/65 ).

A first idea would be to introduce a struct

abstract type AbstractCombinedCostAndGrad{P,T} end

mutable struct CombinedCostAndGradient{P,T,F} <: AbstractCombinedCostAndGrad{P,T}
    p::P
    grad::T
    cost::F
end

and a combined function

function (gc::CombinedCostAndGradient)(M, p)
    ...
end

which would compute the cost c and the gradient if p differs from the stored gc.p, and store p, grad, and c in the struct. The actual cost and gradient then depend on an initialised gc as

function cost(M, p)
    gc(M, p)
    return gc.cost
end
function grad(M, p)
    gc(M, p)
    return gc.grad
end

and one could write a constructor for all abstract gradient problems (or all that have a gradient) so that, when they are called with a CombinedCostAndGrad structure, these two functions are generated automatically within the generated problem.

Then the user really only has to implement the combined function and wrap it in the new struct.
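A minimal runnable sketch of this idea, with all names hypothetical (they follow the snippets above, not any actual Manopt.jl API), a plain vector standing in for a manifold point, and the manifold argument M ignored:

```julia
# Hypothetical sketch, not an actual Manopt.jl type. `fg` is a user
# function (M, p) -> (cost, gradient) that shares the expensive part
# of the computation internally.
mutable struct CombinedCostAndGradient{P,T,F,FG}
    p::P      # point the cache was last computed at
    grad::T   # gradient at p
    cost::F   # cost at p
    fg::FG    # combined user function
end

# recompute (and store) cost and gradient only when the point changed
function (gc::CombinedCostAndGradient)(M, p)
    if p != gc.p
        gc.cost, gc.grad = gc.fg(M, p)
        gc.p = copy(p)
    end
    return gc
end

# cost and gradient become thin accessors over the shared cache
cost(gc::CombinedCostAndGradient, M, p) = (gc(M, p); gc.cost)
grad(gc::CombinedCostAndGradient, M, p) = (gc(M, p); gc.grad)
```

Calling cost and then grad at the same point then evaluates fg only once.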

kellertuer avatar Jul 27 '22 11:07 kellertuer

There is already a Julia package that has combined function and gradient calculation figured out: https://github.com/JuliaNLSolvers/NLSolversBase.jl .

mateuszbaran avatar Jul 27 '22 14:07 mateuszbaran

Thanks for the link, maybe it can also be done more easily that way (with their latest fg! idea, though I think I would also have a plain fg there) – and I want it to be easily compatible with the current state we have here (so that, with just some preprocessing, the existing gradient problems can be reused).

kellertuer avatar Jul 27 '22 14:07 kellertuer

The interesting part that I have not yet written above, and that they cover, is evaluating only one of the two even when having a costgrad (or their costgrad!); for the non-mutating one I might need an extra argument and not just a Nothing like they use.
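For reference, the fg!(F, G, x) convention from NLSolversBase.jl / Optim.jl being referred to here uses nothing to deactivate the parts that are not needed; a toy version for f(x) = sum(x.^2) could look like this (a sketch, not the actual package code):

```julia
# Toy version of the NLSolversBase.jl / Optim.jl `fg!(F, G, x)` convention
# for f(x) = sum(x.^2). Passing `nothing` for F or G deactivates the
# corresponding part of the computation.
function fg!(F, G, x)
    if G !== nothing
        G .= 2 .* x          # write the gradient in place
    end
    if F !== nothing
        return sum(abs2, x)  # return the cost only when it was asked for
    end
    return nothing
end
```

So fg!(1.0, nothing, x) computes only the cost, while fg!(nothing, G, x) fills only the gradient.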

kellertuer avatar Jul 27 '22 15:07 kellertuer

When you calculate gradient using AD then you get cost "for free" as a part of the output of AD -- so it's nice to be able to use that.

mateuszbaran avatar Jul 27 '22 15:07 mateuszbaran

Sure, I also know a few cases where I can use that. I am planning to tackle this issue at some point; I just do not see the point of using Nothing to deactivate that.

kellertuer avatar Jul 27 '22 16:07 kellertuer

I think I might have an idea based on a discussion I had with a student today, where we actually needed something like this. The idea is as follows; it would also be an approach to solve #171, since it also covers the cache case.

We introduce a

mutable struct ProblemCache{P,T,F} <: AbstractCache
    p::P     # current point
    X::T     # current gradient (at p)
    value::F # current cost (at p)
    # maybe further fields
end

that might have more fields depending on what information one wants to share. Now the cost and the gradient are functors (see #172, since they are also indicators of whether they are in-place or allocating), so for example

struct MyGradient{PC <: ProblemCache} <: AbstractGradient
    pc::PC
end 

(AbstractGradient would assume the functor has both an in-place and an allocating variant implemented, AbstractMutatingGradient just the mutating one, and so on)

and

struct MyCost{PC <: ProblemCache} <: AbstractCost
    pc::PC
end

as a functor for the cost.

The overhead would be that, for a gradient functor g = MyGradient(...), each call g(M, p) has to check whether isapprox(M, p, g.pc.p), that is, whether the cache is up to date or needs to be updated; but here common operations could be implemented in an

function update_cache!(::Union{<:MyCost,<:MyGradient}, p)
    ...
end

so that common operations / updates need only be implemented once.

On the other hand, this approach still keeps two separate functions for grad and cost and enables the caching we talked about.

Sure, one can also do simple helpers like a

mutable struct CostCache{P,C,F} <: AbstractCache
    p::P
    cost::C
    f::F
end

which could wrap any function f and provide a cache for one function evaluation (or, if one wants to be fancy, for a certain number of the last n function evaluations). This would of course directly be a “cost functor”.
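A runnable version of that CostCache sketch could look as follows (hypothetical, not part of Manopt.jl; == stands in for isapprox(M, p, q) to keep the sketch short):

```julia
# Runnable version of the CostCache sketch above (hypothetical, not part
# of Manopt.jl). It wraps an arbitrary cost f and caches the last
# evaluation; `==` stands in for `isapprox(M, p, q)`.
mutable struct CostCache{P,C,F}
    p::P     # point of the last evaluation
    cost::C  # cached value f(M, p)
    f::F     # wrapped cost function
end

function (cc::CostCache)(M, p)
    if p != cc.p  # cache miss: recompute and store
        cc.cost = cc.f(M, p)
        cc.p = copy(p)
    end
    return cc.cost
end
```

Repeated calls at the same point then evaluate the wrapped f only once.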

kellertuer avatar Nov 29 '22 11:11 kellertuer

I think combined cost-and-gradient calculation and caching are two separate issues? By that I mean that there should be a way to have a combined calculation without caching. Probably the simplest interface for that would be something like

abstract type AbstractObjective end

struct SeparateGradCostObjective{TF,TG} <: AbstractObjective
    f::TF
    g::TG
end

struct CombinedGradCostObjective{TFG} <: AbstractObjective
    fg::TFG
end

get_cost(::AbstractObjective, p) = ...
get_gradient(::AbstractObjective, p) = ...
get_cost_and_gradient(::AbstractObjective, p) = ... # (either calls `fg` for `CombinedGradCostObjective` or `f` and `g` for `SeparateGradCostObjective`)

# and so on with mutating variants, Hessians, multivalued objectives for nonlinear least squares, Jacobians...
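Filled in with naive fallbacks, this sketch could become the following (names mirror the sketch, not an actual Manopt.jl API):

```julia
# Runnable sketch of the proposed objective hierarchy (hypothetical names).
# The combined objective makes one call and discards the unused part in
# get_cost/get_gradient; the separate one simply calls both functions in
# get_cost_and_gradient.
abstract type AbstractObjective end

struct SeparateGradCostObjective{TF,TG} <: AbstractObjective
    f::TF
    g::TG
end

struct CombinedGradCostObjective{TFG} <: AbstractObjective
    fg::TFG  # p -> (cost, gradient)
end

get_cost(o::SeparateGradCostObjective, p) = o.f(p)
get_gradient(o::SeparateGradCostObjective, p) = o.g(p)
get_cost_and_gradient(o::SeparateGradCostObjective, p) = (o.f(p), o.g(p))

get_cost(o::CombinedGradCostObjective, p) = o.fg(p)[1]      # gradient discarded
get_gradient(o::CombinedGradCostObjective, p) = o.fg(p)[2]  # cost discarded
get_cost_and_gradient(o::CombinedGradCostObjective, p) = o.fg(p)
```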

mateuszbaran avatar Nov 29 '22 13:11 mateuszbaran

Well, I would say to some extent those have similar components. It might be that both the cost and the gradient are based on a common (pre-)computation of some h(x), so if at p the cost comes first and computes h(p), then the gradient can reuse that.

This would have the advantage that the two fields cost and grad that are currently within the problem would stay, which is not possible in your sketch, since that would require a common Objective structure?

edit: or to phrase it differently – if the function fg were the total common function h above that always “returns both”, then the gradient and cost would basically be accessors to the cache, which update the cache (so all of p, X, and value) whenever we are at a new point.

I think this might even be a little more flexible than your approach. But sure, if storing a gradient is a problem, then one would need a get_cost that computes both gradient and value but throws the gradient away instead of storing it (that is how I understand your get_cost for the combined variant). So what I miss in your sketch is this: if I call get_cost and after that get_gradient for your last struct, it would compute cost & grad twice (the first run throws away the gradient, the second run throws away the cost).

kellertuer avatar Nov 29 '22 13:11 kellertuer

This would have the advantage that the two fields cost and grad that are currently within the problem would stay, which is not possible in your sketch, since that would require a common Objective structure?

The current two fields seem to be incompatible with cache-less combined cost-gradient computation which I'd like to have available so some rework would have to happen there. Either a combined thing or a third field for the combined computation.

I think this might even be a little more flexible than your approach. But sure, if storing a gradient is a problem, then one would need a get_cost that computes both gradient and value but throws the gradient away instead of storing it (that is how I understand your get_cost for the combined variant). So what I miss in your sketch is this: if I call get_cost and after that get_gradient for your last struct, it would compute cost & grad twice (the first run throws away the gradient, the second run throws away the cost).

In my approach there could also be something like

struct CombinedAndSeparateGradCostObjective{TF,TG,TFG} <: AbstractObjective
    f::TF
    g::TG
    fg::TFG
end

get_cost(obj::CombinedAndSeparateGradCostObjective, p) = obj.f(p)
get_gradient(obj::CombinedAndSeparateGradCostObjective, p) = obj.g(p)
get_cost_and_gradient(obj::CombinedAndSeparateGradCostObjective, p) = obj.fg(p)

and then we could have a

mutable struct CachedObjective{TObj<:AbstractObjective,TP} <: AbstractObjective
    obj::TObj
    cached_at::TP
    cached_values::Dict{Symbol,Any}
end

get_cost(obj::CachedObjective, p) # check if p is obj.cached_at and either returns value from obj.cached_values or computes a new one, deleting contents of cached_values.

which would wrap an arbitrary objective and cache values exactly as you wrote. The advantage is that we could then use combined evaluation without caching. Good caching is not easy; for example, you need an efficient cache-invalidation mechanism. When you have stored the objective and gradient but need to compute just the objective at a new point, there may be no need to compute the gradient, but the old one needs to be marked as invalid (if we have a cache for a single point). Or we could have semi-separate caches for cost and gradient computation.

I'm not sure if a combined objective would be better or worse than three separate functions, my main point is that caching should be optional.
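A self-contained sketch of that CachedObjective idea, including the invalidation point above (moving to a new point clears all cached values at once), could look like this; names are hypothetical and a minimal CombinedGradCostObjective is declared so the snippet runs on its own:

```julia
# Hypothetical sketch of a CachedObjective wrapper. Each accessor also
# stores the by-product of the combined fg call, so a subsequent call to
# the other accessor at the same point is free.
struct CombinedGradCostObjective{TFG}
    fg::TFG  # p -> (cost, gradient)
end

mutable struct CachedObjective{TObj,TP}
    obj::TObj
    cached_at::TP
    cached_values::Dict{Symbol,Any}
end

function _ensure_point!(c::CachedObjective, p)
    if p != c.cached_at
        empty!(c.cached_values)  # invalidate everything from the old point
        c.cached_at = copy(p)
    end
end

function get_cost(c::CachedObjective, p)
    _ensure_point!(c, p)
    if !haskey(c.cached_values, :cost)
        f, X = c.obj.fg(p)
        c.cached_values[:cost] = f
        c.cached_values[:gradient] = X  # keep the by-product as well
    end
    return c.cached_values[:cost]
end

function get_gradient(c::CachedObjective, p)
    _ensure_point!(c, p)
    if !haskey(c.cached_values, :gradient)
        f, X = c.obj.fg(p)
        c.cached_values[:cost] = f
        c.cached_values[:gradient] = X
    end
    return c.cached_values[:gradient]
end
```

With this, get_cost followed by get_gradient at the same point evaluates fg only once, which addresses the double-evaluation concern above.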

mateuszbaran avatar Nov 29 '22 16:11 mateuszbaran

Hm, that sounds valid, but makes the approach I had in mind not doable.

The other disadvantage of your approach is that it would require a lot – really a lot – of rework, since the Problems currently always have different fields for cost/grad/... and those would be joined into one, which would also change all accessors. This is a lot of work, because we do not just have cost/grad but also

  • prox
  • a set of proxes
  • a primal prox
  • a dual prox
  • a prox and a reflection
  • a Hessian
  • an approximate Hessian

with all their accessor functions, which would then need to be rewritten. I am not saying we should not do that, but I will maybe remove the “good first issue” label, because this large change would put it more on the other side of the spectrum of issues.

And yes, sure, partially invalidated data was not yet on my mind, but it should be doable with the dictionary you proposed (good idea!) if one stores tuples (p, grad_p) together with validity flags instead of just one cached_at value.

kellertuer avatar Nov 29 '22 16:11 kellertuer

Yes, right, a unified cost/grad/etc. field would be a significant rework. So maybe a better (or at least easier-to-implement) idea would be to introduce additional fields in specific problems for combined computations? It would at least be non-breaking, so it could be introduced gradually.

mateuszbaran avatar Nov 29 '22 18:11 mateuszbaran

Hm, if that is an urgent issue then sure, we could do that; otherwise it might be my new-year's project – then again, my last new-year's project took until April, but we will see ;)

kellertuer avatar Nov 29 '22 18:11 kellertuer

No, it's not urgent, actually the opposite. I doubt it will ever make anything more than 2x faster which is neat but rarely makes a difference :wink: . I think there are more important things that can be improved in Manopt.

mateuszbaran avatar Nov 29 '22 18:11 mateuszbaran

Oh, then feel free to open more issues – there are several parts that I wrote while learning Julia around 2015/2016, so sure, not all of that may be optimal performance-wise, but I hope I at least gave most interfaces some thought.

For the fg/cache (if so, this would model both): sure, that is not urgent, but with our discussion I think I have a good idea of how one can do this.

kellertuer avatar Nov 29 '22 18:11 kellertuer

Ah, and I think there might be complicated costs/grads where at least caching might be useful; but again, since that would be some rework, one might approach the combination of functions in the same PR.

kellertuer avatar Nov 29 '22 18:11 kellertuer

Oh, then feel free to open more issues – there are several parts that I wrote while learning Julia around 2015/2016, so sure, not all of that may be optimal performance-wise, but I hope I at least gave most interfaces some thought.

Actually, I think performance is mostly fine now, at least in the few solvers that I've used so far and for cost functions that are not too cheap to evaluate. A benchmark that could attract more users by showing advantages over other libraries would be great. I think we could just take any Manopt.jl example, rewrite it in Optim.jl for example as a constrained optimization problem, and show how much faster Manopt.jl converges.

mateuszbaran avatar Dec 01 '22 13:12 mateuszbaran

Ah, that might be a good thing as a tutorial / comparison, but if so maybe also an honest one:

  1. For the Euclidean case we are similar to, or maybe a little slower than (hopefully not 😎 ), Optim.jl.
  2. Look for a good manifold example that is more efficient than the corresponding constrained optimisation.

Sounds like a good plan for a tutorial.

kellertuer avatar Dec 01 '22 13:12 kellertuer