
Does tutorial 2 use the Theseus derivatives through the NLLS? (Or just through the objective?)

bamos opened this issue on Dec 12 '21 · 10 comments

This tutorial parameterizes a quadratic ax^2 + b, with a optimized by PyTorch autograd and b optimized with the Theseus NLLS solver for a given a. The key piece that enables a to be learned is that we pass it back into the same cost function the NLLS optimizer uses, except that we take a gradient step of the cost function w.r.t. a, which doesn't use the derivative information of how b was computed through the NLLS optimizer: [screenshot]

Thus, if I understand correctly, this tutorial isn't using the derivatives through the NLLS optimization process. To check this, I added a torch.no_grad call around the NLLS optimizer to block the gradients through it; it didn't change the output, and the tutorial was still able to fit the quadratics: [screenshots]
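To make the no_grad experiment concrete, here is a minimal, PyTorch-only sketch of the pattern (this is not the tutorial's code; the data, the closed-form solve for b, and the learning rates are hypothetical stand-ins for the Theseus NLLS step). Even with the inner solve blocked from autograd, the outer gradient step on a still fits the quadratic:

```python
import torch

# Hypothetical stand-in for the tutorial's setup: fit y = a*x^2 + b to data,
# with b found by an inner least-squares solve and a updated by autograd.
x = torch.linspace(-1, 1, 50)
y = 2.0 * x**2 + 0.5 + 0.01 * torch.randn(50)  # ground truth a=2, b=0.5

a = torch.tensor(1.0, requires_grad=True)
opt = torch.optim.Adam([a], lr=0.1)

for _ in range(100):
    # Inner "NLLS" solve for b with a held fixed. Wrapping it in no_grad
    # mimics blocking gradients through the Theseus optimizer: for this
    # objective the outer step does not need db/da.
    with torch.no_grad():
        b = (y - a * x**2).mean()  # closed-form least-squares solution for b

    # Outer step: re-evaluate the same cost with a differentiable and b constant.
    loss = ((a * x**2 + b - y) ** 2).mean()
    opt.zero_grad()
    loss.backward()  # only dloss/da is used; b is treated as data
    opt.step()
```

This works because the outer step only needs dL/da with b held fixed, so the whole procedure is effectively coordinate descent on (a, b).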

/cc @vshobha @mhmukadam @luisenp

bamos · Dec 12 '21 01:12

In this particular example, the different functions (with the different b values) are independent. I think this is why the optimization does not need to backpropagate through the NLLS. I'm guessing that in a more complex example, like the motion planning or tactile estimation examples, it would not produce the same result.

vshobha · Dec 12 '21 01:12

> In this particular example, the different functions (with the different b values) are independent. I think this is why the optimization does not need to backpropagate through the NLLS. I'm guessing that in a more complex example, like the motion planning or tactile estimation examples, it would not produce the same result.

Hmm, it seems mismatched to have a tutorial with a title and introduction talking about differentiable NLLS but with contents that don't use the differentiable NLLS part. Maybe we can brainstorm/chat some more soon about other ways of demonstrating these derivatives?

bamos · Dec 12 '21 02:12

I see, yes, that's a good point. Let's discuss next week how to change the functions.

vshobha · Dec 12 '21 02:12

Good catch, @bamos! I was just playing around with something like y = a * exp(-b * x), and the issue persists; I guess once the inner opt finishes, for this particular form of loss, everything except a can be taken as a constant to get gradients properly, regardless of whether dL/da depends on b or not.

In any case, I agree it would be good to find other ways to illustrate these derivatives.

luisenp · Dec 12 '21 02:12

[Updating the function as the previous version wasn't sufficient.] I think the minimum class of computation that will be needed is something like this:

b = argmin_b g1(a, b)
a = g2(b)
b = argmin_b g1(a, b)

objective = g3(a, b)

I think the computation will need to have at least 2 inner loop optimizations of b, with an update to a in between that depends on the optimized b, so that the second inner loop optimization is different from the first.
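A hedged sketch of that computation class, with g1, g2, g3 as above and an unrolled, differentiable gradient descent standing in for the Theseus NLLS solver (the concrete cost, step sizes, and variable names below are made up for illustration):

```python
import torch

def inner_solve_b(a, b_init, steps=50, lr=0.1):
    """Differentiable inner optimization: b ~= argmin_b g1(a, b)."""
    b = b_init
    for _ in range(steps):
        g1 = (a * b - 1.0) ** 2                              # example g1; any smooth cost works
        (grad_b,) = torch.autograd.grad(g1, b, create_graph=True)
        b = b - lr * grad_b                                   # keep the graph so db*/da exists
    return b

theta = torch.tensor(0.7, requires_grad=True)                 # outer parameter

a = theta
b = inner_solve_b(a, torch.zeros((), requires_grad=True))     # first inner optimization
a = a + 0.5 * b                                               # g2: a depends on the optimized b
b = inner_solve_b(a, b)                                       # second inner optimization differs from the first

objective = (a - b) ** 2                                      # g3
objective.backward()  # requires gradients *through* both inner optimizations
print(theta.grad)
```

Here the outer gradient has to flow through the first inner optimization, because the a used in the second inner optimization depends on the first optimized b.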

vshobha · Dec 12 '21 04:12

I was actually thinking that we can probably achieve the desired behavior with a single optimization problem, by tweaking the example @bamos wrote in #26. Basically, we fix a_target and b_target in y = ax^2 + b and, starting from, say, random values for x and y (or something appropriate), we set the outer loss to be L = (a_opt - a_target)^2 + (b_opt - b_target)^2, with x and y as the outer optimization parameters (and a and b as the inner ones). If I understand correctly, this gives us the desired behavior, because minimizing L requires da_opt/dx and da_opt/dy (and likewise for b). Is this correct?
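For concreteness, a rough sketch of this setup (hypothetical names; the inner Theseus solve is replaced by a differentiable normal-equations fit, since fitting y = ax^2 + b is linear in a and b):

```python
import torch

a_target, b_target = 2.0, 0.5

x = torch.randn(20, requires_grad=True)   # outer parameters: the data itself
y = torch.randn(20, requires_grad=True)

outer_opt = torch.optim.Adam([x, y], lr=0.05)

for _ in range(200):
    # Inner solve: (a_opt, b_opt) = argmin_{a,b} || a*x^2 + b - y ||^2,
    # written so that gradients flow back to x and y.
    A = torch.stack([x**2, torch.ones_like(x)], dim=1)   # design matrix
    sol = torch.linalg.solve(A.T @ A, A.T @ y)           # normal equations
    a_opt, b_opt = sol[0], sol[1]

    # Outer loss needs da_opt/dx, da_opt/dy (and likewise for b_opt).
    outer_loss = (a_opt - a_target) ** 2 + (b_opt - b_target) ** 2
    outer_opt.zero_grad()
    outer_loss.backward()
    outer_opt.step()
```

Since a_opt and b_opt are computed differentiably from x and y, the backward pass through the outer loss uses exactly the da_opt/dx and da_opt/dy terms mentioned above.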

As an added benefit, if we do this, we can add an optional addendum at the end of the tutorial that illustrates the sensitivity visualization that @bamos suggested, which looks visually quite nice. To me, this seems a reasonable compromise between having the sensitivity example and @vshobha's point about not accessing x.grad, since in the main body we would have already shown how to backprop to modify x and y using only the Theseus API.

luisenp · Dec 12 '21 12:12

> I was actually thinking that we can probably achieve the desired behavior with a single optimization problem, by tweaking the example @bamos wrote in #26. Basically, we fix a_target and b_target in y = ax^2 + b and, starting from, say, random values for x and y (or something appropriate), we set the outer loss to be L = (a_opt - a_target)^2 + (b_opt - b_target)^2, with x and y as the outer optimization parameters (and a and b as the inner ones). If I understand correctly, this gives us the desired behavior, because minimizing L requires da_opt/dx and da_opt/dy (and likewise for b). Is this correct?

Yeah! I would see this as updating the dataset to optimize some loss, and it definitely showcases the derivatives. Since it directly uses da_opt/d{x,y}, maybe it would make sense to start by plotting those sensitivities without optimizing (like in #26), and then talk about how they can be used to create a "layer" or to optimize objectives that modify the data (or other parameters) to influence some downstream loss that depends on the fitted function. One interesting connection here is that MetaOptNet and Meta-learning with closed-form differentiable solvers can be seen as creating differentiable classifiers for meta-learning that modify some "latent data" to optimize a meta-loss when the problem is convex -- I think mentioning these in the intro would nicely motivate our simpler example of doing the same in non-convex/NLLS settings.
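A small sketch of the plotting-first idea (again with made-up names and a differentiable normal-equations fit standing in for the Theseus solve): compute da_opt/dx directly with autograd, without running any outer optimization, and then visualize it per data point:

```python
import torch

# Synthetic data for y = 2*x^2 + 0.5; only x carries gradients here.
x = torch.linspace(-1.0, 1.0, 20, requires_grad=True)
y = 2.0 * x.detach() ** 2 + 0.5

# Differentiable inner fit of (a, b) via the normal equations.
A = torch.stack([x**2, torch.ones_like(x)], dim=1)
a_opt, b_opt = torch.linalg.solve(A.T @ A, A.T @ y)

# Sensitivity of the fitted coefficient a_opt to each data point x_i;
# these are the per-point quantities one could plot.
(da_dx,) = torch.autograd.grad(a_opt, x)
print(da_dx)
```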

bamos · Dec 12 '21 22:12

Also, the least-squares auto-tuning paper considers differentiating linear least-squares problems (and thus convex problems) w.r.t. hyper-parameters, for example to tune regularization weights; we could connect to that somewhere as well to further motivate the extension to non-convex/NLLS settings.

bamos · Dec 12 '21 22:12

The goal of Tutorial 2 is to show the user how to differentiate through the TheseusLayer abstraction. Since this example does not differentiate through the optimization itself, I'd suggest we just change the intro/title to simplify. As it is, the tutorial makes the point that we can create a TheseusLayer and differentiate through it. What actually ends up getting differentiated is under the hood from the user's perspective -- in this case, it turns out that differentiation through the optimization is not needed.

As for showcasing how to differentiate through the optimization, maybe we could do the following:

  • A new tutorial showing what @bamos and @luisenp have in mind (from the comments alone, it is still difficult to understand the details of why Theseus is necessary for this example).
  • A separate new tutorial showing how a computation with multiple optimizations uses the NLLS.

vshobha · Dec 12 '21 22:12

Thanks for flagging this, @bamos, and thanks everyone for the discussion. Let's follow @vshobha's proposal: we'll merge the edits to T2 that clarify the language and add new tutorials demonstrating the ideas above before closing this issue. Later we can see if it makes sense to consolidate any of them if they flow well together.

mhmukadam · Dec 17 '21 18:12