Examples of using differentiable least squares
📚 The doc issue
In the provided examples, the least-squares problem optimizes over all of the parameters. However, in some applications, some of the parameters come from a neural network and should be optimized with SGD, while the others can be optimized directly by the least-squares solvers. In Theseus, this is specified by the "inner loop" and "outer loop". Does the current version of PyPose support this?
Suggest a potential alternative/fix
Provide an example in which the state space is a neural network to be learned and the pose is optimized by the least-squares solvers.
@zeroAska Yes, it is supported. You may do something like:

```python
opt1 = torch.optim.SGD(net1.parameters(), lr=1e-3)  # outer loop: gradient descent
opt2 = pp.optim.LM(net2, strategy=strategy)         # inner loop: least squares
for i in range(epochs):
    opt1.step()                  # update network parameters (after a backward pass)
    for j in range(iterations):
        opt2.step(input)         # solve the least-squares problem
```
Bi-level optimization like this will be directly supported in a future release.
Thanks for the quick response. If net1 and net2 are the same nn.Module layer with different subsets of parameters, is there a way to specify which subset of parameters is used for LM and which for SGD, respectively?
You can pass net.module1.parameters() to SGD and net.module2 to LM to achieve this.
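For instance, a minimal sketch (the module names and the residual model are hypothetical placeholders, not PyPose's prescribed API):

```python
import torch
import pypose as pp

class PoseLayer(torch.nn.Module):      # inner-loop sub-module, solved by LM
    def __init__(self):
        super().__init__()
        self.pose = pp.Parameter(pp.randn_SE3())
    def forward(self, points):
        return self.pose @ points      # placeholder residual model

class Model(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.module1 = torch.nn.Linear(3, 3)  # outer-loop sub-module, trained by SGD
        self.module2 = PoseLayer()

net = Model()
opt1 = torch.optim.SGD(net.module1.parameters(), lr=1e-3)
opt2 = pp.optim.LM(net.module2)
```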
Thanks!! In net.module2's least-squares problem above, there is a pose LieTensor in nn.Module.parameters() whose initial value might need manual assignment for each problem and for each training example. An example of such an application is visual odometry, where we need to train the image encoder and perform least squares over the poses. How do we specify the parameter's initial value each time, considering that it is a parameter of an nn.Module?
Another question: if a batch of training data has different poses, are we able to multiply each pose with its corresponding data as a batch and launch different least-squares problems within a batch? For example, in a batch of 2, we have the pose batch [pose1, pose2] and we want to act on the batch [image1, image2] to obtain [pose1 @ image1, pose2 @ image2].
For initialization, it is no different from a neural network: you may perform in-place value assignment for module parameters, e.g., net.module.weight1.data.fill_(value), before solving the problem. More information is here.
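A minimal sketch of this (assuming the net from the sketch above; the parameter name pose and the initial value are illustrative assumptions):

```python
import torch
import pypose as pp

# reassign the pose parameter in place before each new problem
with torch.no_grad():
    net.module2.pose.copy_(pp.identity_SE3())  # or any per-sample initial guess
# equivalently, assign through .data as with any nn.Parameter:
# net.module2.pose.data.copy_(init_pose)
```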
For the second question, if you mean that each time you want to activate different parameters for an LM problem to solve, PyPose currently doesn't directly support this, because LM and GN don't work for stochastic inputs: they don't use gradient descent, so the solution will jump far away from the last iteration and will not converge. However, you can technically do it by defining different optimizers for different parameters, as in the sketch below.
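A hedged sketch of that workaround, reusing the hypothetical PoseLayer from the sketch above (PyPose has no dedicated API for per-sample problems):

```python
import torch
import pypose as pp

batch = [torch.randn(10, 3), torch.randn(10, 3)]   # two training samples
layers = [PoseLayer() for _ in batch]              # one pose parameter per sample
optims = [pp.optim.LM(layer) for layer in layers]  # one LM problem per sample
for opt, sample in zip(optims, batch):
    opt.step(sample)
```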
Many thanks!
As a follow-up question for the above outer-inner loop setup: since the prediction comes from the least-squares solve, how is its gradient with respect to the ground truth propagated through the least-squares layer?
We suggest retaining only the gradients from the last iteration of the inner optimization, as this is more efficient and equivalent to back-propagating through the inner iterative optimization. For more details, you may refer to Sec. 3.4 of this paper.
An easy way to do this is to perform one more model forward pass after the inner optimization, then do the outer optimization.
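In code, the pattern could look like this (a hedged sketch following the bi-level loop above; criterion and target are assumptions):

```python
for epoch in range(epochs):
    for j in range(iterations):       # inner loop: LM steps run under no_grad
        opt2.step(input)

    output = net(input)               # one extra differentiable forward pass
    loss = criterion(output, target)  # outer-level loss

    opt1.zero_grad()
    loss.backward()                   # back-propagates through a single inner step
    opt1.step()                       # outer loop: update network parameters
```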
Thanks for the paper link. I will check it out.
In the provided paper above, does the bi-level optimization (i.e., the inner/outer loop) share the same loss? If the two stages have different losses to optimize, can we still use the trick of keeping only the last iteration's gradients? For example, the inner loop that optimizes the pose might have a label-free loss, while the outer loop that optimizes the network parameters might have a supervised loss.
They don't have to share the same loss. Another example with different loss functions is this paper.
I noticed that the optimizers in PyPose are decorated with @torch.no_grad() (e.g., in optim.GN.step and optim.LM.step), so how can I back-propagate the gradient through the optimizers to the front-end neural network?
After the optimization, we suggest performing another forward operation for the loss, so that it can be back-propagated through the inner-level optimization with only one iteration.
For example, in the MPC example: at Line 231 we don't retain the gradient, but then at Line 293 we perform another round of LQR, which bypasses the multiple iterations and saves computing time.
If the outer level loss is a supervised loss, does the outer level's gradient propagation method in the paper still hold?
Yes, a supervised loss is an easier case.
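As a hedged illustration (pose_gt and the error metric are assumptions, not a prescribed API), a supervised outer loss on the pose solved by LM could be:

```python
# supervised outer loss: relative pose error between estimate and ground truth
pose_pred = net.module2.pose    # pose after the inner LM solve
loss = (pose_gt.Inv() @ pose_pred).Log().norm()
# combined with the extra forward pass above, backward() reaches the front end
loss.backward()
```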
> If the outer level loss is a supervised loss, does the outer level's gradient propagation method in the paper still hold?

Hi, did you figure out how to propagate the gradient through the optimizer? I am facing the same problem; I want to supervise the pose from LM.