[solvers] Allow evaluators to provide gradients w/o AutoDiffXd
Consider the cost of using EvaluatorBase for a Cost/Constraint with a huge number of parameters (like the ones we get with MultilayerPerceptrons). On each call to Eval(AutoDiff), we use InitializeAutoDiffXd(x) to create an enormous set of .derivatives() that inefficiently represent the identity matrix, which the Evaluator doesn't even want to use. Furthermore, because the method could be called with any AutoDiffXd on the input, each implementation must implement the chain rule to compute dydx*dxd__, when we really just want dydx.
This PR provides a shortcut, so that implementations of EvaluatorBase can provide the relevant gradients (only). In the short term, derived classes will also have to implement DoEval(AutoDiff...), or throw.
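For concreteness, here is a rough sketch (not code from this PR) of what a derived evaluator could look like if it handed back dy/dx directly; the hook name DoEvalWithGradients is borrowed from the discussion below, and its exact signature here is an assumption.

```cpp
// Hypothetical sketch only: an evaluator that supplies its own gradients,
// bypassing the AutoDiffXd identity-initialization path entirely.
#include <Eigen/Dense>

class MyQuadraticCost {
 public:
  // Assumed hook: write y(x) and dy/dx directly, in double arithmetic.
  void DoEvalWithGradients(const Eigen::Ref<const Eigen::VectorXd>& x,
                           Eigen::VectorXd* y, Eigen::MatrixXd* dydx) const {
    // y = 0.5 x'Qx;  dy/dx = x'Q (for symmetric Q), a 1 x n row vector.
    *y = Eigen::VectorXd::Constant(1, 0.5 * x.dot(Q_ * x));
    *dydx = (Q_ * x).transpose();
  }

 private:
  Eigen::MatrixXd Q_{Eigen::MatrixXd::Identity(3, 3)};
};
```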
I've updated only snopt_solver to use this so far, just to demonstrate the idea.
+@hongkai-dai, +@jwnimmer-tri ... seeking your feedback on this concept.
I was shocked when I dug in to find that we were calling InitializeAutoDiffXd in snopt_solver on every evaluation of a constraint. That's got to be inefficient! This PR doesn't make that any better or worse for the existing Evaluators, but it moves the allocation into EvaluatorBase (where I would like to think we could avoid it). It does avoid that work for Evaluators that choose to implement DoEvalWithGradients directly.
If we care about trajectory optimization performance, I think a little bit of TLC in snopt solver could go a very long way.
solvers/evaluator_base.cc, line 42 at r1 (raw file):
DoEval(x_ad, &y_ad);
*y = math::ExtractValue(y_ad);
*dydx = math::ExtractGradient(y_ad, x.size());
btw -- there is a program flow here where the previous version did not have to allocate a zeros matrix when the derivatives were empty. This new program flow will always allocate that matrix.
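To make that concrete, here is a hedged sketch of the difference, assuming the optional-width overload of math::ExtractGradient that the quoted diff uses; the calls here are illustrative, not lines from the PR.

```cpp
// Illustration only (not the PR diff): when y_ad carries empty derivatives,
// the width-free call can return a zero-column matrix, while passing the
// expected width materializes a zero-filled num_y x num_x matrix.
#include "drake/common/autodiff.h"
#include "drake/math/autodiff_gradient.h"

void CompareExtractFlows(const drake::AutoDiffVecXd& y_ad, int num_x) {
  // Width follows whatever derivatives y_ad actually carries (possibly 0).
  const Eigen::MatrixXd dydx_lazy = drake::math::ExtractGradient(y_ad);
  // Always produces num_x columns, allocating zeros if derivatives are empty.
  const Eigen::MatrixXd dydx_full = drake::math::ExtractGradient(y_ad, num_x);
}
```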
I'm having trouble reproducing the snopt_solver failure on focal. Will try more tomorrow.
solvers/evaluator_base.cc, line 37 at r3 (raw file):
Eigen::MatrixXd* dydx) const {
  // TODO(russt): InitializeAutoDiff here is doing expensive and trivial work.
  // How can we avoid it (in a thread-safe way)?
@jwnimmer-tri -- could I get your thoughts on this?
My current thinking is this:
- One of the main performance advantages of this PR is allowing constraints to avoid these allocations. But I'm worried that there is not going to be a great solution for doing this at the Eval level, and that it would be better to handle the pre-allocations at the level of the Solver.
- Although special constraints like PositionConstraint will be able to implement DoEvalWithGradients directly, important constraints like DirectCollocationConstraint will not (at least not soon). If we are going to add analytical gradients to e.g. MBP, I currently think that the only viable route that will work well with the entire systems framework is to use the AutoDiffXd entry point, compute with doubles, but return the solution inside the gradient fields of an AutoDiffXd return argument (a sketch of that pattern follows below).
As such, if this is the wrong place to avoid these allocations, then I would actually consider closing this PR and focusing first on the primary MBP workflow target.
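To make that route concrete, here is a hedged sketch (not Drake code) of the pattern: enter through the AutoDiffXd overload, do the arithmetic in doubles, and chain-rule the analytic Jacobian back into the AutoDiffXd output. The helper ComputeValueAndJacobian is hypothetical.

```cpp
// Sketch only: an AutoDiffXd-typed DoEval that computes in doubles and then
// packs analytic gradients into the output, chain-ruled against whatever
// derivatives the input x_ad carries.
#include "drake/common/autodiff.h"
#include "drake/math/autodiff.h"
#include "drake/math/autodiff_gradient.h"

// Hypothetical analytic evaluation in doubles (assumed to exist elsewhere).
void ComputeValueAndJacobian(const Eigen::VectorXd& x, Eigen::VectorXd* y,
                             Eigen::MatrixXd* dydx);

void DoEvalSketch(const Eigen::Ref<const drake::AutoDiffVecXd>& x_ad,
                  drake::AutoDiffVecXd* y_ad) {
  const Eigen::VectorXd x = drake::math::ExtractValue(x_ad);
  const Eigen::MatrixXd dxdz = drake::math::ExtractGradient(x_ad);

  Eigen::VectorXd y;
  Eigen::MatrixXd dydx;
  ComputeValueAndJacobian(x, &y, &dydx);

  // Chain rule dy/dz = dy/dx * dx/dz, then repackage as AutoDiffXd.
  *y_ad = drake::math::InitializeAutoDiff(y, dydx * dxdz);
}
```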
FYI, I've just started to work on this. Just wondering whether there is any local progress on the benchmark?
Great! I did finish https://github.com/RobotLocomotion/drake/blob/master/multibody/inverse_kinematics/position_constraint_benchmark.cc and pushed it to master (as the conflicting files message notes here). But that's the only benchmark I've landed so far.
Nice, thanks! I'm working on a relaxed-IK benchmark to perform an A/B comparison for this feature.
solvers/evaluator_base.cc, line 37 at r3 (raw file):
Previously, jwnimmer-tri (Jeremy Nimmer) wrote…
... where we give the DoEval implementations a scratchpad pointer.
Actually, the any argument is probably too difficult in practice. It might be time to finally implement thread_local_unique_ptr. Here's the API outline: https://github.com/jwnimmer-tri/drake/commits/thread_local_ptr
The question would be if we can adjust the find-or-lazy-create implementation to be sufficiently fast to use. It should be do-able, but will probably involve bringing in a third-party library dependency that implements concurrent collections. It would be guided by benchmarking.
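As a plain-C++ illustration of the scratchpad direction (this is not the thread_local_ptr API in that branch, just the general shape of the idea), a function-local thread_local buffer already amortizes the per-Eval allocation:

```cpp
// Illustration only: a per-thread, lazily-sized scratch matrix that can be
// reused across Eval calls without repeated heap allocation. Not the
// proposed thread_local_unique_ptr API.
#include <Eigen/Dense>

Eigen::MatrixXd& GetGradientScratch(int rows, int cols) {
  thread_local Eigen::MatrixXd scratch;
  if (scratch.rows() != rows || scratch.cols() != cols) {
    scratch.setZero(rows, cols);  // Allocates only when the shape changes.
  }
  return scratch;
}
```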
In #17492, InitializeAutoDiff does not allocate any memory for the derivatives, rather just the one malloc for the entire matrix x_ad itself. AutoDiff scalars representing the unit partials are stored as a pair of ints upon construction (their non-zero unit index and their size). Even when reading out a (possibly-scaled) unit partial with AutoDiff::derivatives() const, nothing gets allocated -- it returns a view into a read-only singleton holding the unit vector memory. That is probably sufficient to cover this TODO.
I think we might choose to revisit it later, e.g. for the MultilayerPerceptron case. But closing for now seems best.
Out of curiosity, I ran RelaxedPosIkBenchmark/Iiwa on latest master just now vs. master with this PR cherry-picked in. Both runs had --config snopt enabled.
Branch        Time per iteration
------------  ------------------
master        13.66 ms
master+16335  12.62 ms
Thanks for sharing the result. It seems pretty similar to the original result for that benchmark, i.e., ~7% improvement.
For reference, we were expecting a higher impact on the speed of a nonlinear program w/ only position constraints due to the ~56% improvement per position constraint evaluation.
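A back-of-the-envelope reconciliation (an estimate, not a measurement): reading the "~56% improvement" as per-evaluation time dropping to 44% of its previous value, and letting f be the fraction of total solve time spent in constraint evaluation, Amdahl's law gives

```latex
\frac{T_{\text{new}}}{T_{\text{old}}} = (1-f) + 0.44\,f \approx \frac{12.62}{13.66} \approx 0.92
\quad\Longrightarrow\quad f \approx \frac{1 - 0.92}{0.56} \approx 0.14
```

so under that model, constraint evaluation would account for only roughly 14% of this benchmark's solve time, which would explain the smaller end-to-end gain.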