[solvers] Allow evaluators to provide gradients w/o AutoDiffXd
Consider the cost of using EvaluatorBase for a Cost/Constraint with a huge number of parameters (like the ones we get with MultilayerPerceptrons). On each call to Eval(AutoDiff), we use InitializeAutoDiffXd(x) to create an enormous set of .derivatives() that inefficiently represent the identity matrix, which the Evaluator doesn't even want to use. Furthermore, because the method could be called with any AutoDiffXd on the input, each implementation must implement the chain rule to compute dydx*dxd__, when we really just want dydx.
This PR provides a shortcut, so that implementations of EvaluatorBase can provide the relevant gradients (only). In the short term, derived classes will also have to implement DoEval(AutoDiff...), or throw.
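For concreteness, here is a rough sketch (not code from this PR) of what a derived evaluator could look like if it handed back dy/dx directly; the hook name DoEvalWithGradients is borrowed from the discussion below, and its exact signature here is an assumption.

```cpp
// Hypothetical sketch only: an evaluator that supplies its own gradients,
// bypassing the AutoDiffXd identity-initialization path entirely.
#include <Eigen/Dense>

class MyQuadraticCost {
 public:
  // Assumed hook: write y(x) and dy/dx directly, in double arithmetic.
  void DoEvalWithGradients(const Eigen::Ref<const Eigen::VectorXd>& x,
                           Eigen::VectorXd* y, Eigen::MatrixXd* dydx) const {
    // y = 0.5 x'Qx;  dy/dx = x'Q (for symmetric Q), a 1 x n row vector.
    *y = Eigen::VectorXd::Constant(1, 0.5 * x.dot(Q_ * x));
    *dydx = (Q_ * x).transpose();
  }

 private:
  Eigen::MatrixXd Q_{Eigen::MatrixXd::Identity(3, 3)};
};
```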
I've updated only snopt_solver to use this so far, just to demonstrate the idea.
+@hongkai-dai, +@jwnimmer-tri ... seeking your feedback on this concept.
I was shocked when I dug in to find that we were calling InitializeAutoDiffXd in snopt_solver on every evaluation of a constraint. That's got to be inefficient! This PR doesn't make that any better or worse for the existing Evaluators, but it moves the allocation into EvaluatorBase (where I would like to think we could avoid it). It does avoid that work for Evaluators that choose to implement DoEvalWithGradients directly.
If we care about trajectory optimization performance, I think a little bit of TLC in snopt solver could go a very long way.
solvers/evaluator_base.cc, line 42 at r1 (raw file):
DoEval(x_ad, &y_ad);
*y = math::ExtractValue(y_ad);
*dydx = math::ExtractGradient(y_ad, x.size());
btw -- there is a program flow here where the previous version did not have to allocate a zeros matrix when the derivatives were empty. This new program flow will always allocate that matrix.
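To make that concrete, here is a hedged sketch of the difference, assuming the optional-width overload of math::ExtractGradient that the quoted diff uses; the calls here are illustrative, not lines from the PR.

```cpp
// Illustration only (not the PR diff): when y_ad carries empty derivatives,
// the width-free call can return a zero-column matrix, while passing the
// expected width materializes a zero-filled num_y x num_x matrix.
#include "drake/common/autodiff.h"
#include "drake/math/autodiff_gradient.h"

void CompareExtractFlows(const drake::AutoDiffVecXd& y_ad, int num_x) {
  // Width follows whatever derivatives y_ad actually carries (possibly 0).
  const Eigen::MatrixXd dydx_lazy = drake::math::ExtractGradient(y_ad);
  // Always produces num_x columns, allocating zeros if derivatives are empty.
  const Eigen::MatrixXd dydx_full = drake::math::ExtractGradient(y_ad, num_x);
}
```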
I'm having trouble reproducing the snopt_solver failure on focal. Will try more tomorrow.
solvers/evaluator_base.cc, line 37 at r3 (raw file):
Eigen::MatrixXd* dydx) const {
  // TODO(russt): InitializeAutoDiff here is doing expensive and trivial work.
  // How can we avoid it (in a thread-safe way)?
@jwnimmer-tri -- could I get your thoughts on this?
My current thinking is this:
- One of the main performance advantages of this PR is allowing constraints to avoid these allocations. But I'm worried that there is not going to be a great solution for doing this at the Eval level, and that it would be better to handle the pre-allocations at the level of the Solver.
- Although special constraints like PositionConstraint will be able to implement DoEvalWithGradients directly, important constraints like DirectCollocationConstraint will not (at least not soon). If we are going to add analytical gradients to e.g. MBP, I currently think that the only viable route that will work well with the entire systems framework is to use the AutoDiffXd entry point, compute with doubles, but return the solution inside the gradient fields of an AutoDiffXd return argument (a sketch of that pattern follows below).
As such, if this is the wrong place to avoid these allocations, then I would actually consider closing this PR and focusing first on the primary MBP workflow target.
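To make that route concrete, here is a hedged sketch (not Drake code) of the pattern: enter through the AutoDiffXd overload, do the arithmetic in doubles, and chain-rule the analytic Jacobian back into the AutoDiffXd output. The helper ComputeValueAndJacobian is hypothetical.

```cpp
// Sketch only: an AutoDiffXd-typed DoEval that computes in doubles and then
// packs analytic gradients into the output, chain-ruled against whatever
// derivatives the input x_ad carries.
#include "drake/common/autodiff.h"
#include "drake/math/autodiff.h"
#include "drake/math/autodiff_gradient.h"

// Hypothetical analytic evaluation in doubles (assumed to exist elsewhere).
void ComputeValueAndJacobian(const Eigen::VectorXd& x, Eigen::VectorXd* y,
                             Eigen::MatrixXd* dydx);

void DoEvalSketch(const Eigen::Ref<const drake::AutoDiffVecXd>& x_ad,
                  drake::AutoDiffVecXd* y_ad) {
  const Eigen::VectorXd x = drake::math::ExtractValue(x_ad);
  const Eigen::MatrixXd dxdz = drake::math::ExtractGradient(x_ad);

  Eigen::VectorXd y;
  Eigen::MatrixXd dydx;
  ComputeValueAndJacobian(x, &y, &dydx);

  // Chain rule dy/dz = dy/dx * dx/dz, then repackage as AutoDiffXd.
  *y_ad = drake::math::InitializeAutoDiff(y, dydx * dxdz);
}
```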
FYI, I've just started to work on this. Just wondering whether there is any local progress on the benchmark?
Great! I did finish https://github.com/RobotLocomotion/drake/blob/master/multibody/inverse_kinematics/position_constraint_benchmark.cc and pushed it to master (as the conflicting files message notes here). But that's the only benchmark I've landed so far.
Nice, thanks! I'm working on a relaxed-IK benchmark to perform an A/B comparison for this feature.
solvers/evaluator_base.cc, line 37 at r3 (raw file):
Previously, jwnimmer-tri (Jeremy Nimmer) wrote…
... where we give the DoEval implementations a scratchpad pointer.
Actually, the any argument is probably too difficult in practice. It might be time to finally implement thread_local_unique_ptr. Here's the API outline: https://github.com/jwnimmer-tri/drake/commits/thread_local_ptr
The question would be if we can adjust the find-or-lazy-create implementation to be sufficiently fast to use. It should be do-able, but will probably involve bringing in a third-party library dependency that implements concurrent collections. It would be guided by benchmarking.
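As a plain-C++ illustration of the scratchpad direction (this is not the thread_local_ptr API in that branch, just the general shape of the idea), a function-local thread_local buffer already amortizes the per-Eval allocation:

```cpp
// Illustration only: a per-thread, lazily-sized scratch matrix that can be
// reused across Eval calls without repeated heap allocation. Not the
// proposed thread_local_unique_ptr API.
#include <Eigen/Dense>

Eigen::MatrixXd& GetGradientScratch(int rows, int cols) {
  thread_local Eigen::MatrixXd scratch;
  if (scratch.rows() != rows || scratch.cols() != cols) {
    scratch.setZero(rows, cols);  // Allocates only when the shape changes.
  }
  return scratch;
}
```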
In #17492, InitializeAutoDiff does not allocate any memory for the derivatives, rather just the one malloc for the entire matrix x_ad itself. AutoDiff scalars representing the unit partials are stored as a pair of ints upon construction (their non-zero unit index and their size). Even when reading out a (possibly-scaled) unit partial with AutoDiff::derivatives() const, nothing gets allocated -- it returns a view into a read-only singleton holding the unit vector memory. That is probably sufficient to cover this TODO.
I think we might choose to revisit it later, e.g. for the MultilayerPerceptron case. But closing for now seems best.
Out of curiosity, I ran RelaxedPosIkBenchmark/Iiwa on latest master just now vs. master with this PR cherry-picked in. Both runs had --config snopt enabled.
Branch        Time per iteration
------------  ------------------
master        13.66 ms
master+16335  12.62 ms
Thanks for sharing the result. It seems pretty similar to the original result for that benchmark, i.e., ~7% improvement.
For reference, we were expecting a higher impact on the speed of a nonlinear program w/ only position constraints due to the ~56% improvement per position constraint evaluation.
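A back-of-the-envelope reconciliation (an estimate, not a measurement): reading the "~56% improvement" as per-evaluation time dropping to 44% of its previous value, and letting f be the fraction of total solve time spent in constraint evaluation, Amdahl's law gives

```latex
\frac{T_{\text{new}}}{T_{\text{old}}} = (1-f) + 0.44\,f \approx \frac{12.62}{13.66} \approx 0.92
\quad\Longrightarrow\quad f \approx \frac{1 - 0.92}{0.56} \approx 0.14
```

so under that model, constraint evaluation would account for only roughly 14% of this benchmark's solve time, which would explain the smaller end-to-end gain.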