Sleipnir comparison
https://github.com/SleipnirGroup/Sleipnir prominently features a benchmark in which CasADi is beaten by a wide margin. Let's analyze what's going on in this test case.
TL;DR
With minor tweaking (e.g. expand, Opti.to_function), CasADi and its plugins can surpass the Sleipnir benchmark results. Sleipnir is a neat implementation and may perform better than CasADi for selected applications; however, such use cases are not present in the benchmark, and the gap will not be as spectacular as the benchmark graph suggests.
Summary
Sleipnir brings some interesting ingredients (alternative AD on a scalar expression graph with partial caching, a clean vertically integrated C++23 codebase, a low-overhead solver and Opti syntax) and may outperform CasADi AD in some circumstances, e.g. when you need to update your problem formulation on the fly.
Sleipnir's CasADi benchmarks are written with Opti, which adds evaluation overhead (removable via Opti.to_function or by modeling in nlpsol).
The flywheel example (a QP) has a pronounced bottleneck in Ipopt, which we cannot really work around. FWIW, the CasADi+fatrop solver does surpass the Sleipnir results.
The cartpole example has a bottleneck in CasADi AD, which is cleared by performing expand (Sleipnir uses scalar expression graphs, like CasADi SX). Codegen could improve it further, but is not needed to surpass the Sleipnir results.
Details
Let's look at evaluation time. "(quad)" means the Ipopt options hessian_constant, jac_c_constant, jac_d_constant are set to 'yes'.
| flywheel eval time, N=5000 | casadi+ipopt | casadi+ipopt (quad) | casadi+ipopt (quad, expand) | casadi+fatrop (expand) | sleipnir |
|---|---|---|---|---|---|
| reported in benchmarking | 4304 ms | 4231 ms | 2899 ms | 3071 ms | 114 ms |
| ..↳ spent in nlpsol | 2690 ms (*) | 2560 ms | 2580 ms | 91 ms | |
| ....↳ spent in casadi AD | 200 ms | 107 ms | 6.5 ms | 8 ms | |
| ....↳ spent in solver internals | 2490 ms | 2453 ms | 2570 ms | 83 ms | |
| cartpole eval time, N=300 | casadi+ipopt | casadi+ipopt (expand **) | sleipnir |
|---|---|---|---|
| reported in benchmarking | 8279 ms | 2461 ms | 2700 ms |
| ..↳ spent in nlpsol | 6670 ms (*) | 1800 ms | |
| ....↳ spent in casadi AD | 5301 ms | 548 ms | |
| ....↳ spent in solver internals | 1369 ms | 1252 ms | |
(*) the difference between the reported time and the time spent in nlpsol is due to Opti overhead; use to_function to eliminate it ( https://web.casadi.org/docs/#extras )
(**) requires using solve(...,...,"symbolicqr")
Setup time has never been an important concern for CasADi. Typically, you perform the setup step once before going into production, and only perform evaluations on the fly. Still, if you need faster setup, for-loop equivalents ( https://web.casadi.org/docs/#for-loop-equivalents ) will help.
Flywheel is in fact a quadratic problem.
Sleipnir's main page says: "For quadratic problems, we compute the Lagrangian Hessian and constraint Jacobians once with no problem structure hints from the user."
In CasADi+Ipopt, the equivalent would be setting Ipopt options hessian_constant jac_c_constant jac_d_constant to 'yes'.
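Sketched as an nlpsol option dictionary (standard CasADi nested-options usage; the nlpsol call is commented out since no NLP is defined in this fragment):

```python
# Tell Ipopt the Hessian and constraint Jacobians are constant, i.e. the
# problem is a QP, so it evaluates them only once instead of per iteration.
opts = {
    "ipopt": {
        "hessian_constant": "yes",
        "jac_c_constant": "yes",  # equality-constraint Jacobian
        "jac_d_constant": "yes",  # inequality-constraint Jacobian
    }
}
# solver = casadi.nlpsol("solver", "ipopt", nlp, opts)
```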
Assessment of Sleipnir AD:
- Variable ~ SXElem
- VariableMatrix ~ SX, but dense and with Eigen interfaces
- Jacobian is sparse
- It's not entirely clear how the Jacobian is efficiently computed; there is no coloring and no forward mode. Given the link to https://en.wikipedia.org/wiki/Automatic_differentiation#Beyond_forward_and_reverse_accumulation, this may well be vertex elimination (which is parked in a branch in casadi: https://github.com/casadi/casadi/blob/ge8/casadi/core/sx_function.cpp#L1650 )
- No separation between Function and graph. The AD happens on the graph directly, and the nodes retain caches (similar to https://github.com/eaertbel/expressiongraph )
- No codegen, no MX
Assessment of Sleipnir interior point solver
- generic sparse
- the solver is not separated from the AD
- implemented in a single Eigen-laden function body with meaningful helper functions
- C++23
- the flywheel benchmark shows that this solver has very little overhead (probably similar to MadNLP?)
Should we add Sleipnir as a plugin?
For context, Sleipnir was written in 2022 to solve trajectory optimization problems for https://github.com/SleipnirGroup/Choreo/. We switched from CasADi + Ipopt + MUMPS to Sleipnir because Sleipnir was faster at the time, CasADi's iteration callback API was underdocumented and frustrating to use in C++, and Sleipnir was easier to build and package for Windows, macOS, and Linux because all dependencies were buildable from source via CMake.
The GUI generates the problem formulation based on the project config, displays the trajectory-in-progress during the solve via an iteration callback, and writes the optimal solution to a JSON file. Direct transcription trajectory optimization problems have banded sparsity patterns, so each row of the KKT matrix only has a handful of values, and the number of nonzeros in the KKT matrix scales linearly with the number of samples.
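As a toy illustration of that scaling claim (a hypothetical single-integrator transcription, not Choreo's actual dynamics): with defect constraints x[k+1] - x[k] - h*u[k] = 0, each constraint row touches only x[k], u[k], and x[k+1], so the Jacobian is banded and its nonzero count grows linearly with the horizon N.

```python
# Sparsity pattern of the defect-constraint Jacobian for a toy
# single-integrator direct transcription with horizon N.
# Variables are ordered [x[0..N], u[0..N-1]]; row k is the k-th defect.
def defect_jacobian_pattern(N):
    rows = []
    for k in range(N):
        # nonzero columns: x[k], x[k+1], u[k]
        rows.append({k, k + 1, (N + 1) + k})
    return rows

pattern = defect_jacobian_pattern(300)
nnz = sum(len(r) for r in pattern)  # 3 nonzeros per row -> linear in N
```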
You could get some Sleipnir autodiff and solver timing for your tables by passing {.diagnostics = true} to Problem.solve() or passing --enable-diagnostics to the benchmark executable.
> Sleipnir's CasADi benchmarks are written with Opti, which adds evaluation overhead (removable via Opti.to_function or by modeling in nlpsol).
In my experience, code using Opti and opti.set_initial/opti.solve/sol.value is much easier to understand and maintain than code using Function or nlpsol because it matches the mathematical formalism much more closely. That's why Sleipnir tried to make a fast version of the Opti API. The benchmarks used the Opti stack with its defaults for an apples-to-apples API comparison.
We could probably add flags to speed things up, but one of Sleipnir's goals has been to avoid problem-specific tuning; instead, the solver itself should leverage the information it has to make those decisions on behalf of the user. Not sure if CasADi has some backwards-compatibility limitations here.
I just took a look at https://web.casadi.org/docs/#extras, but I'm not understanding how to use to_function() in this case or how it helps since I need all the decision variable values anyway. Is Opti slower by default because it's recording a bunch of values the user doesn't need, and to_function() helps trim that set down?
> Assessment of Sleipnir interior point solver ...
> - the solver is not separated from the AD
Fixed as of https://github.com/SleipnirGroup/Sleipnir/pull/857, assuming the user can use Eigen types directly. Thanks for the suggestion; it cleaned up the solvers themselves quite a bit.
> It's not entirely clear how the Jacobian is efficiently computed; there is no coloring and no forward mode.
Jacobian uses reverse mode autodiff with row caching based on the row's linearity. For example, given a constraint Jacobian with M rows (constraints) and N columns (decision variables), reverse mode is run M times where each run computes a row.
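As a toy sketch of that scheme (a hand-written vector-Jacobian product for a two-constraint toy function; not Sleipnir's actual code), a dense Jacobian can be assembled with one reverse pass per row:

```python
# Toy constraints in two variables: f(x) = [x0*x1, x0 + 2*x1]
def f(x):
    return [x[0] * x[1], x[0] + 2.0 * x[1]]

def vjp(x, w):
    # Hand-written reverse pass: returns w^T J for f at x.
    return [w[0] * x[1] + w[1] * 1.0,
            w[0] * x[0] + w[1] * 2.0]

def jacobian(x, m, n):
    J = []
    for i in range(m):  # one reverse-mode pass per output (row)
        e = [1.0 if j == i else 0.0 for j in range(m)]
        J.append(vjp(x, e))
    return J

J = jacobian([3.0, 4.0], 2, 2)  # -> [[4.0, 3.0], [1.0, 2.0]]
```

The row caching described above would then skip re-running the pass for rows whose expressions are linear or constant.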
Most of the speedup comes from eagerly simplifying expression nodes whose arguments are either all constants, one 0, or one 1. For example: https://github.com/SleipnirGroup/Sleipnir/blob/main/include/sleipnir/autodiff/expression.hpp#L151-L167. I didn't see much change from disabling row caching, but the cart-pole solve was 50% slower without expression pruning.
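A minimal sketch of that kind of eager simplification (a toy expression type, not Sleipnir's Expression class): fold products of constants and prune multiplications by 0 or 1 at node-construction time, so the pruned subgraphs never exist.

```python
# Toy expression nodes; only constants and a binary multiply.
class Const:
    def __init__(self, v): self.v = v

class Var:
    pass

class Mul:
    def __init__(self, a, b): self.a, self.b = a, b

def mul(a, b):
    # Eager simplification at construction time:
    if isinstance(a, Const) and isinstance(b, Const):
        return Const(a.v * b.v)           # constant folding
    if isinstance(a, Const):
        if a.v == 0.0: return Const(0.0)  # 0 * x -> 0
        if a.v == 1.0: return b           # 1 * x -> x
    if isinstance(b, Const):
        if b.v == 0.0: return Const(0.0)  # x * 0 -> 0
        if b.v == 1.0: return a           # x * 1 -> x
    return Mul(a, b)                      # general case: build the node
```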
To compute the Hessian, we first compute a gradient expression tree with reverse mode autodiff and cache it. Then when we want the Hessian's value, we take the Jacobian of that gradient with reverse mode autodiff.
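The same row-by-row idea applied to a hand-written gradient gives a toy "Jacobian of the gradient" Hessian (illustrative only; f(x) = x0^2 * x1, with derivatives written out by hand rather than produced by AD):

```python
# Gradient of f(x) = x0^2 * x1, written by hand for the sketch.
def grad(x):
    return [2.0 * x[0] * x[1], x[0] ** 2]

def grad_vjp(x, w):
    # Hand-written reverse pass over grad: returns w^T H.
    return [w[0] * 2.0 * x[1] + w[1] * 2.0 * x[0],
            w[0] * 2.0 * x[0]]

def hessian(x, n):
    # One reverse pass per Hessian row, exactly as for the Jacobian.
    return [grad_vjp(x, [1.0 if j == i else 0.0 for j in range(n)])
            for i in range(n)]

H = hessian([3.0, 4.0], 2)  # -> [[8.0, 6.0], [6.0, 0.0]]
```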
The Hessian is by far the most expensive part of Sleipnir's cart-pole solve. Here are diagnostics for the cart-pole unit test with N = 100.
(Note there's some double-counting of f(x), cₑ(x), and cᵢ(x) I haven't fixed yet.)
We don't do any common subexpression elimination, since that's really expensive in general. We expect the user to do that themselves, as it makes the expressions in their code shorter and easier to read anyway.
> - No codegen, no MX
I assume MX is mainly helpful for DAE solvers? We didn't implement a native matrix autodiff type because optimal control NLPs rarely use them.
We've started looking at Enzyme for autodiff, but we haven't made much headway due to the complexity involved and the fact that we'd lock ourselves into LLVM. We also historically haven't bothered with codegen because incorporating codegen properly into a consuming C++ project's build system is nontrivial.
If CasADi added a Sleipnir plugin for its solvers, Choreo would potentially use it for the faster autodiff if:
- We fix the AD performance regressions for our use case by applying the right flags or making minor tweaks. We don't want to compromise readability by switching to nlpsol.
- We build CasADi and its dependencies fully from source via CMake, to keep our app distribution easy (this seems possible with the other plugins disabled)
- We figure out how to use CasADi's C++ iteration callbacks
CasADi's C++ usage docs seem to be neglected in favor of Python. The matrix and iteration callbacks will require mapping back and forth between CasADi's matrix representation and Eigen's compressed sparse column storage layout.