Perf-optimized WrapCylinder::_make_spiral_path
This is purely a perf change that aims to be a logically equivalent implementation, but with a bunch of computation skipped.
When I was doing the perf passes on recent PRs I noticed that RajagopalModel, which is wrapping heavy, spends a lot of computation in WrapCylinder::_make_spiral_path. I then investigated the method and found that it's doing what appears to be a large amount of computation redundantly. Things like re-fetching radiuses, matrix multiplications, computing closest points to lines in a suboptimal way (if some vectors are known to be unit vectors, some steps of that alg can be skipped).
So I carefully refactored the method--anyone who's worked near this part of the codebase knows how fraught with terror that process is :wink:--and I believe I managed to figure out what it's probably doing (it looks like it's just generating points on a cylinder's surface). I then gradually const-qualified things, renamed them for clarity, jumbled things around, etc. and ended up with this PR.
The overall story is that the method should be (hopefully) easier to read now--which is nice--but it's also faster, which has an impact on macro-scale benchmarks:
| Test Name | lhs [secs] | σ [secs] | rhs [secs] | σ [secs] | Speedup |
|---|---|---|---|---|---|
| RajagopalModel | 5.57 | 0.01 | 5.12 | 0.00 | 1.09 |
| Arm26 | 0.15 | 0.00 | 0.15 | 0.00 | 1.01 |
| ToyDropLanding | 10.49 | 0.00 | 10.58 | 0.03 | 0.99 |
| Gait2354 | 0.18 | 0.00 | 0.19 | 0.00 | 1.00 |
| passive_dynamic | 2.62 | 0.00 | 2.65 | 0.00 | 0.99 |
| passive_dynamic_noanalysis | 1.75 | 0.00 | 1.76 | 0.00 | 1.00 |
| ToyDropLanding_nomuscles | 0.27 | 0.00 | 0.28 | 0.00 | 1.00 |
Effectively, it buys a ~10 % perf-bump in a model that uses a lot of cylinders. I imagine with a little bit more, uh, massaging I'd be able to improve on this a little bit.
Brief summary of changes
- Added
CalcDistanceSquaredPointToLinethat takes aUnitVec(faster) -
const-qualified someprivatemethods inWrapCylinder -
const-qualified someWrapMathmethods - Heavily refactored
_make_spiral_path
Testing I've completed
- Test suite, perf benchmark - need to double-check it in the UI etc
Looking for feedback on...
- Usual stuff, correctness check (if you dare)
CHANGELOG.md (choose one)
- Can update with a generic-sounding message along the lines of "Improved wrapping performance for
WrapCylinder"
The primary goal of this PR is performance, where dropping those refactors would turn this into a fairly trivial tidy-up.
I am not particularly interested in a trivial tidy-up with this PR specifically. I want the 10pct performance boost it is able to yield without having to change the overall algorithim. This is a mathematical rearrangement of the same alg, keeping warts like the 2mm wrapping placement (now on my future hitlist, because it wont work with insect models) and goto's (also on the hitlist) in place.
This PR was created by gradually mathematically rearranging the existing implementation to do the same thing, but without some of the intermediate steps that are particularly harsh on the CPU.
Example: The existing implementation was generating a rotation matrix (m), generating another rotation matrix (R), inverting it (~R), and multiplying it (n = m * ~R), followed by only using one row in the result (n[1]). The mathematical rearrangement notices that the result vector (n[1]) is effectively the dot product of m[1] with each column of R (a right-multiplication: m[1] * R), this is equivalent to left-multiplication with ~R, so we can drop the inverse (~~R == R) and rewrite n = m * ~R as v = R * m[1], where v == n[1]. After this change, only one row of m is ever used (m[1]). An earlier step of calc_wrap_spiral sets m[1] = axis, so drop m completely and calculate v = R*axis. Dropping m removes a lot of the code that was setting it up m (eg cross products for generating an orthogonal axis set, Matrix::set calls, etc.) and some transpose ops.
Similar deal with the point-on-line alg. There's steps in it that mean that you can skip work if you know some preconditions hold (eg one vector is of unit length, because it's always the unit cylinder vector).
Mathematically, the main concern I have with this refactor is ensuring that NormalizeOrZero wasn't injecting unusual behavior into parts of it - that is something that should be reviewed.
You are more familiar with the pitfalls, though, so what should I be looking out for, specifically? What problems did you encounter when you gave this code area a swing? Are there some example files/scripts I can pump through this to see what explodes?
I'm extremely keen to get this perf uptick, given that it's almost entirely isolated to one small block of maths that everyone can review mostly in isolation. The alternative is something like a wide-ranging perf change that hits various APIs, this is comparatively simple, once we all review it.
I do not believe it is wise policy, long-term, for us to avoid these code sections. I think there needs to be a dialogue about what kind of tests, verification, etc. would be "good enough" (keeping in mind what's already lurking in there) for us to confidently start refactoring this wrapping code. Clearly, there's a general ambition to sort it out :)
@tkuchida thanks for reading through this :). I understand that comparing the before/after of this code section is particularly tricky - especially because I made things a little harder by renaming variables (for clarity, it will hopefully make it easier in the future, but it makes your job harder because you have to map r1a onto r1AxialPos etc.)
I've done a quick pass through with your comments and I'm ready for whatever you + @aseth1 spot next :)
I also have a personal branch with more extensive changes (more dramatic rearrangements, slightly bigger perf boost), but I'm going to park that as a local thing and focus on this. This is a ~10 % boost for a relatively small amount of changes. Later rearrangements yield maybe another 5 % but require (e.g.) putting the spiral implementation into a SpiralPath class (to separate "building a spiral" from "iterating on tangent points"), and performing higher-perf quaternion rotations directly (rather than simbody's implementation, which does axis+angle -> SimTK::Quaternion -> SimTK::Rotation -> matrix_multiplication). Some of those changes may ideally involve (e.g.) PRing simbody, reorganizing WrapMath, etc.
Hi @adamkewley, I haven't taken a line-by-line look through this yet, but at a high-level the changes you've made seem reasonable. Of course, the devil is in the details here, but I'm also interested in making these changes provided we can accompany them with the right tests. A lot of the changes here are things I noticed when making the wrapping parallelization fix a while ago.
Speaking of that PR (#2910), I found it helpful to create a set of temporary unit tests to double check the math changes (you helped improve them with randomization too). It could be useful to do that here too; they don't cover the regression tests we would need, but I think it was helpful to give everyone confidence that the math was sound.
If you do find time to get around to the suite of regression tests, I'd be happy to give them a look or a formal code review, whatever's helpful.
Thanks @nickbianco :) - I'll try and cook up a reasonable test suite for this
Closing this - it's been open for a while and I believe #3301 includes some parts of the optimization.
If we're going to use anything from this then it should be sliced out and put into a separate PR etc.