Use the furthest descendant when calculating the joint error locally
When we optimize the bit rates, we do so in two phases: first we find the smallest bit rate that satisfies the precision threshold in joint local space, then we iterate to find the optimal bit rate while considering the joint error in object space.
Working in local space is fast, but it doesn't take into account the error from any of the joint's children. To mitigate this, instead of considering only the shell distance of the current joint, we could tally up the shell distances of all the joints in the chain. Each child would contribute its shell distance, and we could store per joint the total distance its parent should consider. Then, when we test bit rates locally, we could quickly evaluate the error for each child as well by using this new summed shell distance. Each joint can still respect its own precision threshold, and all of this remains fast because we do not need to sample each child joint: we assume they have perfect precision for this initial phase. Once we find the smallest bit rate that satisfies our joint and all its children, we can move on to the second phase as we did before.
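A minimal sketch of that tally, assuming a flat joint array where parents always precede their children (the `Joint` struct and its field names here are hypothetical, not ACL's actual types):

```cpp
#include <cassert>
#include <vector>

struct Joint
{
    int parent_index;            // -1 for the root
    float shell_distance;        // this joint's own shell distance
    float total_shell_distance;  // own shell distance plus all descendants'
};

// Walk leaf-to-root (children have higher indices than their parent)
// and fold every joint's total distance into its parent, so each joint
// ends up with the summed shell distance of its whole subtree.
void accumulate_shell_distances(std::vector<Joint>& joints)
{
    for (Joint& joint : joints)
        joint.total_shell_distance = joint.shell_distance;

    for (int i = (int)joints.size() - 1; i > 0; --i)
    {
        const Joint& joint = joints[i];
        if (joint.parent_index >= 0)
            joints[joint.parent_index].total_shell_distance += joint.total_shell_distance;
    }
}
```

With this, the local bit rate test for a joint uses `total_shell_distance` instead of `shell_distance`, which bounds the children's error without sampling them.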
This would most likely speed up compression because we'll start the expensive second phase with a much better initial state. We should have less work left to do before we converge on our solution. It may or may not improve quality as well.
Here are some notes I wrote a while ago on this:
The bit rate optimization is currently done in two phases. Phase 1 tests the local bone error/contribution only. We find the lowest bit rate that satisfies the error threshold for each bone in local space of that bone. Phase 2 iterates over every bone starting at the root and attempts to find the optimal bit rate so that the error for that bone satisfies the error threshold in object space.
Phase 1 is very quick, but if the guess it finds is poor, phase 2 can take a long time to converge, or it might fail. Both phases only ever increment the bit rate to try to reach the error threshold.
Instead, phase 1 should test all the virtual vertices (not just the local 3) that are impacted by that bone: its own and those of its children. To do this, virtual vertices can be calculated in object space and transformed into the space of that bone. These can easily be stored in SoA form since we have at least 3 and always a multiple of 3. Processing this in SIMD will be very fast since no matrix multiplication needs to happen: everything is already in the local space of the bone being tested. This means the error contribution from each intermediate bone is ignored; they are assumed to be full precision. This should give a much more accurate guess after phase 1 since the leaf bone virtual vertices will be taken into account for every bone.
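A rough sketch of the SoA error evaluation (scalar here for clarity, but the loop shape is what a SIMD or auto-vectorized version would process 4 lanes at a time; the `SoAVertices` type is hypothetical):

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Virtual vertices in structure-of-arrays form: all X together, all Y,
// all Z. Counts are multiples of 3 (3 vertices per impacted bone).
struct SoAVertices
{
    std::vector<float> x, y, z;
};

// Max error between the raw vertices and the same vertices run through a
// candidate quantized transform. Both sets already live in the local
// space of the bone being tested, so no matrix multiply is needed and
// the loop reduces to subtract/multiply/compare per lane.
float max_error(const SoAVertices& raw, const SoAVertices& lossy)
{
    float max_sq = 0.0f;
    for (std::size_t i = 0; i < raw.x.size(); ++i)
    {
        const float dx = raw.x[i] - lossy.x[i];
        const float dy = raw.y[i] - lossy.y[i];
        const float dz = raw.z[i] - lossy.z[i];
        const float sq = dx * dx + dy * dy + dz * dz;
        if (sq > max_sq)
            max_sq = sq;
    }
    return std::sqrt(max_sq);
}
```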
Virtual vertices can either be transformed into object space and back into each bone's local space, OR the virtual distances of each bone can be added down the chain. Adding down the chain handles looping chains properly: a bone chain that loops back on itself would still see far virtual vertices, correctly accounting for the error accumulated down the chain, whereas in local space the end of the loop might sit very close to the root and not be representative.
To calculate this new total summed virtual vertex distance, we can take the local space bone transforms and, for each bone, calculate the distance of a dummy virtual vertex very close to 0.0 (the error threshold?), use it as the bone length (so that translation/scale are taken into account), and replace the bone transform with a pure translation of that value along a fixed axis (X). Each bone will then lie flat along the same axis, and we can transform the whole thing into object space, or simply add the transformed dummy virtual vertex distances along the way (taking the max of the 3 components to account for 3D scale?). If this value takes the error threshold into account, and assuming we always find a bit rate that keeps the error under it, can this entirely remove the need for phase 2?
Virtual vertices can be calculated either from the bind pose or for each frame. E.g. with the bind pose we have a set of 3x * num bones in chain * num bones, and otherwise we calculate it for each frame. This can easily be done per segment (and later in parallel). Per frame probably makes more sense: that way, when we test a bit rate for a single bone, we can test all possible virtual vertices of all impacted frames. To calculate the virtual vertices, we only need to transform things into object space once per frame, minimizing the number of matrix multiplications: calculate everything first, then test all virtual vertices for every bit rate in SIMD. If we use the total distance (see above), we only need 1 distance value for each bone/frame (the furthest virtual vertex of the chain). Any virtual vertex that is closer to the bone is guaranteed to have a lower error, no?
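That per-frame reduction to a single furthest distance could look like this sketch (names are hypothetical; it assumes the per-chain distances have already been computed per frame):

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// distances[frame] holds the chain's virtual vertex distances for one
// bone in that frame; only the furthest one is needed, since for a given
// rotation error a closer vertex always moves less.
std::vector<float> furthest_per_frame(const std::vector<std::vector<float>>& distances)
{
    std::vector<float> result;
    result.reserve(distances.size());
    for (const std::vector<float>& frame : distances)
        result.push_back(*std::max_element(frame.begin(), frame.end()));
    return result;
}
```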
This worked better than I expected. See https://github.com/nfrechette/acl/pull/376. It stole some thunder from bit rate expansion, but that's a good problem to have IMO.