bevy icon indicating copy to clipboard operation
bevy copied to clipboard

Animatable trait for interpolation and blending

Open james7132 opened this issue 3 years ago • 8 comments

Objective

Allow animation of types other than translation, scale, and rotation on Transforms.

Solution

Add a base trait for all values that can be animated by the animation system. This provides the basic operations for sampling and blending animation values for more than just translation, rotation, and scale.

This implements part of bevyengine/rfcs#51, but is missing the implementations for Range<T> and Color. This also does not fully integrate with the existing AnimationPlayer yet, just setting up the trait.

james7132 avatar Apr 15 '22 12:04 james7132

Removing the controversial label as the RFC has been merged. I'll update this PR and address the comments as soon as I can.

james7132 avatar Aug 03 '23 08:08 james7132

After some heavy research and trying to understand everything I have found out that godot's implementation is similiar to what we are going for and they have so far pretty much done all the heavy lifting.

We pretty much just have to rewrite what godot has in rust and then link it to the bevy-engine. https://github.com/godotengine/godot/tree/master/scene/animation

Also we would have to change somethings so that we can interface with it as in godot's implemention you interface it with the editor which we do not have yet.

e1pupper avatar Sep 21 '23 20:09 e1pupper

Removing the controversial label as the RFC has been merged. I'll update this PR and address the comments as soon as I can.

I don't know what this refers to but bevyengine/rfcs#49 is in Draft, not merged. And to me at least is very much controversial. See my review for context.

djeedai avatar Sep 30 '23 09:09 djeedai

bevyengine/rfcs#51 is the RFC in question, which was written to be independent of the primitives RFC and has been merged. I'll probably rewrite the primitives RFC when it comes time to do reflection-based "animate anything" work. For now, this is strictly focused on composition of skeletal animation and morph targets.

james7132 avatar Sep 30 '23 09:09 james7132

bevyengine/rfcs#51 is the RFC in question, which was written to be independent of the primitives RFC and has been merged. I'll probably rewrite the primitives RFC when it comes time to do reflection-based "animate anything" work.

Ok, this was confusing because the PR description only talks about bevyengine/rfcs#49 and says this PR implements it (which looks to be the case), while there's no mention of bevyengine/rfcs#51.

For now, this is strictly focused on composition of skeletal animation and morph targets.

Then I think a trait for each individual value is the wrong abstraction I think. If we want an optimized system for those two, and I think we do, then we should do specifically that and heavily optimize it. That means trying to gather as many Transform as possible and then batch-interpolate them with a tight SIMD loop (data oriented pipeline). There should be no function call there. This is the only way we can have a performant animation system handling complex/large skeletons at scale. I very much doubt we need the "interpolate a Handle<T>" in that part, there's no way we will have a skeleton with one Handle<T> per bone/joint or per vertex for a morph target. If we want the animation system to also account for Handle<T> animations then that's necessarily at a different, reduced scale, and then we can consider a different design. Here we're doing a generic approach mixing all scales.

djeedai avatar Sep 30 '23 15:09 djeedai

Ok, this was confusing because the PR description only talks about bevyengine/rfcs#49 and says this PR implements it (which looks to be the case), while there's no mention of bevyengine/rfcs#51.

Thanks. Forgot to update that since this PR was written originally while the RFC's review was under way. Done.

That means trying to gather as many Transform as possible and then batch-interpolate them with a tight SIMD loop (data oriented pipeline). There should be no function call there.

That is exactly what Animatable::blend is. Note how it has no self argument of any kind, it takes a iterator of BlendInputs, and there's zero use of dynamic dispatch within any of it's implementations. It's meant to be a way to gather a weighted blend of inputs from an iterator, which can be sourced from a cache-friendly data-oriented data store. The Vec3A, Vec4, and Transform implementations all rely on glam's built in SIMD acceleration. The Vec3 implementation is specialized to use Vec3A's since it's likely to improve performance, even when unaligned. Is there another SIMD operation that I'm missing here that would help that is not being included in that loop?

I very much doubt we need the "interpolate a Handle<T>" in that part

These other implementations are there for when we enable reflection-based animation, and will not be directly used. The trait was designed to support the immediate use case of skeletal/morph target animation, but not get in the way of a reflection based "animate anything" down the line. Handle<T> was primarily desirable for sprite and material swap animation.

there's no way we will have a skeleton with one Handle<T> per bone/joint or per vertex for a morph target. If we want the animation system to also account for Handle<T> animations then that's necessarily at a different, reduced scale, and then we can consider a different design. Here we're doing a generic approach mixing all scales.

As mentioned in the PR description, this PR only aims to introduce the trait, not dictate how it's integrated into existing components or systems. The current plan is to use this in conjunction with the animation graph presented in the composition RFC to enable composable skeletal/morph target animation, and that won't use implementations other than Vec3(A), Quat, and Transform, nor any form of dynamic dispatch.

james7132 avatar Sep 30 '23 23:09 james7132

Sorry for dragging that, I still don't understand how this trait is supposed to work with the current system in a way that's performant. I think the main red flag I have is impl Animatable for Quat (and Vec3), which hints at a "one call per translation/rotation/scale value".

That is exactly what Animatable::blend is. Note how it has no self argument of any kind, it takes a iterator of BlendInputs, and there's zero use of dynamic dispatch within any of it's implementations. It's meant to be a way to gather a weighted blend of inputs from an iterator, which can be sourced from a cache-friendly data-oriented data store.

I was writing an answer with pseudo-code and I think I understood what's going on. You intend blend() to be called once with the entire collection of all Transform from all sources, including all animations being blended, possibly from different animation roots, is that correct? That is, ideally, you intend <Transform as Animatable>::blend() to be called exactly once per frame, no matter how many animations from how many graphs are being animated. Did I get that right?

Assuming this is the case, this is a lot better than I thought, thanks for your patience! I do have a couple of remarks though for performance:

  • The layout of BlendInput<T> is not great, and becomes worse the bigger T is and the higher its alignment is. I think the type will always have padding, no matter what T we use (assuming at least 4 bytes like f32, because it's unlikely we'll blend some u8). This will waste some space on the CPU cache and increase cache misses. I think it would be best to tightly pack the T values, and tightly pack the weight values (so struct of arrays, instead of array of structs). Reading from 2 different arrays in a hot loop is not too bad perf-wise if the loop is long enough.
  • The additive per BlendInput forces the if check to be inside the hot loop. I think this is bad, given you will likely have sequences of 64~256 transforms (bones/joints) with the same constant value, then comes the data for another animation which changes that. I'd try to find a way to move that if outside the loop. Maybe we need 2 loops, one for additive blend and one for non-additive one. Or maybe we need to replace a single flat iterator with a nested one (iterate over batches, which can be additive or not, and each batch has a {value + weight} pair only, inheriting the additive-ness from the batch itself).

The Vec3A, Vec4, and Transform implementations all rely on glam's built in SIMD acceleration. The Vec3 implementation is specialized to use Vec3A's since it's likely to improve performance, even when unaligned.

Fair, but the Vec3 doesn't (even thought there's probably no SIMD in glam for that) and neither does the Quat one. Given that it's likely more performant and space-effective to split Transform animation by tracks/curves along each translation/rotation/scale (plenty of animations have unit scale for example, this is a big waste to blend thousands of Vec3::ONE) I expect impl Animatable for Quat to be one of the most used code of the entire PR (for skeletal animation), closely followed by impl Animatable for Vec3/A, and impl Animatable for Transform be marginal for those very complex animations that animate all of translation/rotation/scale at once (but even then we could split the processing).

Is there another SIMD operation that I'm missing here that would help that is not being included in that loop?

It's not a clear cut to me that {load + lerp + store} in SIMD is faster than lerping on CPU. This definitely used to be false, although more recent architectures are tremendously faster at load/store. In any case, this probably needs 1) removing the if from the middle as previously mentioned, and 2) possibly unrolling the loop manually (say, 4 times, to keep SIMD registry use low) to pipeline the loads/stores. That way I think this can become a win over pure CPU. Another trick that unrolling 4 times helps is loading 4 Vec3 tightly packed in memory into 3 registers directly without CPU copy (48 bytes), shuffling to convert to Vec3A directly inside the SIMD registers, then blend and do the opposite to store back to CPU. I don't think the compiler would be able to do that directly from the current code, it will likely have to copy the Vec3 into a stack-allocated Vec3A, load that (so 4 loads instead of 3) and then on store do yet another copy from Vec3A to Vec3. All of that needs to be profiled obviously.

Ok so overall I don't have any more concern performance-wise provided this PR allows the above suggestions to be implemented in the future (or proven they're not needed). This means that my last blockers I think would be:

  1. That Quat::interpolate() implemented manually (although, see remark below, could be ignored for now and fixed in a later PR, it's just that this looks like a reimplementing of glam)
  2. The rotation blend in Transform::blend() that I find pretty dodgy. Maybe I got the maths wrong but weighting Quats looks incorrect. The glam implementation of impl Mul<f32> for Quat clearly casts the Quat to Vec4 so doesn't interpret it as a rotation in any way. And scaling all components of a Quat as far as I know produce nothing that looks like a rotation.
  3. The per-value additive and the if inside the hot loop, and the layout of BlendInput, which we can't really change once the PR is merged without breaking everything.

We can optimize the individual Animatable implementations later.

djeedai avatar Oct 01 '23 12:10 djeedai

I've submitted several PRs to James's branch to address some of the reviewer comments: https://github.com/james7132/bevy/pulls

Assuming those are all accepted we need to:

  1. Decide whether animations or BlendInput should control whether or not blending is additive.
  2. Consider improving the layout of BlendInput<T>.

I think that the following work should be handled in follow-up PRs (this PR is very old by now):

  1. Benchmarking and optimizing interpolation performance (including the question about Vec3A).
  2. Adding handle/asset-based interpolation (the assets v2 rework broke existing code).
  3. Providing an end-user API (and examples) for this feature.

alice-i-cecile avatar Jan 11 '24 22:01 alice-i-cecile

I don't have much to add to the review other than what @rodolphito said, looks good otherwise.

pcwalton avatar Jan 29 '24 20:01 pcwalton

There's a bigger problem with switching to nlerp i've discovered: Glam's nlerp is not commutative, nor can it easily be made commutative. This is because it is normalizing after interpolating, which makes sense for blending two quaternions, but it also means that its no longer a truly linear operation. To commutatively blend an arbitrary number of quaternions, one would need to defer the normalization til the very end. This is problematic however because the intermediary values are not true rotation quaternions because they are not unit length. We'd want to encode this in the type system somehow. I think it can be done with a PR to glam, but it might be contentious as it adds complexity and would need some extra implementations. There would need to be a DenormalizedQuaternion type that implements Into<Quaternion>, and have specialized lerp methods that take combinations of DenormalizedQuaternions and Quaternions to linearly interpolate them. Then at the end it would be silently normalized and converted into a Quaternion as soon as it gets used.

For this reason I am opting to leave slerp in for the moment, because this would need further discussion and upstream work in glam.

atlv24 avatar Jan 31 '24 05:01 atlv24

I've realized there actually is a simpler way forward that can be implemented now without changes to glam.

atlv24 avatar Jan 31 '24 05:01 atlv24

BlendInput says additive determines "Whether or not to additively blend this input into the final result" which is actually not what its doing. Additive BlendInputs are being applied in the order they are given in the blend stack, meaning if a non-additive BlendInput is present after an additive one, the additive one is not blended into the final result as claimed. This will look wrong and be unintuitive, compared to most animation solutions ive worked with

atlv24 avatar Jan 31 '24 06:01 atlv24

@rodolphito I've been told that what this PR does actually matches what Unity does. In Unity override layers cancel out the results of previous additive layers.

pcwalton avatar Feb 02 '24 00:02 pcwalton

For posterity, the discussion that led to this being merged can be found here: https://discord.com/channels/691052431525675048/774027865020039209/1202358388466384956

james7132 avatar Feb 02 '24 21:02 james7132