Kinemation-Skinning-Prototype
Kinemation-Skinning-Prototype copied to clipboard
Unity DOTS custom skinned mesh rendering prototype for the Latios Framework
Kinemation Skinning Prototype
This is a Unity DOTS project built upon Puppet-Test for the purpose of developing a new skinning solution known as Kinemation as part of the Latios Framework.
This prototype is NOT ready for production.
The procedural animation is still done with 10 reference game objects, but these are used to drive 10k entity dancers.
The latest master commit uses DOTS 0.50 and Unity 2020.3.30f1.
How to Test
The project contains two important scenes: Party DOTS and Part DOTS Optimized. The former uses an entity per bone in each dancer. The latter uses dynamic buffers to represent the bones.
Each scene contains a Toggle game object with two checkboxes. If Enable Hybrid is checked, skinning and rendering is handled completely by the Hybrid Renderer using compute deformations. If it is unchecked, skinning and rendering is handled by the Kinemation solution, which uses a different compute deformation design. If Enable Blending is checked, each dancer entity will blend randomly between two of the ten reference game object-based procedural dancers. This can be quite expensive. If unchecked, each entity dancer will copy the pose of one of the reference procedural dancers.
Important
Do not modify the toggle checkboxes while in play mode. The controls affect conversion.
Design and Goals
Kinemation was designed primarily with two goals in mind:
- Provide an API and backend systems for driving skeletal animation from normal ECS systems (no animation graph required).
- Improve GPU performance by sacrificing some CPU performance.
In Kinemation, skinning is driven by skeletons, and there are two types:
- Exposed – Each bone is an Entity
- Intuitive for programmers and saves development time
- Much more viable with DOTS than with game objects
- Automatic hierarchy updates via Transform System
- Great for procedurally-generated data
- Optimized – Bones are stored in packed Dynamic Buffers
- Faster on CPU than Exposed
- Better data locality when animation logic is simple
- Hierarchy updates dispatched manually
- Exported bone support
- Great for fully authored data
Differences from Unity’s Animation and Hybrid Renderer Packages
For optimized hierarchies (what the animation package always uses), Unity stores the bind poses and hierarchy information in dynamic buffers, while Kinemation stores this information in blob assets.
Unity requires meshes be marked as Read-Write enabled at runtime. Kinemation only requires read-write access at conversion time, as it stores skinning data in a blob asset using a custom compute-optimized data structure.
Unity handles bone hierarchies in root space, and reparents all skinned meshes to the root. This matches the Skinned Mesh Renderer behavior of game objects. Kinemation operates in skeletal reference space and leaves the hierarchy as is.
Unity uses Immutable
compute buffers and creates a compute shader dispatch for
each unique mesh. Kinemation uses SubUpdate
compute buffers for per-frame data
and a single compute shader dispatch can process many different meshes at once.
Unity Exclusive Features
Unity’s solution currently supports Blend Shapes. These are not currently supported in Kinemation but will be in a future release. Similarly, Unity supports 4-bone vertex skinning. Kinemation is compute-only.
Unity has occlusion-culling. Kinemation disables this for the entire project. More research is required to make it work correctly with Kinemation.
Lastly, Unity has fully animation clip sampling and blending system with authoring tools and workflows. This prototype version does not have anything like that. The first release version will have simple code-driven animation clip support. Unlike Unity, runtime conversion will be supported (assuming that’s still a thing by then).
Kinemation Exclusive Features
Kinemation has true frustum culling support for skeletons. This means that animated meshes will have proper bounds updated every frame. In addition, only meshes that pass culling and LOD selection will be skinned on the GPU. Note that frustum culling is per-skeleton rather than per-mesh.
Kinemation has a rebinding system which allows for fast and intuitive binding and rebinding of meshes to skeletons at runtime as long as the mesh was designed for the same number of bones as the skeleton. This is useful for things like character customizations where all customizations are skinned to the same base skeleton in the DCC package.
Kinemation does not require any bones to be exposed during conversion of an optimized skeleton.
How It Works
Kinemation has five primary entity archetypes:
- Skinned RenderMesh
- Exposed Bone
- Exposed Skeleton
- Exported Bone
- Optimized Skeleton
The data associated with these archetypes are designed to minimize random memory accesses where possible.
It is also worth noting that Kinemation makes use of several features of the Latios Framework. Also, the current systems are not all placed in the most correct locations in the player loop for simplicity of prototyping. However, now that the design is mostly complete, this section will typically call out the intended correct locations for these systems.
Conversion
Currently, a single conversion system is responsible for generating all components and blobs. Each Skinned Mesh Renderer gets classified as belonging to an exposed skeleton, an optimized skeleton, or an invalid skeleton. Conversion then typically involves converting the entire skeletons and all skinned meshes using it at once. The shared meshes get converted separately to blobs.
For optimized skeletons, the bone hierarchy data needs to be extracted. A
temporary Game Object gets instantiated and
AnimatorUtility.DeoptimizeGameObjectHierarchy()
exposes all the underlying
transforms as GameObjects
for analysis. Currently, a hack component exists for
identifying exposed bones, and this solution likely won’t work for prefabs
referenced in subscenes.
Binding Reactive System
SkeletonMeshBindingReactiveSystem
is the first system to process converted
entities at runtime. As its name suggests, this is a reactive system. It
generates a sync point when necessary and reacts to new and deleted skeletons as
well as new and deleted meshes. It also reacts to rebindings via changes to the
BindSkeletonRoot
component, although this does not incur a sync point.
Existence and bindings are assumed to be relatively persistent, so this system
tries to perform many expensive random-access operations in order to avoid such
operations in per-frame updates.
One of these operations is skinning source data allocation inside the two mesh
compute buffers. One compute buffer stores the mesh vertices, while the other
stores the weights. New and dead mesh blobs get collected into an
UnsafeParallelBlockList
and an IJob
processes this during the sync point.
This job maintains a MeshGpuManager
collection component which handles all the
occupied and free ranges of meshes as well as reference counts. It also
generates an upload command list for a future system.
The second operation is exposed skeleton culling index linking. Each unique exposed skeleton gets assigned an index which needs to be synchronized with all of its exposed bones. These indices will be used to index bit arrays for optimally passing culling information from bones to the skeleton, avoiding expensive cache-thrashing random accesses.
The third operation is the actual binding changes between meshes and skeletons.
These are also recorded in an UnsafeParallelBlockList
. The recorded operations
are sorted and batched during the sync point. Then after the sync point, a job
runs through all skeletons and updates all data related to the bindings.
Skeletons keep a list of bound mesh entities in the dynamic buffers. They also
cache references to GPU memory for the mesh skinning source data. And lastly,
each mesh blob contains bounding spheres for the mesh for each bone. The
skeletons use this to bake bounding spheres for all bound meshes for each bone.
In the case of exposed skeletons, this also gets forwarded to all the individual
bones via random access.
This system is set to run in the presentation structural changes system group. However, it can be run multiple times per frame at different locations if moving its sync point is beneficial.
Optimized Skeleton Bone Export
Optimized skeletons have the ability to export bone transforms to external
entities. Currently, this export is done using a LocalToWorld
[WriteGroup]
override. However, this will be replaced to use LocalToParent
for the official
release. The change will remove the need for a system like
TempFixExportedTransformsSystem
to be scheduled so precariously around the
TransformSystemGroup
when exported bones have children and skeletons have
parents.
Skeleton and Bone Bounds Update
SkeletonBoundsUpdateSystem
is responsible for updating both optimized skeleton
and exposed bone bounds. Optimized skeletons calculate a world-space AABB for
the entire skeleton hierarchy. Exposed bones calculate a world-space AABB for
each bone individually. Both archetypes update a chunk bounds component when
necessary. Change and order version filtering are both used.
Compute Deform Buffer Allocation
Kinemation makes use of the same Shader variables and skinning buffer format as
Unity. That means that any shader graph shader using the Compute Deformation
node or any custom shader compatible with Hybrid Renderer’s compute deformation
skinning will be compatible with Kinemation. Each RenderMesh
gets a material
instance property with an index into a compute deformation vertices buffer. This
buffer reserves memory for all deformed vertices for all deforming RenderMesh
instances in the ECS World. A downside to this is that this buffer gets huge and
wasteful. More research is needed to resolve this.
Because the buffer memory is valuable, indices are reassigned every frame to
reduce fragmentation. ChunkComputeDeformMemoryMetadata
maintains state for
this process. UpdateChunkComputeDeformMetadataSystem
looks for changed meshes
or structural changes and caches vertex counts and chunk counts in the chunk
component. These results are used by AllocateDeformedMeshesSystem
which
iterates the meta-chunks in a single-threaded job and computes prefix sums at
the chunk level. Iterating over meta-chunks like this is incredibly fast and
effectively removes the performance penalty of a single-threaded prefix sum.
Once the prefix sums are computed, a parallel job distributes unique vertices to
the individual entities.
It is worth noting the buffer index assignment assigns lower indices to meshes rendered the previous frame. This was an experiment and currently serves no practical purpose.
Per Frame Setup
The Hybrid Renderer uses the BatchRendererGroup
API to provide drawing
information into the engine. This API invokes a culling callback each time
culling needs to occur. Culling occurs multiple times per frame, usually while
the Scriptable Render Pipeline code is executing. When profiling play mode in
the editor, three callbacks happen per frame. One is for the main camera, one is
for shadows, and one is for the scene view.
Kinemation dispatches skinning during these culling callbacks in order to avoid
skinning invisible meshes. Skinning only has to happen once per visible mesh
across all culling callbacks, so Kinemation caches skinning state in components
on both skeleton and mesh entities. This data needs to be reset each frame prior
to culling callbacks. ClearPerFrameSkinningDataSystem
performs this.
Skeletons cache which bone matrix buffer and bone offsets they used when first skinning a batch of meshes. This is useful when different mesh LODs bound to the same skeleton are selected during different culling callbacks.
Meshes store three Booleans in a byte-sized component. One of these is the selected LOD. This never gets cleared and is kept up-to-date in the LOD selection system. The second is whether the mesh should be rendered for a given callback. This is cleared for every callback and also at the beginning of the frame. The third flag specifies whether the mesh was rendered at least once during the frame by a previous callback. When this flag is set, the mesh has had a skinning dispatch for the frame.
Source Mesh Data Upload
All Compute Buffers are pooled using a type called ComputeBufferTrackingPool
.
A Collection Component wraps this and exists on the worldBlackboardEntity
.
BeginPerFrameMeshSkinningBuffersUploadSystem
is responsible for creating this.
It updates the pool using async readback fences to ensure SubUpdate
buffers
get recycled properly. It then reserves mesh buffer sizes and populates the
source mesh upload buffers in jobs. However, it does not submit the buffers, but
rather leaves that state in a MeshGpuUploadBuffers
collection component for
the next user to finish and dispatch. Usually, the buffers writes are committed
and the upload compute shaders are dispatched during the first culling callback.
However, if no culling callback is performed, this system will perform these
steps at the beginning of the next frame to ensure the GPU stays in sync with
the CPU allocation management.
Latios Hybrid Renderer Replacement
Unfortunately, Unity’s Hybrid Renderer does not support custom culling. This is
something they are working on, but Kinemation
couldn’t wait. The solution is
to replace Unity’s Hybrid Renderer with a custom one using a different culling
callback.
This replacement does several things differently. First, it forces all batch bounds to the max extents (1e9f) for any chunks containing skinned meshes.
Second, it creates the KinemationCullingSuperSystem
.
And third, in the culling callback, it places all the callback arguments into a
collection component and then invokes Update()
on the
KinemationCullingSuperSystem
. Once the update finishes, it returns the
collection component’s JobHandle
.
Skeleton LODs Update
Inside the culling callback, the first system to run is
UpdateSkinnedLODsSystem
. This system runs what is essentially the LOD update
job from Unity’s Hybrid Renderer. However, there are a few modifications.
First, the LODs have to be propagated to the per-mesh-entity flags so skeleton culling can easily look them up via entity in a job.
Second, Unity’s implementation had some issues managing state and change filters
using EntityQuery
dependencies. But with a proper system, this could be
corrected. Admittedly, these fixes haven’t really been tested yet. But the
opportunity is real.
Per Camera Setup
The second system to run is ClearPerCameraFlagsSystem
. And by this point, it
should be pretty obvious what this system does. It clears the per-camera flag on
the byte-sized component attached to skinned meshes.
What is less obvious is that it also assigns its GlobalSystemVersion
to an ICD
on the worldBlackboardEntity
. This allows future systems to detect changes to
the flags relative to this system.
Frustum Culling and Skinning
SkeletonFrustumCullingAndSkinningDispatchSystem
should probably be split into
multiple systems. But for now, it exists as a mega system and does many things.
First, it performs culling on exposed bones. Bones which are visible mark a bit
in a per-thread UnsafeBitArray
associated with their exposed skeleton culling
index, which was set up by the binding reactive system. For 10k skeletons, each
thread has 1.3 kB of bits to potentially mark. This easily fits in L1 cache
making these random-access marks extremely cheap. And that’s important, because
these marks have to happen for each bone. For a skeleton that is visible, nearly
all bones in it will have to mark their visibility.
After this job, the UnsafeBitArrays
get collapsed into one using bitwise-OR
operations.
The third job is CullAndCollectMeshMetadataJob
. This job identifies visible
skeletons and queues up skinning metadata commands. For exposed skeletons, it
uses the UnsafeBitArray
to determine visibility. For optimized skeletons, it
performs frustum culling directly on the optimized skeleton bounds. For each
exposed skeleton, it looks up all mesh flags and checks their state. Meshes yet
to be skinned have the skeleton index in chunk and the mesh dynamic buffer index
stored in a NativeStream
along with the mesh’s Compute Deform material
property index. For all visible meshes with enabled LODs, flags are updated to
mark the mesh as visible per frame and per camera. The job also stores counts
relative to the number of meshes requiring skinning as well as which bone matrix
buffer to use.
The fourth and fifth jobs prefix sum the counts so that a metadata compute buffer can be populated with the correct offsets in parallel.
The sixth job is WriteBuffersJob
, and it calculates the skin matrices and
writes them to a compute buffer. It also writes the metadata buffer for
specifying offsets to the skeletons and meshes the compute shader needs to
index. All of this data has either been calculated via the prefix sums or was
cached in the skeleton. For exposed skeletons, this job pays the cost of
random-accessing the LocalToWorld
of all the individual bones, but it only
does so for skeletons that require skinning.
While all these jobs are running, the main thread stays busy juggling collection components, dispatching mesh uploads, and priming the compute skinning shader. Finally, it force-completes all the jobs so that it can end the writes to the compute buffers and dispatch the compute skinning shader. It finally binds the deformed data to the global buffer.
Unskinned LODs Update
Because skinning requires force-completing jobs, unskinned LODs update in a separate system after skinning dispatch. This is the same as the skinning LOD update, except there are no skinning-specific flags to update.
Batch Renderer Group Visibilities
The last system to run is CullUnskinnedRenderersAndUpdateVisibilitiesSystem
.
This system does the frustum culling logic that the Hybrid Renderer normally
does. However, it only does this check for the unskinned meshes. The Hybrid
Renderer updates the batch renderer group visibilities in the frustum culling
job, so this job does the same. However, for skinned meshes, instead of chunk
bounds and world bounds test, this job uses the change version since the flag
clearing and then checks the flags for chunks that changed.
One other aspect to this system is that it uses temp memory allocated in the job for scratch indices rather than allocate all the memory prior to the job. This is a little faster.
The Batch Skinning Shader
With all the systems discussed, there’s only one piece left in the puzzle, and that is the compute shader. This thing is weird, but fast. It takes advantage of two aspects common to most GPUs.
First, many GPUs have dedicated on-chip memory which maps to HLSL’s
groupshared
variables. In DX11, this memory is required to be 32 kB per thread
group. That is enough to fit 682 bone matrices. So as an optimization, each
skeleton is assigned a threadgroup. Skeletons with less than 683 bones will
store the matrices in groupshared
memory at the beginning of the shader and
then run a group barrier. After this, the thread group processes all meshes
bound to the skeleton, using the cached matrices for fast lookups. This
optimization is especially fast on AMD and Nvidia GPUs. But modern Adreno
(Qualcomm) GPUs, PowerVR GPUs (older iOS devices), and Apple designed GPUs
(modern iOS and Apple Silicon) all have this hardware. Even very recent intel
integrated graphics adopted this on-chip memory in their GEN11 architecture.
That just leaves ARM Mali and older Intel integrated GPUs where this
optimization doesn’t work. But that can be detected at runtime so Kinemation can
use a slightly different shader for those cases. This isn’t implemented yet in
the prototype. One other caveat is that the threadgroup size is set to 1024
threads, which is optimal for desktop GPUs but might be too high for mobile.
Fortunately, the shader can be easily tweaked to work with smaller thread
groups.
The second optimization is with how bone weights are stored. Traditionally, all bone weights for a given vertex are stored adjacent in memory. A second buffer specifies the starts and counts for the weights. There are two problems with this. The first is that we need two buffers just for weights. The second is that this thrashes GPU caches. Unlike CPU caches, the GPU caches are designed for throughput and coalesce reads and writes across multiple GPU threads. In other words, data locality is emphasized and temporal locality is not really a thing. A batch of threads in a thread group work in lockstep for evaluating the first weight, and then the second, and then the third, and so on. So naturally, the GPU could benefit if all the first weights for a batch of threads were packed together, and then the second, and so on. And so that is how Kinemation stores the weights. It steals the high bits of the bone index to encode linked-list offsets to the next weights for a given vertex. It also uses a negative weight to signal that it is the last weight. And batches of vertices are prefixed with batch linked list to help threads along. Much of this logic relies on a shared threadgroup scalar processor which many GPUs also have and get further speedups from. This is known as “Scalarization”.
My personal GPU is an R9 390, which is very similar to the GPU in the PS4 Pro.
One thing about these GCN GPU architectures is that to get good GPU utilization
with max groupshared
memory usage, only 32 vector registers (VGPRs) can be
used. Unlike Unity’s compute shader which uses 36, this one uses between 26 and
28. To be fair, Unity uses smaller threadgroup sizes where this doesn’t matter
as much. But still, achieving high utilization is possible, even when processing
multiple meshes in a single threadgroup.
One last thing, each bone weight has 7 unused bits that I don’t know what to do with. Any ideas?
Performance
Regarding performance, while I made a strong effort to keep CPU costs down as much as possible, my goal was to improve GPU performance.
GPU Performance
The benchmark only involves one skinned mesh per skeleton with that mesh being shared by all entities. This is best-case-scenario for the Hybrid Renderer skinning. Yet despite this, between 1k and 10k dancers, the simulation will become GPU-bound and the main thread will block on compute buffer uploads. At 10k dancers, frames take between 25 and 30 milliseconds on my system.
But with Kinemation, the frustum culling significantly reduces the GPU load. 10k dancers are CPU-bound running well-above 60 FPS. And Windows reports GPU utilization at only 50% compared to the 100% with the Hybrid Renderer. The Hybrid Renderer does perform frustum culling on the mesh instances, so this difference is entirely from compute skinning.
I also tested a slightly modified simulation where frustum culling was disabled to compare raw skinning performance. Kinemation skinning is 1.5x faster.
CPU Performance
This is where things get tricky. When not GPU-bound, Kinemation performs worse. But that’s not because the algorithms are slower. It is because Kinemation is doing a lot more work per frame. It is doing bounds updates for all the bones in the skeleton, which adds up to 220k bones. It also endures a significant constant overhead in the culling callbacks since it runs multiple systems in there.
One possible optimization may be to replace the exposed bone culling algorithm to use bounding spheres instead of bounding boxes.
Another optimization would be running more main-thread logic in Burst.
Lastly, dynamic buffer capacities were left at default values. These should probably be tuned to reflect common use cases.
In any case, if you have ideas to further improve performance, please reach out to me!
What’s Next?
There is quite a bit of work remaining before Kinemation is ready for release. Here are the planned changes:
- Basic animation clip support
- Skinning shader variants for up to 32k bones
- Skinning shader variants for specific GPUs
- Full renaming pass to improve on API consistency
- Properly ordered systems and an installation mechanism
- Removal of the Hybrid Renderer skinning mode
- Support for more prefab configurations
- SubScene and optimization support
- Exported bones using
LocalToParent
instead ofLocalToWorld
- Project configuration validation
- Documentation
Licenses
Kinemation and Latios Framework are licensed under the Unity Companion License.
Everything else including all assets in Assets/_Code and Assets/Scenes and Prefabs is CC0.