Parallelized transform propagation
Objective
Fixes #4697. Hierarchical propagation of properties, currently only Transform -> GlobalTransform, can be a very expensive operation. Transform propagation is a strict dependency for anything positioned in world-space. In large worlds, this can take quite a bit of time, so limiting it to a single thread can result in poor CPU utilization as it bottlenecks the rest of the frame's systems.
Solution
- Move transforms without a parent or children (free-floating (Global)Transform entities) into a separate single-threaded pipeline.
- Chunk the hierarchy based on the root entities and process it in parallel with Query::par_for_each_mut.
- Utilize the hierarchy's specific properties introduced in #4717 to allow for safe use of Query::get_unchecked on multiple threads. Assuming each child is unique in the hierarchy, it is impossible to have an aliased &mut GlobalTransform so long as we verify that the parent of a child is the same one we propagated from. (See the sketch after this list.)
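As a rough illustration of that safety argument, here is a minimal sketch of propagating one subtree while verifying the child's Parent before the unchecked mutable access. This is not the code from this PR: the function name propagate_recursive, the exact query shapes, and the Parent::get() accessor are assumptions for the sake of the example.

```rust
use bevy::prelude::*;

/// Minimal sketch of the safety check described above (illustrative only).
/// Propagates `parent_global` down one subtree.
///
/// # Safety
/// The caller must guarantee that no entity appears more than once in the
/// hierarchy, so each GlobalTransform is reached by at most one task.
unsafe fn propagate_recursive(
    parent_global: &GlobalTransform,
    transform_query: &Query<(&Transform, &mut GlobalTransform), With<Parent>>,
    parent_query: &Query<&Parent>,
    children_query: &Query<&Children>,
    expected_parent: Entity,
    entity: Entity,
) {
    // Verify that this child actually points back at the parent we are
    // propagating from. If the hierarchy were malformed (one child claimed by
    // two parents), two tasks could otherwise alias the same &mut GlobalTransform.
    let Ok(actual_parent) = parent_query.get(entity) else {
        return;
    };
    assert_eq!(
        actual_parent.get(),
        expected_parent,
        "Malformed hierarchy: child's Parent does not match the propagating parent"
    );

    // SAFETY: given the check above and the uniqueness assumption, no other
    // task can hold a mutable reference to this entity's GlobalTransform.
    let (transform, mut global) = match unsafe { transform_query.get_unchecked(entity) } {
        Ok(item) => item,
        Err(_) => return,
    };
    *global = parent_global.mul_transform(*transform);
    let new_parent_global = *global;

    if let Ok(children) = children_query.get(entity) {
        for &child in children.iter() {
            // SAFETY: same uniqueness argument, one level deeper.
            unsafe {
                propagate_recursive(
                    &new_parent_global,
                    transform_query,
                    parent_query,
                    children_query,
                    entity,
                    child,
                );
            }
        }
    }
}
```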
TODO: Benchmark
Changelog
Removed: transform_propagate_system is no longer pub.
Until Bevy has a way to adapt batch size at runtime, I'm not very confident in using par_for_each or par_for_each_mut in the engine.
While a static batch size is insufficient in the general case, par_for_each(_mut) already parallelizes per archetype. Even if an archetype has fewer entities than the batch size, it is not clustered together with the other archetypes and gets a task of its own when parallelizing. The batch size really only affects "wide" archetypes with a sizable number of entities. Even without heavy tuning, just switching to par_for_each(_mut) will help spread the load for systems with heavily fragmented archetypes, which we will definitely see with a component as common as Transform.
Changed the batch size to 1: the top-level roots are few in number, and with deeper hierarchies the work per task is inconsistent and heavy.
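For illustration only, driving the roots with a batch size of 1 might look roughly like the sketch below. This is not the PR's code, and the exact par_for_each_mut call shape is an assumption here: earlier Bevy versions also took a ComputeTaskPool argument, and later releases replaced the method with par_iter_mut.

```rust
use bevy::prelude::*;

/// Illustrative sketch: one task per hierarchy root, so a single deep
/// hierarchy is never lumped into the same batch as its siblings.
fn propagate_roots(
    mut roots: Query<
        (Entity, &Transform, &mut GlobalTransform),
        (Without<Parent>, With<Children>),
    >,
) {
    // Batch size of 1: roots are few, and the work per root is heavy and
    // uneven, so the finest-grained batching balances best across threads.
    roots.par_for_each_mut(1, |(_entity, transform, mut global)| {
        *global = GlobalTransform::from(*transform);
        // ...descend into this root's children from here (see the recursive
        // sketch in the Solution section above).
    });
}
```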
Results are looking good. Profiled this against many_foxes and saw a 4x speedup.

I parallelized the parentless case at the request of @aevyrie. It should also benefit the many_cubes stress test, where there are quite a few of these parentless transforms.
@james7132 can you rebase this to make relative performance testing easier? Someone has made a large number of relevant performance changes since this PR was made.
Personally, I don't think having a comment mid-declaration is a good idea, and it's not necessary here.
Unfortunately, the clippy lint won't shut up unless I do this.
bors retry
Pull request successfully merged into main.
Build succeeded:
- build-and-install-on-iOS
- build-android
- build (macos-latest)
- build (ubuntu-latest)
- build-wasm
- build (windows-latest)
- build-without-default-features (bevy)
- build-without-default-features (bevy_ecs)
- build-without-default-features (bevy_reflect)
- check-compiles
- check-doc
- check-missing-examples-in-docs
- ci
- markdownlint
- run-examples
- run-examples-on-wasm
- run-examples-on-windows-dx12