Parallelized transform propagation
Objective
Fixes #4697. Hierarchical propagation of properties, currently only Transform -> GlobalTransform, can be a very expensive operation. Transform propagation is a strict dependency for anything positioned in world-space. In large worlds, this can take quite a bit of time, so limiting it to a single thread can result in poor CPU utilization as it bottlenecks the rest of the frame's systems.
Solution
- Move transforms without a parent or children (free-floating (Global)Transform entities) into a separate single-threaded pipeline.
- Chunk the hierarchy based on the root entities and process it in parallel with Query::par_for_each_mut.
- Utilize the hierarchy's specific properties introduced in #4717 to allow for safe use of Query::get_unchecked on multiple threads. Assuming each child is unique in the hierarchy, it is impossible to have an aliased &mut GlobalTransform so long as we verify that the parent of a child is the same one we propagated from. (See the sketch after this list.)
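As a rough illustration of that safety argument, here is a minimal sketch of propagating one subtree while verifying the child's Parent before the unchecked mutable access. This is not the code from this PR: the function name propagate_recursive, the exact query shapes, and the Parent::get() accessor are assumptions for the sake of the example.

```rust
use bevy::prelude::*;

/// Minimal sketch of the safety check described above (illustrative only).
/// Propagates `parent_global` down one subtree.
///
/// # Safety
/// The caller must guarantee that no entity appears more than once in the
/// hierarchy, so each GlobalTransform is reached by at most one task.
unsafe fn propagate_recursive(
    parent_global: &GlobalTransform,
    transform_query: &Query<(&Transform, &mut GlobalTransform), With<Parent>>,
    parent_query: &Query<&Parent>,
    children_query: &Query<&Children>,
    expected_parent: Entity,
    entity: Entity,
) {
    // Verify that this child actually points back at the parent we are
    // propagating from. If the hierarchy were malformed (one child claimed by
    // two parents), two tasks could otherwise alias the same &mut GlobalTransform.
    let Ok(actual_parent) = parent_query.get(entity) else {
        return;
    };
    assert_eq!(
        actual_parent.get(),
        expected_parent,
        "Malformed hierarchy: child's Parent does not match the propagating parent"
    );

    // SAFETY: given the check above and the uniqueness assumption, no other
    // task can hold a mutable reference to this entity's GlobalTransform.
    let (transform, mut global) = match unsafe { transform_query.get_unchecked(entity) } {
        Ok(item) => item,
        Err(_) => return,
    };
    *global = parent_global.mul_transform(*transform);
    let new_parent_global = *global;

    if let Ok(children) = children_query.get(entity) {
        for &child in children.iter() {
            // SAFETY: same uniqueness argument, one level deeper.
            unsafe {
                propagate_recursive(
                    &new_parent_global,
                    transform_query,
                    parent_query,
                    children_query,
                    entity,
                    child,
                );
            }
        }
    }
}
```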
TODO: Benchmark
Changelog
Removed: transform_propagate_system is no longer pub.
Until Bevy has a way to adapt batch size at runtime, I'm not very confident in using par_for_each or par_for_each_mut in the engine.
While a static batch size is insufficient in the general case, par_for_each(_mut) already parallelizes per archetype. Even if an archetype has fewer entities than the batch size, it is not clustered together with the other archetypes and gets a task of its own when parallelizing. The batch size really only affects "wide" archetypes with a sizable number of entities. Even without heavy tuning, just switching to par_for_each(_mut) will help spread the load for systems with heavily fragmented archetypes, which we will definitely see with a component as common as Transform.
Changed the batch size to 1: the top-level roots are few in number, and with deeper hierarchies the work per task is inconsistent and heavy.
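For illustration only, driving the roots with a batch size of 1 might look roughly like the sketch below. This is not the PR's code, and the exact par_for_each_mut call shape is an assumption here: earlier Bevy versions also took a ComputeTaskPool argument, and later releases replaced the method with par_iter_mut.

```rust
use bevy::prelude::*;

/// Illustrative sketch: one task per hierarchy root, so a single deep
/// hierarchy is never lumped into the same batch as its siblings.
fn propagate_roots(
    mut roots: Query<
        (Entity, &Transform, &mut GlobalTransform),
        (Without<Parent>, With<Children>),
    >,
) {
    // Batch size of 1: roots are few, and the work per root is heavy and
    // uneven, so the finest-grained batching balances best across threads.
    roots.par_for_each_mut(1, |(_entity, transform, mut global)| {
        *global = GlobalTransform::from(*transform);
        // ...descend into this root's children from here (see the recursive
        // sketch in the Solution section above).
    });
}
```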
Results are looking good. Profiled this against many_foxes and saw a 4x speedup.

I parallelized the parentless case at the request of @aevyrie. It should also benefit the many_cubes stress test, where there are quite a few of these parentless transforms.
@james7132 can you rebase this to make relative performance testing easier? Someone has made a large number of relevant performance changes since this PR was made.
Personally, I don't think having a comment mid-declaration is a good idea, and it's not necessary here.
Unfortunately, the clippy lint won't shut up unless I do this.
bors retry
Pull request successfully merged into main.
Build succeeded:
- build-and-install-on-iOS
- build-android
- build (macos-latest)
- build (ubuntu-latest)
- build-wasm
- build (windows-latest)
- build-without-default-features (bevy)
- build-without-default-features (bevy_ecs)
- build-without-default-features (bevy_reflect)
- check-compiles
- check-doc
- check-missing-examples-in-docs
- ci
- markdownlint
- run-examples
- run-examples-on-wasm
- run-examples-on-windows-dx12