Improve Queue Phase parallelization and other small optimizations
Objective
A mutable reference to PipelineCache prevents a good chunk of the systems in the queue phase from parallelizing with each other, even though its primary use with Specialized*Pipelines<T> should rarely require mutation. Likewise, &mut RenderPhase<I> on each render phase requires exclusive access, which prevents multiple systems from enqueuing phase items at the same time.
Solution
Selectively leverage internal mutability and thread-local storage in a way that avoids adding per-entity overhead.
RenderPhase
- Use ThreadLocal inside RenderPhase to create thread-local queues of phase items.
- Add a phase_scope for enqueuing on each thread separately without locking overhead.
- Collect all of the thread-local phase items into one vec before sorting.
- Shrink the size of IDs used to reduce the amount of memory being shuffled around.
- Use Vec::sort_unstable_by_key for a slight sorting speedup.
- Change all of the &mut RenderPhase<T> to &RenderPhase<T>, allowing for increased parallelism.
- Can we just shove all of these into a std::collections::BinaryHeap and drain?
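For reviewers who want the shape of the RenderPhase changes above without reading the diff, here is a minimal standard-library sketch. It is not the code in this PR: PhaseItem, its fields, and the sort method are hypothetical, and a Mutex locked once per phase_scope call stands in for ThreadLocal so the example has no external dependencies.

```rust
use std::sync::Mutex;

/// Hypothetical phase item: a sort key plus a compact 32-bit entity ID.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct PhaseItem {
    pub sort_key: u32,
    pub entity: u32,
}

/// Simplified stand-in for a render phase.
pub struct RenderPhase {
    // One finished queue per `phase_scope` call. The PR keeps these in a
    // `ThreadLocal`; a `Mutex` locked once per scope is used here instead
    // so the sketch needs only the standard library.
    queues: Mutex<Vec<Vec<PhaseItem>>>,
    /// Items in draw order, valid after `sort`.
    pub sorted: Vec<PhaseItem>,
}

impl RenderPhase {
    pub fn new() -> Self {
        Self { queues: Mutex::new(Vec::new()), sorted: Vec::new() }
    }

    /// Enqueue through a shared reference. Pushes inside `f` touch only a
    /// local `Vec`, so there is no per-item synchronization.
    pub fn phase_scope(&self, f: impl FnOnce(&mut Vec<PhaseItem>)) {
        let mut local = Vec::new();
        f(&mut local);
        // Single lock per scope, after all items have been queued locally.
        self.queues.lock().unwrap().push(local);
    }

    /// Collect every queue into one `Vec`, then sort it by key.
    pub fn sort(&mut self) {
        self.sorted.clear();
        for mut queue in self.queues.get_mut().unwrap().drain(..) {
            self.sorted.append(&mut queue);
        }
        self.sorted.sort_unstable_by_key(|item| item.sort_key);
    }
}
```

Because phase_scope takes &self, several queue systems can hold &RenderPhase and enqueue concurrently; only the sort step needs exclusive access.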
PipelineCache
- Introduce LockablePipelineCache, a wrapper around RwLock<PipelineCache>.
- Change Specialized*Pipelines to take a &LockablePipelineCache instead of &mut PipelineCache. Update systems to match.
- Introduce two systems, lock_pipeline_cache (in Extract) and unlock_pipeline_cache (in PhaseSort), that take the existing PipelineCache resource and wrap it in a LockablePipelineCache, and vice versa. This allows Render phase draw functions to read from the cache without any contention, at the cost of one small command buffer.
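The wrapper described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's implementation: the PipelineCache here is a toy map from a hypothetical u64 specialization key to a u32 pipeline ID, and the lock, unlock, and specialize methods are made-up names that mirror what the two systems and Specialized*Pipelines would do.

```rust
use std::collections::HashMap;
use std::sync::RwLock;

/// Toy stand-in for the real cache: specialization key -> pipeline ID.
#[derive(Default)]
pub struct PipelineCache {
    pipelines: HashMap<u64, u32>,
}

impl PipelineCache {
    /// Number of cached pipelines.
    pub fn len(&self) -> usize {
        self.pipelines.len()
    }
}

/// Wrapper that lets many systems share the cache through `&self`.
pub struct LockablePipelineCache(RwLock<PipelineCache>);

impl LockablePipelineCache {
    /// Roughly what `lock_pipeline_cache` does: wrap the owned cache.
    pub fn lock(cache: PipelineCache) -> Self {
        Self(RwLock::new(cache))
    }

    /// Roughly what `unlock_pipeline_cache` does: hand the plain cache back
    /// so later readers pay no locking cost.
    pub fn unlock(self) -> PipelineCache {
        self.0.into_inner().unwrap()
    }

    /// Read-mostly lookup: the common hit path takes only a read lock.
    pub fn specialize(&self, key: u64) -> u32 {
        // The read guard is a temporary and is dropped at the end of this
        // statement, before the write lock below is requested.
        let hit = self.0.read().unwrap().pipelines.get(&key).copied();
        if let Some(id) = hit {
            return id;
        }
        // Miss: take the write lock. `entry` re-checks the key in case
        // another thread inserted it between the two locks.
        let mut cache = self.0.write().unwrap();
        let next_id = cache.pipelines.len() as u32;
        *cache.pipelines.entry(key).or_insert(next_id)
    }
}
```

Once pipelines are warm, every call takes the shared read path, so systems that specialize pipelines no longer serialize on a &mut PipelineCache parameter.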
Performance
I tested this on the default configuration of many_foxes, which has several heavy queue systems.
The direct effects are as expected. The queue phase results show a 30% speedup on my machine due to the increased parallelism. (yellow is this PR, red is main)

For the sort phase, there is a slight regression due to the additional copy into the sorted vec before sorting. This is partly offset by shrinking the draw function type sizes, and can be further addressed by using more optimized sorting algorithms (e.g. voracious).

Overall, this yields roughly a 0.3ms improvement (2 FPS, 73 -> 75) on my machine.

Future Work
The changes to enable internal mutability in a thread-safe manner can be extended to also allow internal parallelism in heavy queue tasks.
Changelog
TODO
Migration Guide
TODO