
Improve Queue Phase parallelization and other small optimizations

Open james7132 opened this issue 3 years ago • 0 comments

Objective

A mutable reference to PipelineCache prevents a good chunk of the systems in the queue phase from parallelizing with each other, even though its primary use through the Specialized*Pipelines<T> types should rarely require mutation. Likewise, the &mut RenderPhase<I> on each render phase also requires exclusive access, which prevents multiple systems from enqueuing phase items at the same time.

Solution

Selectively leverage interior mutability and thread-local storage in a way that avoids adding per-entity overhead.

RenderPhase

  • Use ThreadLocal inside RenderPhase to create thread local queues of phase items.
  • Add a phase_scope for enqueuing on each thread separately without locking overhead.
  • Collect all of the thread-local phase items into one vec before sorting.
  • Shrink the size of IDs used to reduce the amount of memory being shuffled around.
  • Use Vec::sort_unstable_by_key for a slight sorting speedup.
  • Change all of the &mut RenderPhase<T> parameters to &RenderPhase<T>, allowing for increased parallelism.
  • Can we just shove all of these into a std::collections::BinaryHeap and drain?
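A minimal sketch of the thread-local queue idea. This uses a Mutex<Vec<Vec<_>>> as a stand-in for a true lock-free ThreadLocal, and the RenderPhase and phase_scope shown here are simplified illustrations, not the real Bevy API:

```rust
use std::sync::Mutex;

// Simplified stand-in for RenderPhase<I>: each worker fills its own queue,
// and all queues are merged into one Vec before sorting.
struct RenderPhase<I> {
    queues: Mutex<Vec<Vec<I>>>,
}

impl<I: Ord> RenderPhase<I> {
    fn new() -> Self {
        Self { queues: Mutex::new(Vec::new()) }
    }

    // phase_scope: enqueue into a local Vec, then hand it back under one
    // short lock (the real design avoids even this with thread-local storage).
    fn phase_scope(&self, f: impl FnOnce(&mut Vec<I>)) {
        let mut local = Vec::new();
        f(&mut local);
        self.queues.lock().unwrap().push(local);
    }

    // Collect all per-thread queues into one Vec and sort it (the real code
    // would use sort_unstable_by_key over compact sort keys).
    fn collect_and_sort(&self) -> Vec<I> {
        let mut all: Vec<I> = self.queues.lock().unwrap().drain(..).flatten().collect();
        all.sort_unstable();
        all
    }
}
```

Because phase_scope only takes &self, any number of systems can enqueue into the same phase concurrently.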

PipelineCache

  • Introduce LockablePipelineCache, a wrapper around RwLock<PipelineCache>.
  • Change Specialized*Pipelines to take a &LockablePipelineCache instead of &mut PipelineCache. Update systems to match.
  • Introduce two systems, lock_pipeline_cache (in Extract) and unlock_pipeline_cache (in PhaseSort), that wrap the existing PipelineCache resource into a LockablePipelineCache and back again. This allows render phase draw functions to read from the cache without any contention, at the cost of one small command buffer.
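A sketch of the RwLock wrapper idea. The PipelineCache here is a toy stand-in for Bevy's real cache, and the method names are illustrative:

```rust
use std::sync::RwLock;

// Toy stand-in for Bevy's PipelineCache: just a list of pipeline descriptors.
struct PipelineCache {
    pipelines: Vec<String>,
}

// Wrapper resource that lock_pipeline_cache would insert during Extract.
struct LockablePipelineCache(RwLock<PipelineCache>);

impl LockablePipelineCache {
    // Many queue systems can look up already-specialized pipelines
    // concurrently with only a read lock...
    fn get(&self, id: usize) -> Option<String> {
        self.0.read().unwrap().pipelines.get(id).cloned()
    }

    // ...and only take the write lock on the rare cache miss where a new
    // pipeline must be queued.
    fn queue(&self, descriptor: String) -> usize {
        let mut cache = self.0.write().unwrap();
        cache.pipelines.push(descriptor);
        cache.pipelines.len() - 1
    }
}

// unlock_pipeline_cache would unwrap the RwLock back into the bare resource
// before PhaseSort, so draw functions read the cache with no locking at all.
fn unlock(lockable: LockablePipelineCache) -> PipelineCache {
    lockable.0.into_inner().unwrap()
}
```

Since specialization hits the existing-pipeline path almost every frame, the write lock is rarely contended.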

Performance

I tested this on the default configuration of many_foxes, which has several heavy queue systems.

The direct effects are as expected: the queue phase shows a 30% speedup on my machine due to the increased parallelism. (yellow is this PR, red is main)

[screenshot: queue phase trace, yellow = this PR, red = main]

For the sort phase, there is a slight regression due to the additional copy into the sorted Vec before sorting. This is partially offset by shrinking the draw function type sizes, and could be further addressed by using a more optimized sorting algorithm (e.g. voracious).

[screenshot: sort phase trace]

Overall, this yields a roughly 0.3ms (2 FPS, 73 -> 75) frame-time improvement on my machine.
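As a sanity check on that arithmetic, converting the quoted FPS numbers to frame times:

```rust
// Frame time in milliseconds at a given FPS.
fn frame_ms(fps: f64) -> f64 {
    1000.0 / fps
}

// frame_ms(73.0) ≈ 13.70 ms and frame_ms(75.0) ≈ 13.33 ms, so the
// difference is ≈ 0.37 ms, consistent with the quoted ~0.3 ms gain.
```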

[screenshot: overall frame time comparison]

Future Work

The changes that enable interior mutability in a thread-safe manner can be extended to also allow internal parallelism within heavy queue systems.


Changelog

TODO

Migration Guide

TODO

james7132 · Jun 02 '22