MultithreadedExecutor bottlenecking at 1000+ Systems

Open UsaidPro opened this issue 1 year ago • 3 comments

Bevy version

0.12.1

Relevant system information

  • CPU: AMD Ryzen Threadripper 3970X 32-Core Processor 3.69 GHz
  • RAM: 32GB

What you did

Hello! I have a use case that essentially involves separating identical groups of entities. Since Bevy's subworld support is not complete and Bevy does not have shared components (like Unity DOTS), I opted for a solution where I use Rust generics to "duplicate" my systems for every group, each with a SparseSet marker component. So Marker::<0>, Marker::<1>, ... components and SystemA::<0>, SystemA::<1>, ... systems. The idea was that separate systems and marker components would allow Bevy to properly parallelize logic across groups, since there are no cross-group dependencies.
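The const-generic duplication described above can be sketched without Bevy at all. This is a minimal std-only illustration, not the code from the linked repository: `Marker`, `system_a`, `systems_for_groups!`, and `build_systems` are hypothetical names, and a plain `Vec` of function pointers stands in for a Bevy schedule. It only shows that each group index produces a distinct marker type and a distinct monomorphized function.

```rust
use std::any::TypeId;

// Hypothetical per-group marker: each value of N is a distinct type.
// In the real project this would be a Bevy component.
#[allow(dead_code)]
struct Marker<const N: usize>;

// Hypothetical per-group system: the compiler emits one copy per N.
// Here it just reports which group it belongs to.
fn system_a<const N: usize>() -> usize {
    N
}

// Expands a list of group indices into a Vec of monomorphized function
// pointers, standing in for registering one system per group.
macro_rules! systems_for_groups {
    ($($n:literal),* $(,)?) => {
        vec![$(system_a::<{ $n }> as fn() -> usize),*]
    };
}

// Builds the "schedule" for four groups.
fn build_systems() -> Vec<fn() -> usize> {
    systems_for_groups!(0, 1, 2, 3)
}

fn main() {
    for system in build_systems() {
        println!("running system for group {}", system());
    }
    // Different const arguments really are different types.
    assert_ne!(TypeId::of::<Marker<0>>(), TypeId::of::<Marker<1>>());
}
```

With 2000 groups and 3 systems per group, this pattern yields 6000 separately scheduled systems, which is the scale discussed in the rest of the report.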

What went wrong

It seems Bevy is bottlenecked by the number of systems for my use case. Attempting 6000 systems (2000 groups, 3 systems/group) results in 7% CPU utilization at 12 FPS. A Tracy capture indicates that over 80% of the CPU time is spent in the multithreaded executor before tasks are sent to my thread pool. I have created a GitHub repository with the capture and code: https://github.com/UsaidPro/BevyLotsOfSystems

I was hoping Bevy would distribute the systems across the full thread pool provided by my 32-core CPU. Instead, one core gets consumed by the multithreaded executor, which does distribute the tasks across all threads (I see 55+ thread pools in Tracy), but only after taking ~60+ ms (80+% of compute time). The multithreaded executor has an MTPC (mean time per call) of 470 us, but it is called 17k times compared to 129 Update calls, resulting in 83% of the time being spent in that single thread.
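As a rough cross-check, the Tracy figures quoted above can be combined into a per-frame estimate. This is a back-of-the-envelope sketch: it treats "17k" as exactly 17,000 calls and assumes each of the 129 Update calls corresponds to one frame. The function name `executor_ms_per_frame` is made up for this example.

```rust
/// Estimated executor time per frame in milliseconds, from the mean time
/// per call (microseconds), the total call count, and the frame count.
fn executor_ms_per_frame(mtpc_us: f64, calls: f64, frames: f64) -> f64 {
    mtpc_us * calls / frames / 1000.0
}

fn main() {
    // Figures from the report; 17_000 is an assumed reading of "17k".
    let mtpc_us = 470.0;
    let calls = 17_000.0;
    let frames = 129.0;

    let calls_per_frame = calls / frames;
    let ms_per_frame = executor_ms_per_frame(mtpc_us, calls, frames);

    println!("executor calls per frame: {calls_per_frame:.0}");
    println!("executor time per frame:  {ms_per_frame:.1} ms");
}
```

Under those assumptions the estimate comes out at roughly 62 ms of executor time per frame, which is in line with the "~60+ ms" reading from the capture.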

Here is a table of system counts versus FPS. Every configuration used only 7% of my CPU and hit the same bottleneck. Each group has 3 systems, one of which only runs if its run_if() condition returns true.

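| Groups | Concurrent Systems | Conditional Systems | FPS |
|--------|--------------------|---------------------|-----|
| 2000   | 4000               | 2000                | 12  |
| 1000   | 2000               | 1000                | 40  |
| 500    | 1000               | 500                 | 60  |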

Additional information

Tracy screenshot: tracy_screenshot

UsaidPro avatar Jan 17 '24 02:01 UsaidPro

FWIW, this becomes really obvious when running a headless application. I noticed the same issue as soon as schedule v3 was merged. See: https://discord.com/channels/691052431525675048/692572690833473578/1115422818012762274

AxiomaticSemantics avatar Jan 17 '24 03:01 AxiomaticSemantics

Here are FPS comparisons between Bevy versions 0.9.1 and 0.12.1. I was told in the Discord I should test with LTO enabled, but cannot test 0.12.1 with LTO enabled due to an issue.

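| # of groups | # of systems | 0.12.1 FPS | 0.9.1 FPS | 0.9.1 LTO FPS |
|-------------|--------------|------------|-----------|---------------|
| 2000        | 6000         | 13.211003  | 15.001518 | 14.079012     |
| 1000        | 3000         | 19.089099  | 43.316860 | 30.000613     |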

With 500 groups (1500 systems), FPS is 60+ for all three configurations.

Interestingly, enabling LTO reduces FPS on 0.9.1. Not sure why.

UsaidPro avatar Jan 21 '24 01:01 UsaidPro

I'm very curious what use case you have where you need that many systems, but this makes plenty of sense given that the executor cannot schedule systems fast enough if they all terminate quickly. There are options like #8304 that have been thrown around, but I'm pretty sure that the contention introduced by it will be on par with, if not worse than, what we see here.

james7132 avatar Feb 09 '24 06:02 james7132

He was running a reinforcement learning simulation and used const generic systems as group markers. His use case would be solved by bevyengine/rfcs#16

s-puig avatar Mar 29 '24 15:03 s-puig

#12990 should reduce the overhead by a large amount. Could you test out that PR and see if it works out for you?

james7132 avatar Apr 16 '24 06:04 james7132

With that said, I just opened the provided trace and noticed that the bottleneck may actually be running run conditions, which are all run inline in the multithreaded executor, and the costs of making new spans for them while profiling. In this particular case where the cost of running a system and the run condition are very small, it may actually be better just to embed an early return in the system than to add a run condition.
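The trade-off between a run condition and an embedded early return can be illustrated with a toy model. This is not Bevy's executor or API: `State`, `System`, `Stats`, and `run_schedule` are hypothetical names, and a single-threaded loop stands in for the scheduling thread. It assumes, as described above, that conditions are evaluated inline by the scheduler while system bodies are handed off as tasks.

```rust
/// Toy stand-in for the world state read by conditions and systems.
struct State {
    tick: u32,
}

/// Hypothetical condition: only act on even ticks.
fn is_even_tick(state: &State) -> bool {
    state.tick % 2 == 0
}

/// Variant A: the body relies on the scheduler having checked the condition.
fn work(state: &State) -> u32 {
    state.tick + 1
}

/// Variant B: the same check embedded in the system as an early return.
fn work_with_early_return(state: &State) -> u32 {
    if !is_even_tick(state) {
        return 0;
    }
    state.tick + 1
}

struct System {
    run_condition: Option<fn(&State) -> bool>,
    body: fn(&State) -> u32,
}

/// Work performed by the scheduling loop itself.
#[derive(Debug, Default, PartialEq)]
struct Stats {
    inline_condition_checks: usize,
    dispatched: usize,
}

/// Toy scheduling loop: conditions run inline on the scheduling thread,
/// while calling `body` stands in for dispatching a task to a worker.
fn run_schedule(systems: &[System], state: &State) -> (Stats, u32) {
    let mut stats = Stats::default();
    let mut total = 0;
    for system in systems {
        if let Some(condition) = system.run_condition {
            stats.inline_condition_checks += 1;
            if !condition(state) {
                continue;
            }
        }
        stats.dispatched += 1;
        total += (system.body)(state);
    }
    (stats, total)
}

/// N systems gated by a scheduler-side condition.
fn with_run_conditions(n: usize) -> Vec<System> {
    (0..n)
        .map(|_| System {
            run_condition: Some(is_even_tick as fn(&State) -> bool),
            body: work,
        })
        .collect()
}

/// N systems that check the condition themselves.
fn with_early_returns(n: usize) -> Vec<System> {
    (0..n)
        .map(|_| System { run_condition: None, body: work_with_early_return })
        .collect()
}

fn main() {
    let state = State { tick: 3 };
    println!("{:?}", run_schedule(&with_run_conditions(4), &state));
    println!("{:?}", run_schedule(&with_early_returns(4), &state));
}
```

In this model both variants compute the same result. The difference is where the check runs: the run-condition variant adds one inline check per system to the scheduling loop, while the early-return variant moves that check into the dispatched task at the cost of always dispatching it.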

james7132 avatar Apr 16 '24 07:04 james7132

I tested removing the run_if and adding an early return, but I saw no difference.

I also made a fork, upgraded to the latest Bevy, and tried to use #12990, but bevy_rapier is incompatible with this version.

I could remove bevy_rapier and add some dumb expensive calculations, but that doesn't seem like a good use case. I guess I'll try switching from bevy_rapier to bevy_xpbd, since it's more likely to be compatible with the latest Bevy. I'll do it maybe tomorrow.

afonsolage avatar Apr 17 '24 12:04 afonsolage