MultithreadedExecutor bottlenecking at 1000+ Systems
## Bevy version
0.12.1
## Relevant system information
- CPU: AMD Ryzen Threadripper 3970X 32-Core Processor 3.69 GHz
- RAM: 32GB
## What you did
Hello! I have a use case that essentially involves separating identical groups of entities. Since Bevy's subworld support is not complete and Bevy does not have shared components (like Unity DOTS), I opted for a solution where I use Rust generics to "duplicate" my systems for every group with a SparseSet marker component: `Marker::<0>`, `Marker::<1>`, ... components and `SystemA::<0>`, `SystemA::<1>`, ... systems. The idea was that the separate systems/marker components would allow Bevy to properly parallelize logic across groups, since there are no cross-group dependencies.
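Stripped of Bevy specifics, the const-generic duplication pattern looks roughly like this (names and data are illustrative, not the actual code from the repo; in Bevy the system would take a `Query<..., With<Marker<GROUP>>>` instead of a plain `Vec`):

```rust
// Marker type parameterized by a const group index, mirroring a
// `#[derive(Component)] struct Marker<const GROUP: usize>;` in Bevy.
struct Marker<const GROUP: usize>;

// A "system" duplicated per group via monomorphization. Each
// instantiation is a distinct function item, so the scheduler sees
// N independent systems with disjoint data access.
fn system_a<const GROUP: usize>(values: &mut Vec<f32>) {
    for v in values.iter_mut() {
        *v += GROUP as f32; // group-specific work, no cross-group data
    }
}

fn main() {
    let mut group0 = vec![1.0, 2.0];
    let mut group1 = vec![1.0, 2.0];
    // Registering `system_a::<0>`, `system_a::<1>`, ... with the app
    // yields one schedulable system per group.
    system_a::<0>(&mut group0);
    system_a::<1>(&mut group1);
    assert_eq!(group1, vec![2.0, 3.0]);
}
```

The trade-off this issue demonstrates: monomorphization makes the systems independent, but it also multiplies the number of systems the executor must coordinate each frame.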
## What went wrong
It seems Bevy is bottlenecked by the number of systems for my use case. Attempting 6000 systems (2000 groups, 3 systems/group) results in 7% CPU utilization at 12 FPS. A Tracy capture indicates that 80+% of the CPU time is spent in the multithreaded executor before it sends tasks to my thread pool.
I have created a GitHub repository with the capture and code: https://github.com/UsaidPro/BevyLotsOfSystems
I was hoping Bevy would distribute the systems across the full thread pool provided by my 32-core CPU. Instead, one core gets consumed by the multithreaded executor, which does distribute the tasks across all threads (I see 55+ thread pools in Tracy), but only after taking ~60+ ms (80+% of compute time). The multithreaded executor has an MTPC of 470 µs, but it is called 17k times compared to 129 Update calls, resulting in 83% of the time being spent on the single thread.
Here is a table of systems vs. FPS. All of these runs used only 7% of my CPU, with the same bottleneck. I have 3 systems per group; one of them only runs if its `run_if()` condition returned true.

| Groups | Concurrent Systems | Conditional Systems | FPS |
|---|---|---|---|
| 2000 | 4000 | 2000 | 12 |
| 1000 | 2000 | 1000 | 40 |
| 500 | 1000 | 500 | 60 |
## Additional information
Tracy screenshot:
- Tracy capture

GitHub repository with code (uses Bevy Rapier3D, which does not seem to be related to this issue):
- Line in the GitHub repo where you can set the number of groups you want
- This code may be used as a stress test for Bevy's scheduler handling lots of systems. I can raise a PR if it would be useful.
FWIW, this becomes really obvious when using a headless application; I noticed the same issues as soon as schedule v3 was merged. See: https://discord.com/channels/691052431525675048/692572690833473578/1115422818012762274
Here are FPS comparisons between Bevy versions 0.9.1 and 0.12.1. I was told on the Discord that I should test with LTO enabled, but I cannot test 0.12.1 with LTO enabled due to an issue.

| # of groups | # of systems | 0.12.1 FPS | 0.9.1 FPS | 0.9.1 LTO FPS |
|---|---|---|---|---|
| 2000 | 6000 | 13.211003 | 15.001518 | 14.079012 |
| 1000 | 3000 | 19.089099 | 43.316860 | 30.000613 |

At 500 groups (1500 systems), all three configurations reach 60+ FPS.

Interestingly, enabling LTO reduces FPS for 0.9.1. Not sure why.
I'm very curious what use case you have where you need that many systems, but this makes plenty of sense given that the executor cannot schedule systems fast enough if they all terminate quickly. There are options like #8304 that have been thrown around, but I'm pretty sure the contention they introduce would be on par with, if not worse than, what we see here.
He was running a reinforcement learning simulation and used const-generic systems as group markers. His use case would be solved by bevyengine/rfcs#16.
#12990 should reduce the overhead by a large amount. Could you test out that PR and see if it works out for you?
With that said, I just opened the provided trace and noticed that the bottleneck may actually be running the run conditions, which are all evaluated inline in the multithreaded executor, plus the cost of creating new spans for them while profiling. In this particular case, where the cost of running a system and its run condition are both very small, it may actually be better to embed an early return in the system than to add a run condition.
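As a plain-Rust sketch of that suggestion (names are illustrative, not from the issue's repository): instead of attaching a run condition that the executor must evaluate inline on its coordinating thread, the check moves into the system body, so the system is just another cheap task.

```rust
// `SimActive` stands in for a Bevy `Resource` that gates the system.
struct SimActive(bool);

// Run-condition style: in Bevy this predicate would be attached via
// `.run_if(...)` and evaluated inline by the multithreaded executor.
fn sim_is_active(active: &SimActive) -> bool {
    active.0
}

// Early-return style: the system is always dispatched as a task and
// bails out itself when there is nothing to do.
fn conditional_system(active: &SimActive, steps: &mut u32) {
    if !active.0 {
        return; // replaces `.run_if(sim_is_active)`
    }
    *steps += 1;
}

fn main() {
    let mut steps = 0;
    conditional_system(&SimActive(false), &mut steps); // skipped
    conditional_system(&SimActive(true), &mut steps);  // runs
    assert_eq!(steps, 1);
    assert!(sim_is_active(&SimActive(true)));
}
```

In Bevy terms, the first style corresponds to `.add_systems(Update, my_system.run_if(condition))`, while the second simply checks the resource at the top of `my_system` itself.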
I tested removing the `run_if` and adding an early return, but I saw no difference.
I also made a fork, upgraded to the latest Bevy, and tried to use #12990, but `bevy_rapier` is incompatible with that version.

I could remove `bevy_rapier` and add some dumb expensive calculations, but that doesn't seem representative of the use case. I guess I'll try to switch from `bevy_rapier` to `bevy_xpbd`, since it's more likely to be compatible with the latest Bevy. I'll do it maybe tomorrow.