MultithreadedExecutor bottlenecking at 1000+ Systems
## Bevy version
0.12.1
## Relevant system information
- CPU: AMD Ryzen Threadripper 3970X 32-Core Processor 3.69 GHz
- RAM: 32GB
## What you did
Hello! I have a use case that essentially involves separating identical groups of entities. Since Bevy's subworld support is not complete and Bevy does not have shared components (like Unity DOTS), I opted for a solution where I use Rust generics to "duplicate" my systems for every group with a SparseSet marker component: `Marker::<0>`, `Marker::<1>`, ... components and `SystemA::<0>`, `SystemA::<1>`, ... systems. The idea was that the separate systems/marker components would allow Bevy to properly parallelize logic across groups, since there are no cross-group dependencies.
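Stripped of Bevy specifics, the const-generic duplication pattern looks roughly like this (names and data are illustrative, not the actual code from the repo; in Bevy the system would take a `Query<..., With<Marker<GROUP>>>` instead of a plain `Vec`):

```rust
// Marker type parameterized by a const group index, mirroring a
// `#[derive(Component)] struct Marker<const GROUP: usize>;` in Bevy.
struct Marker<const GROUP: usize>;

// A "system" duplicated per group via monomorphization. Each
// instantiation is a distinct function item, so the scheduler sees
// N independent systems with disjoint data access.
fn system_a<const GROUP: usize>(values: &mut Vec<f32>) {
    for v in values.iter_mut() {
        *v += GROUP as f32; // group-specific work, no cross-group data
    }
}

fn main() {
    let mut group0 = vec![1.0, 2.0];
    let mut group1 = vec![1.0, 2.0];
    // Registering `system_a::<0>`, `system_a::<1>`, ... with the app
    // yields one schedulable system per group.
    system_a::<0>(&mut group0);
    system_a::<1>(&mut group1);
    assert_eq!(group1, vec![2.0, 3.0]);
}
```

The trade-off this issue demonstrates: monomorphization makes the systems independent, but it also multiplies the number of systems the executor must coordinate each frame.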
## What went wrong
It seems Bevy is bottlenecked by the number of systems for my use case. Attempting 6000 systems (2000 groups, 3 systems/group) results in 7% CPU utilization at 12 FPS. A Tracy capture indicates that 80+% of the CPU time is spent in the multithreaded executor before it sends tasks to my thread pool.
I have created a GitHub repository with the capture and code: https://github.com/UsaidPro/BevyLotsOfSystems
I was hoping Bevy would distribute the systems across the full thread pool provided by my 32-core CPU. Instead, one core gets consumed by the multithreaded executor, which does distribute the tasks across all threads (I see 55+ thread pools in Tracy), but only after taking ~60+ ms (80+% of compute time). The multithreaded executor has an MTPC of 470 µs, but it is called 17k times compared to 129 Update calls, resulting in 83% of the time being spent on the single thread.
Here is a table of systems vs. FPS. All of these runs used only 7% of my CPU, with the same bottleneck. I have 3 systems per group; one of them only runs if its `run_if()` condition returned true.

| Groups | Concurrent Systems | Conditional Systems | FPS |
|---|---|---|---|
| 2000 | 4000 | 2000 | 12 |
| 1000 | 2000 | 1000 | 40 |
| 500 | 1000 | 500 | 60 |
## Additional information
Tracy screenshot:
- Tracy capture

GitHub repository with code (uses Bevy Rapier3D, which does not seem to be related to this issue):
- Line in the GitHub repo where you can set the number of groups you want
- This code may be used as a stress test for Bevy's scheduler handling lots of systems. I can raise a PR if it would be useful.
FWIW, this becomes really obvious when using a headless application; I noticed the same issues as soon as schedule v3 was merged. See: https://discord.com/channels/691052431525675048/692572690833473578/1115422818012762274
Here are FPS comparisons between Bevy versions 0.9.1 and 0.12.1. I was told on the Discord that I should test with LTO enabled, but I cannot test 0.12.1 with LTO enabled due to an issue.

| # of groups | # of systems | 0.12.1 FPS | 0.9.1 FPS | 0.9.1 LTO FPS |
|---|---|---|---|---|
| 2000 | 6000 | 13.211003 | 15.001518 | 14.079012 |
| 1000 | 3000 | 19.089099 | 43.316860 | 30.000613 |

At 500 groups (1500 systems), all three configurations reach 60+ FPS.

Interestingly, enabling LTO reduces FPS for 0.9.1. Not sure why.
I'm very curious what use case you have where you need that many systems, but this makes plenty of sense given that the executor cannot schedule systems fast enough if they all terminate quickly. There are options like #8304 that have been thrown around, but I'm pretty sure the contention they introduce would be on par with, if not worse than, what we see here.
He was running a reinforcement learning simulation and used const-generic systems as group markers. His use case would be solved by bevyengine/rfcs#16.
#12990 should reduce the overhead by a large amount. Could you test out that PR and see if it works out for you?
With that said, I just opened the provided trace and noticed that the bottleneck may actually be running the run conditions, which are all evaluated inline in the multithreaded executor, plus the cost of creating new spans for them while profiling. In this particular case, where the cost of running a system and its run condition are both very small, it may actually be better to embed an early return in the system than to add a run condition.
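As a plain-Rust sketch of that suggestion (names are illustrative, not from the issue's repository): instead of attaching a run condition that the executor must evaluate inline on its coordinating thread, the check moves into the system body, so the system is just another cheap task.

```rust
// `SimActive` stands in for a Bevy `Resource` that gates the system.
struct SimActive(bool);

// Run-condition style: in Bevy this predicate would be attached via
// `.run_if(...)` and evaluated inline by the multithreaded executor.
fn sim_is_active(active: &SimActive) -> bool {
    active.0
}

// Early-return style: the system is always dispatched as a task and
// bails out itself when there is nothing to do.
fn conditional_system(active: &SimActive, steps: &mut u32) {
    if !active.0 {
        return; // replaces `.run_if(sim_is_active)`
    }
    *steps += 1;
}

fn main() {
    let mut steps = 0;
    conditional_system(&SimActive(false), &mut steps); // skipped
    conditional_system(&SimActive(true), &mut steps);  // runs
    assert_eq!(steps, 1);
    assert!(sim_is_active(&SimActive(true)));
}
```

In Bevy terms, the first style corresponds to `.add_systems(Update, my_system.run_if(condition))`, while the second simply checks the resource at the top of `my_system` itself.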
I tested removing the `run_if` and adding an early return, but I saw no difference.
I also made a fork, upgraded to the latest Bevy, and tried to use #12990, but `bevy_rapier` is incompatible with that version.

I could remove `bevy_rapier` and add some dumb expensive calculations, but that doesn't seem representative of the use case. I guess I'll try to switch from `bevy_rapier` to `bevy_xpbd`, since it's more likely to be compatible with the latest Bevy. I'll do it maybe tomorrow.