bevy Very high CPU usage on MacOS

Summary

On MacOS, even simple Bevy apps cause very high CPU load; for example, even a blank app that just loads DefaultPlugins to render a blank window sits at ~50% CPU usage on an Apple Silicon M1. Add a couple of rendered meshes and postprocessing and it'll quickly go up to about ~90% CPU.

On Windows and Linux, the same app will comfortable sit below 1% (even with some light rendering.)

Additional notes:

A MRP has been provided here.
It uses the opt-level = 3 etc. trick from the docs.
Compiling a --release build doesn't make a difference.
In addition, when hiding the app, CPU load doubles, but I understand there have already been issues filed about this particular problem.

Bevy version

0.11.3 (but I also tried 66f72dd, the newest commit on main at the time.)

Relevant system information

Rust 1.73.0
MacOS Sonoma 14.0

AdapterInfo { name: "Apple M1", vendor: 0, device: 0, device_type: IntegratedGpu, driver: "", driver_info: "", backend: Metal }
SystemInfo { os: "MacOS 14.0 ", kernel: "23.0.0", cpu: "Apple M1", core_count: "8", memory: "16.0 GiB" }

What you did

Build a "blank" Bevy app (code available here)
Compile and run it

What went wrong

Expected behavior: the app should cause CPU load in the single-digit percentage range, like it does on Windows (<1% CPU, on a Ryzen 7.)

Actual behavior: the app caused ~50% CPU load.

Oct 25 '23 18:10 hmans

This is a long-standing bug, but it's nice to see it seemingly isolated to Mac. Could we see a tracing log? https://github.com/bevyengine/bevy/blob/main/docs/profiling.md

I'd also like to try stripping out even more things from the DefaultPlugins, and see if we can reduce this to one or two offending plugins. Unfortunately I don't have access to a Mac to test this myself, sorry.

Oct 25 '23 19:10 alice-i-cecile

I've created a tracing log using cargo run --release --features bevy/trace_chrome, letting run the app run for a handful of seconds, but it's already 442 MB, so I can't upload it here. :(

Is there any way I can send it to you, via Perfetto or otherwise?

Oct 25 '23 19:10 hmans

Assuming that this represents a (randomly picked) single frame, maybe this is of use to you:

Oct 25 '23 19:10 hmans

Maybe a GDrive link? Feel free to delete it after a couple days.

@superdump it looks like we're getting hit by excessively long prepare systems again 🤔

Oct 25 '23 19:10 alice-i-cecile

@alice-i-cecile - here's a Google Drive link for the trace, I hope it works for you:

https://drive.google.com/file/d/1DxPcWasjVDRJDlL1_PSQh-5qhpxEeA3p/view?usp=sharing

Oct 25 '23 19:10 hmans

Trace downloaded and open locally, thank you very much!

Pulling this up in Perfetto, I'm still just seeing extremely long Prepare systems. I wonder if the way that wgpu polls is different on Mac? Perhaps it's spinning a thread to try and vsync automatically?

Oct 25 '23 19:10 alice-i-cecile

Pulling this up in Perfetto, I'm still just seeing extremely long Prepare systems. I wonder if the way that wgpu polls is different on Mac? Perhaps it's spinning a thread to try and vsync automatically?

Not sure how helpful this is, but I did try a couple of the more simple wgpu examples (from their latest main) and didn't observe any noticeable performance issues that looked like the one I was seeing with Bevy.

But please keep in mind that I'm new to this entire ecosystem (including Rust itself), so YMMV...

Oct 25 '23 20:10 hmans

That's very helpful feedback.

This might be related to https://github.com/bevyengine/bevy/issues/9964.

Oct 25 '23 20:10 alice-i-cecile

If you reduce frame rate significantly, e.g. with WinitSettings, then CPU usage drops. That's the only 'solution' I have found. Framepacing does nothing.

Oct 25 '23 20:10 UkoeHB

Reduce frame rate to what?

Oct 25 '23 20:10 superdump

Reduce frame rate to what?

Like 1-10 FPS...

Oct 25 '23 20:10 UkoeHB

I'm trying to reduce the amount of code necessary to reproduce out on https://github.com/hmans/bevy-mrp-macos-performance/pull/1

after more looking, aren't we hitting the warning from the profiling docs (https://github.com/bevyengine/bevy/blob/main/docs/profiling.md#runtime):

When your app is bottlenecked by the GPU, you may encounter frames that have multiple prepare-set systems all taking an unusually long time to complete, and all finishing at about the same time.

?

If that's the case, I'd love a dedicated issue to that. Would adding some span help with pinpointing the exact issue ? these prepare_* functions are quite long and it's not clear to me what's the expensive operation 🤔

Oct 29 '23 08:10 ThierryBerger

Seems related to https://github.com/bevyengine/bevy/issues/5713

Oct 30 '23 19:10 ThierryBerger

I noticed that my MacBook was running hot, and I found this issue. Upon running the starting example with DefaultPlugins, I noticed an unusually high CPU usage of 170%. Through a process of elimination, I determined that the culprit behind the issue is the RenderPlugin. The example runs smoothly with normal CPU usage when this particular plugin is excluded. While I intend to delve further into this matter, my limited experience with Bevy, having used it for only a day, may limit the extent of my assistance.

Nov 11 '23 22:11 matevz-ap

So I dug into this about 5 months ago, and found the same problem somewhere in the render pipeline. The discussion & profiling results are in the discord thread below.

It looked like it may have been a contention issue in wgpu. That was about where I stopped because needed to make a wgpu-only poc to see if it was a wgpu-on-mac issue. Also this research was done on an Intel Mac.

https://discord.com/channels/691052431525675048/743663924229963868/1119076766405886004

from the discord:

I was using apple's instruments to do the profiling, and have used the builtin macos sampling. They all report that the thread is blocked in the [gpu] write path. Sampling can't really detect busy loop vs wait unless the function name gives it away. I haven't seen anything that gives away what's going on.

Nov 21 '23 21:11 dmlary

I've just checked the current latest and this still seems to be an issue (2 full cores used on a M2 Pro with even the trivial examples.) Is there any outlook on this eventually improving?

Jun 07 '24 09:06 hmans

We are interested in this improving! But this needs further investigation to pinpoint: it's still not at all clear where in the stack this is occurring. Does this occur with wgpu's examples?

Jun 07 '24 12:06 alice-i-cecile

related issue in my opinion

Jun 07 '24 14:06 adals

We are interested in this improving! But this needs further investigation to pinpoint: it's still not at all clear where in the stack this is occurring. Does this occur with wgpu's examples?

I just tried exactly this today (current HEAD there) and it seems fine. I'm happy to provide further testing or any other input that might help. Sadly I am not familiar with either codebase so I can't really provide any input there.

Jun 07 '24 16:06 hmans

Great :) I would double-check at https://github.com/gfx-rs/wgpu/releases/tag/v0.19.4 which is the version Bevy is currently on, but I would suspect that there seems to be something wrong with Bevy's implementation then.

It looks like the various Prepare steps are at fault. I'll ask the rendering experts if they might have insight there. I suspect we're spin locking while waiting to coordinate between the various steps somewhere, which is manifesting as high CPU usage without actually doing useful work.

Are you able to achieve reasonable results (more than 50k entities) on the Bevymark example?

Jun 07 '24 20:06 alice-i-cecile

@hmans could you try forcing a scale_factor of 1.0 for the window and see if it makes a difference. For some reason macos reports a really high scale factor by default.

Jun 07 '24 20:06 IceSentry

I tested with 0.11.3 and 0.14.0-rc.2 and observe the same. Using scale_factor of 1.0 makes no difference. It hovers around 40-50% CPU in Activity Monitor, top, htop.

I looked at a GPU frame trace in Xcode and it's only running the no_camera_clear_pass and using microseconds of time. Oddly it said that pass was using 10800 vertices. No idea why.

I looked at a CPU profile using Instruments and that shows no significant CPU usage. Running for 20s put CPU cycles spent for frame updates (at 120Hz) at about the same as plugin initialisation.

2024-06-08 - bevy macos default plugins cpu.trace.zip

Jun 08 '24 04:06 superdump

bevy-mrp-macos-performance.gputrace.zip Here's the GPU trace with scale factor 1.0.

Jun 08 '24 04:06 superdump

And just to note that I think tracy traces showing waiting on prepare set systems makes sense as they are blocked by waiting on vsync for a new swapchain texture.

Jun 08 '24 04:06 superdump

I've tried 0.14 and the issue is still persisting.

Since wgpu itself does not seem to have this issue, from the perspective of a naive outsider who's not familiar with either code bases, my best guess would be that wgpu uses vsync out of the box (idling until it's time to render the next frame) and Bevy doesn't (at least on macOS); however, even the "fps_overlay" example that renders literally nothing uses ~100% CPU while displaying an FPS count of 100 (on my 100Hz screen) when focused, and ~60 FPS when unfocused. So it's probably not vsync vs. unlocked.

A couple of folks on Reddit pointed me towards the (new?) WinitSettings::desktop_app(), but that isn't the solution to this issue because all it does is prevent the rendering of new frames if there is no user input. Once it does render frames, things are as described above.

Jul 05 '24 07:07 hmans

Great :) I would double-check at https://github.com/gfx-rs/wgpu/releases/tag/v0.19.4 which is the version Bevy is currently on, but I would suspect that there seems to be something wrong with Bevy's implementation then.

I know I've taken my good time with doing this (apologies), but I now got around to checking that specific version, and its cube example uses a relaxed ~19% CPU time for rendering at 120 FPS (I'm on a M2 Pro now.)

Bevy 0.14 appears to use wgpu 0.20, which exhibits the same performance.

19% still feel a little too high for what the example is doing, but it's nowhere near the 50%-200% that I'm seeing with simple Bevy examples.

As mentioned on Reddit earlier today, I'm (stupidly and naively, and very likely unsuccessfully) trying to track this issue down myself now. Just wanted to post these numbers here for reference.

Jul 06 '24 17:07 hmans

I can reproduce this by run Sprite Sheet example on m3max, it took 66-75% CPU and 34% CPU for opt-level = 1

Not looking good for just 1 sprite animation.

Sep 09 '24 10:09 katopz

Can you try with opt-level=3 or just --release? opt-level=1 is barely any optimization and isn't a good indicator of real performance.

Sep 09 '24 16:09 IceSentry

Can you try with opt-level=3 or just --release? opt-level=1 is barely any optimization and isn't a good indicator of real performance.

opt-level=1 👉 34% CPU opt-level=3 👉 36% CPU

i didn't bother --release because i on dev and my m3 fans are spinning 🔥

[profile.dev]
opt-level = 3

[profile.release]
lto = true
opt-level = 3
codegen-units = 1
incremental = false
debug = false

And the result with [profile.release] is --release 👉 34% CPU

just --release 👉 36% CPU

Actually it look same all over setting. Something really weird here.

Sep 10 '24 05:09 katopz

What are the chances this issue is related to how bevy is calling winit? On the second to last known regression mentioned here: https://github.com/bevyengine/bevy/issues/11052

Bevy's entire update loop is currently running in the about_to_wait() wininit callback; there seems to be a whole bunch of workarounds / bevy ecs events passed to make this functional. The wininit docs mention that about_to_wait() is not supposed to be used for main update loops like this, and WindowEvent::RedrawRequested should be preferred instead.

See ApplicationHandler in bevy_winit::state.rs

Oct 14 '24 23:10 destravous