
Very High CPU Usage with wgpu >= 0.20.0

Open myth0genesis opened this issue 1 year ago • 17 comments

Description There is very high single-core CPU usage in wgpu 0.20.0 and later.

Repro steps

  • Run an example that requires regular redraws (for example, skybox) in any version of wgpu before 0.20.0 and observe CPU usage.
  • Run the same example in wgpu 0.20.0 or later and observe much higher CPU usage.

Expected vs observed behavior Attached is a video where I run the skybox example from the wgpu repo, first with version 0.19.4 and then with wgpu 0.20.2, and then run both again in the same order while keeping a CPU monitor open to observe the effects on CPU usage. Single-core CPU usage spikes to at or near 100% with wgpu 0.20.2 (and later). With wgpu 0.19.4 and earlier, however, CPU usage for the skybox example stays at near-idle levels.

Extra materials The video I mention is attached below, as well as the first page of the perf reports for the skybox example for both wgpu 0.19.4 and wgpu 0.20.2 in human-readable plaintext format. High_CPU_wgpu.webm Perf_Report_wgpu-0.19.4.txt Perf_Report_wgpu-0.20.2.txt

Platform Operating System: Kubuntu 24.04 KDE Plasma Version: 5.27.11 KDE Frameworks Version: 5.115.0 Qt Version: 5.15.13 Kernel Version: 6.8.0-45-generic (64-bit) Graphics Platform: X11 Processors: 12 × 12th Gen Intel® Core™ i9-12900H Memory: 31.1 GiB of RAM Graphics Processor: NVIDIA GeForce RTX 3070 Ti Laptop GPU/PCIe/SSE2 System Version: REV:1.0

myth0genesis avatar Sep 28 '24 18:09 myth0genesis

My suspicion is that this is related to the Vulkan backend. I've built and run some of the Vulkano examples from their most recent release and noticed the top offenders there also seem to be mutexes and printfs, similar to what the perf report for wgpu 0.20.0 and later shows.

myth0genesis avatar Sep 28 '24 18:09 myth0genesis

Very likely the same as

  • https://github.com/emilk/egui/issues/5092
  • https://github.com/gfx-rs/wgpu/issues/5756

Can you check what happens when you disable validation? It's disabled automatically in release builds; for debug builds it's opt-out.
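For reference, an environment-variable opt-out like `WGPU_VALIDATION` works roughly along these lines. This is a stdlib-only sketch: the exact values wgpu accepts (and the real parsing, which I believe lives in `InstanceFlags::with_env`) are assumptions, not confirmed API details.

```rust
use std::env;

// Hypothetical parsing of a WGPU_VALIDATION-style opt-out. wgpu's actual
// accepted values may differ; "0"/"false" disabling is an assumption.
fn validation_enabled(var: Option<&str>, default_on: bool) -> bool {
    match var {
        Some("0") | Some("false") => false, // explicit opt-out
        Some(_) => true,                    // any other value opts in
        None => default_on,                 // fall back to build config
    }
}

fn main() {
    let var = env::var("WGPU_VALIDATION").ok();
    // Debug builds default validation on, release builds off.
    let on = validation_enabled(var.as_deref(), cfg!(debug_assertions));
    println!("validation enabled: {on}");
}
```

The point of checking this is to separate validation-layer overhead (covered by the linked issues) from whatever else is burning CPU.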

Wumpf avatar Sep 28 '24 18:09 Wumpf

I've already tried the fixes suggested in those issues, which is what prompted me to open this new one. I may have bungled the disabling-validation bit, because I'm not sure that's the correct way to do it. I tried running the example with the environment variable set, as well as following the instructions here, where I commented out the relevant lines in wgpu/wgpu-hal/src/vulkan/instance.rs. Running in release still shows the same high CPU usage (though it's sometimes slightly lower than without WGPU_VALIDATION=0), and the perf report shows the same top offenders. Either way, the CPU usage, which is sometimes at 75% and sometimes at 100%, is still orders of magnitude higher than in older versions of wgpu. Attached is a video showing the CPU usage when running in release with validation turned off via the environment variable.

High_CPU_wgpu_No_Validation.webm

myth0genesis avatar Sep 28 '24 19:09 myth0genesis

Thanks for the follow-up! A bit of relevant context that I know of: between 0.19 and 0.20 a bunch of fixes landed that changed how synchronization is done on Vulkan - in fact it was pretty buggy before. That also matches up well with the perf logs you attached:

9.57%  wgpu-examples    libc.so.6                          [.] pthread_mutex_lock@@GLIBC_2.2.5
8.79%  wgpu-examples    libc.so.6                          [.] pthread_mutex_unlock@@GLIBC_2.2.5

Looks like the internal spinning optimization of libc is now being hit hard 🤔 (afaik libc first spins for a bit before making the syscalls to yield to the scheduler). A bunch of next investigation steps I can think of:

  • which of the involved locks are the ones causing so much CPU usage instead of yielding
    • if the "it's spinning locks" hypothesis is right: would a lower-level, non-spinning lock implementation avoid this issue (likely at the cost of unwanted delays)?
  • can we also reproduce this with Vulkan on Windows? If not, form hypotheses for why not and try to test them
  • is the Vulkan synchronization still wrong after all? And if it isn't, how do other applications/samples usually handle this?
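To make the spinning hypothesis concrete, here is a toy illustration (not wgpu code): a waiter that retries `try_lock()` stays runnable and burns a core the whole time the lock is held, whereas a plain `lock()` lets glibc spin briefly and then park the thread in the kernel via futex. The timings and thread structure here are arbitrary.

```rust
use std::sync::Mutex;
use std::thread;
use std::time::Duration;

// Busy-wait until the mutex is free, counting how many failed try_lock()
// attempts it took. Each failed attempt is CPU time that a parked waiter
// would not have spent.
fn spin_until_unlocked(m: &Mutex<()>) -> u64 {
    let mut spins = 0;
    while m.try_lock().is_err() {
        spins += 1;
        std::hint::spin_loop();
    }
    spins
}

fn main() {
    let m = Mutex::new(());
    thread::scope(|s| {
        let guard = m.lock().unwrap(); // hold the lock on the main thread
        s.spawn(|| {
            // This thread spins for the full 20 ms the lock is held.
            let spins = spin_until_unlocked(&m);
            println!("spun {spins} times waiting for the lock");
        });
        thread::sleep(Duration::from_millis(20));
        drop(guard); // release; the spinner acquires and exits
    });
}
```

If the perf hotspots really are spin phases inside `pthread_mutex_lock`, sample-type profiling should show the time attributed to user-space libc code rather than to a blocked syscall.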

Wumpf avatar Sep 29 '24 08:09 Wumpf

I appreciate the quick response. I don't know for sure whether it's something exclusive to wgpu. It's not an apples-to-apples comparison, as I don't know enough about Vulkan to understand how frame pacing works in any meaningful detail and the examples are obviously not the same, but I ran the triangle example from the most recent release of Vulkano and saw high CPU usage there, too. I've attached the first page of the perf report; you might be interested to see that the list of top offenders looks very familiar. Perf_Report_Vulkano.txt

myth0genesis avatar Sep 29 '24 17:09 myth0genesis

Okay, scratch that last comment. I no longer think it's to do with the Rust Vulkan bindings. I should've looked beforehand, but I just today learned that wgpu uses the Ash Vulkan bindings. So I ran the triangle example in Ash 0.37.1, the version first used by wgpu 0.20.0, and observed no high CPU usage. Attached is a video showing the results:

CPU_Usage_Ash.webm

myth0genesis avatar Sep 29 '24 20:09 myth0genesis

Very likely the same as

* [100% CPU usage since the update to wgpu 0.20 emilk/egui#5092](https://github.com/emilk/egui/issues/5092)

* [Debug Performance Regressions on 0.20 Vulkan #5756](https://github.com/gfx-rs/wgpu/issues/5756)

Possibly also related:

I'm also seeing high CPU usage, but on something that was already CPU bound. So I see it as a frame rate drop and jank. The symptoms are different but the underlying cause may be related. This is on Vulkan, too.

Has anyone run Tracy on this high CPU usage problem? When I do that, I see that too much time is going into binding. But I didn't run sample-type profiling, which would tell whether too much time is being lost in lock primitives.

User-space spinlocks are prone to something called futex congestion collapse in multi-thread programs. But that shouldn't happen when you have fewer threads than CPUs. What's going on with these simple benchmarks in terms of threads vs. CPUs?
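A quick way to answer the threads-vs-CPUs question is a stdlib-only check like the one below: `available_parallelism()` reports the logical CPU count, and on Linux `/proc/self/status` reports the current process's thread count (the `/proc` parsing is Linux-specific).

```rust
use std::fs;
use std::thread;

// Logical CPU count as seen by the process (respects cgroup/affinity limits
// where the platform exposes them).
fn logical_cpus() -> usize {
    thread::available_parallelism().map(|n| n.get()).unwrap_or(1)
}

fn main() {
    println!("logical CPUs: {}", logical_cpus());
    // Linux-only: the kernel reports the live thread count of this process.
    if let Ok(status) = fs::read_to_string("/proc/self/status") {
        if let Some(line) = status.lines().find(|l| l.starts_with("Threads:")) {
            println!("{line}");
        }
    }
}
```

Running this (or reading the same numbers from `htop`) alongside the repro would tell whether the simple examples ever have more runnable threads than cores, which is the precondition for the congestion-collapse scenario.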

John-Nagle avatar Nov 06 '24 19:11 John-Nagle

As mentioned above, I already checked CPU usage with validation disabled and when running in release, and the issue still remained, as shown in the attached videos, so I'm doubtful it's related to those issues. As for the wgpu 0.22+ issue, this problem was present for me from 0.20.0 onward, which you can also see in the comparison videos I posted. I also recently ran the examples on Arch on X11 using a different scheduler, and the problem is still there. I wasn't able to provide a comprehensive performance log previously because of how slow Ubuntu was at compiling the reports, but it shouldn't be so bad now that I'm on Arch. I'll try to take some samples and post the results here in the next week or so.

myth0genesis avatar Nov 06 '24 20:11 myth0genesis

Any updates on this? Even running in release mode (which, according to the other issues, should fix this), the CPU is still pinned at 100%.

coredoesdev avatar Dec 09 '24 02:12 coredoesdev

I think our main blocker is getting some more information - the hotspot information is useful, but does not really explain how we're hitting 100% on a single core when we should be blocking on vsync.

cwfitzgerald avatar Dec 09 '24 04:12 cwfitzgerald

Got it. Anything I can provide to be helpful? I've got a super minimal setup (just a textured quad moving around), around 300 lines, so I can grab some perf data if that would help.

coredoesdev avatar Dec 09 '24 23:12 coredoesdev

Alternatively, is there a workaround in the meantime? It's locking up my entire machine every time I start my program; it's unsustainable. GL apparently just doesn't work on NVIDIA cards, otherwise I'd use that.

coredoesdev avatar Dec 10 '24 00:12 coredoesdev

Interesting thing: I got a buddy running Void Linux to try this, and he saw normal CPU usage (2-5%). Could this be some strange interaction with systemd somehow? It wouldn't make sense, but it's the primary difference between his setup and the others posted in this thread.

coredoesdev avatar Dec 10 '24 01:12 coredoesdev

Probably a different problem than the one I reported. I'm drawing enough stuff that I hit 100% CPU utilization on the render thread with the GPU only 25% loaded, but that's because I'm drawing a lot. If I draw less stuff, CPU usage goes down. It's all working, just not fast enough.

I've got a super minimal (just a textured quad moving around) setup

If you're hitting 100% CPU drawing that, but it still draws, maybe there's no frame rate limit and you're drawing at some insane frame rate, faster than the display.
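As a crude stopgap while the underlying issue is open, a sleep-based frame cap keeps a render loop from spinning flat out when presentation isn't actually blocking. This is a sketch, not a vsync fix; the target rate and frame count are arbitrary.

```rust
use std::thread;
use std::time::{Duration, Instant};

// Run `frames` iterations, sleeping out the remainder of each frame's time
// budget so the loop never exceeds `target_fps`. Returns total wall time.
fn run_capped(target_fps: u32, frames: u32) -> Duration {
    let budget = Duration::from_secs(1) / target_fps;
    let start = Instant::now();
    for _ in 0..frames {
        let frame_start = Instant::now();
        // ... render + present would go here ...
        if let Some(remaining) = budget.checked_sub(frame_start.elapsed()) {
            thread::sleep(remaining); // yield the core instead of spinning
        }
    }
    start.elapsed()
}

fn main() {
    // 30 frames capped at 60 fps should take roughly half a second.
    println!("took {:?}", run_capped(60, 30));
}
```

`thread::sleep` only guarantees a minimum sleep, so this drifts a little below the target rate; tools like MangoHud's frame limiter do the same job externally.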

John-Nagle avatar Dec 10 '24 03:12 John-Nagle

That's what I thought too, so I added frame-time tracing: fairly constant 60 fps (within a tenth). Incidentally, my NVIDIA driver crashed a few minutes ago, and when I tried again it was using 350% CPU (as my other threads generating data to be rendered started getting activated) and rendering at a much higher frame rate. So it does appear to be blocking on vsync, just in some wildly inefficient way. Basic profiling shows the overwhelming majority of CPU time being spent in get_time; I suspect this is tracing_subscriber at play, so I'm ignoring it. The rest of the time is spent in various C libraries with no debugging symbols (probably internals of the NVIDIA Vulkan driver). Around 25% of CPU time was spent in mutex_lock. Interestingly, only around 1% of CPU time was in actual Rust code, which suggests to me that something in Vulkan is spinning (induced by wgpu?).

I can upload a perf file tomorrow if that would be helpful.

coredoesdev avatar Dec 10 '24 03:12 coredoesdev

Just thought I'd offer another update on this. Apologies for the terribly delayed message and the lack of the promised perf report. I've switched to Vulkano for my renderer since the last time I posted here, but after seeing my productivity tank because of the verbosity of that API, I decided to try building and running the skybox example in the newest version of wgpu (25.0.1) on Wayland (Hyprland), as that's my daily driver now. The CPU usage was fine, at around 3-5%.

So I booted up Plasma on X11 to give it another shot, in case the issue had been fixed in the newest version of wgpu, but the CPU usage was very high again. It occurred to me to artificially limit the frame rate using MangoHud, so I ran the example with the frame rate capped at 60 fps, and the CPU usage was at a reasonable 3-5%, just like on Wayland without a frame rate limiter. I then tried MangoHud on X11 once more, this time capping the frame rate at 144 fps to rule out the possibility that running at my monitor's refresh rate (144 Hz) would still cause 100% single-core CPU usage, and the CPU usage was still at a baseline 3-5%.

Maybe this indicates an issue with V-Sync in Vulkan on X11? Without a frame rate limiter, the frame times reported in the console still indicate 144 fps, matching my monitor's refresh rate, but maybe that only reflects the surface's presentation rate and not the renderer's actual update rate?

Summary: Running the skybox example on X11 without a frame rate limiter, the console indicates a frame rate of 144 fps, but single-core CPU usage is at 100%. Running the same example on X11 with the frame rate limited to 144 fps using MangoHud shows single-core CPU usage hovering at around 3-5%.
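One way to test the presentation-rate hypothesis is to count raw iterations of the render loop per second, independently of what the frame-time tracing reports. A stdlib-only sketch (the closure stands in for render + present; names are illustrative):

```rust
use std::time::{Duration, Instant};

// Run `frame` repeatedly for `window` and return the measured loop rate in Hz.
// If this is far above the monitor's refresh rate, present() isn't blocking.
fn measure_loop_hz(mut frame: impl FnMut(), window: Duration) -> f64 {
    let start = Instant::now();
    let mut iterations = 0u64;
    while start.elapsed() < window {
        frame(); // render + present would go here
        iterations += 1;
    }
    iterations as f64 / start.elapsed().as_secs_f64()
}

fn main() {
    // With an empty frame body the loop runs unbounded, like a present
    // call that returns without waiting for vblank.
    let hz = measure_loop_hz(|| {}, Duration::from_millis(100));
    println!("loop rate: {hz:.0} Hz");
}
```

If the loop rate matches 144 Hz while a core is still pegged, the waiting itself is burning CPU (spinning); if the loop rate is far higher, the present path isn't throttling at all.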

myth0genesis avatar Apr 25 '25 21:04 myth0genesis

Just bumping this to let you know I'm still experiencing this problem on X11 with 25.0.2. I recorded perf reports in Plasma under both X11 and Wayland to hopefully make things easier to compare. Both reports cover a period of ~5 seconds, but I immediately noticed the X11 report generated more than six times as many events as the Wayland report. Here are the commands I used to set up and record the reports:

sysctl kernel.perf_event_paranoid=2
perf record -g --call-graph dwarf -F 144 ./the_program

Don't mind the name of the program; it's just the minimal triangle example one would get from following along with the newest version of the Learn Wgpu tutorial. GitHub won't let me post the .data files, so I changed the extension to .txt to be able to drop them here. You'll have to change the extension back to .data to view them with perf.

If you need me to set some different flags and record another set of reports, feel free to let me know.

perf_Wayland.txt perf_X11.txt

myth0genesis avatar Jun 09 '25 04:06 myth0genesis

I'm also experiencing this with very minimal repro steps. ~~My instinct is this is more on the winit side~~

  1. Open wgpu-example-02-hello-window
  2. cargo run (--release)
  3. 100% CPU usage immediately

I've tested this on Windows with Vulkan and on Linux with Vulkan and only experience it on the latter.

izzymg avatar Jul 08 '25 08:07 izzymg

My instinct is this is more on the winit side.

I just want to avoid sending the maintainers on a wild goose chase, so I'll say that it's likely not a winit problem. I've used ash and newer versions of Vulkano with winit on Linux on X11 without this issue (though I did experience an issue with an older version of Vulkano).

myth0genesis avatar Jul 08 '25 14:07 myth0genesis