Waiting for device is unusually slow on Metal

Open yongqli opened this issue 7 months ago • 2 comments

Submitting an empty command buffer to the queue and then polling for the result takes a minimum of 1.2 ms. This seems to be surprisingly slow considering that GPU latency is supposed to be < 10 microseconds.

main.rs:

#[tokio::main]
pub async fn main() -> anyhow::Result<()> {
    let (device, queue) = wgpu::Instance::default()
        .request_adapter(&wgpu::RequestAdapterOptions::default())
        .await?
        .request_device(
            &wgpu::DeviceDescriptor {
                label: None,
                required_features: wgpu::Features::empty(),
                required_limits: wgpu::Limits::downlevel_defaults(),
                memory_hints: wgpu::MemoryHints::Performance,
                trace: wgpu::Trace::Off,
            },
        )
        .await?;
    for _ in 0..5 {
        let _submission = queue.submit([]);
        let start_time = ::std::time::Instant::now();
        assert!(device.poll(wgpu::PollType::Wait).unwrap().wait_finished());
        println!("Polling took {:.3} ms", start_time.elapsed().as_micros() as f64 * 1e-3)
    }
    Ok(())
}

Cargo.toml:

[dependencies]
anyhow = "1"
tokio = { version = "*", features = ["macros", "rt-multi-thread"] }
wgpu = { version = "25", features = ["metal", "vulkan", "wgsl"], default-features = false }

Test was performed on a Mac Studio 2023 (M2).

yongqli avatar May 21 '25 19:05 yongqli

Sounds like you're probably hitting this sleep here: https://github.com/gfx-rs/wgpu/blob/fd6f16f5982bdaba49ca179c114a62d2953acf10/wgpu-hal/src/metal/device.rs#L1550

PollType::Wait is all about waiting for all scheduled GPU work to be done and synchronized back to the CPU, i.e. waiting until the GPU is definitely idle. It is therefore expected to take quite a while and should absolutely not be used to wait for a specific job or resource to become available!

But that 1 ms sleep still seems super silly to me. Busy looping is obviously not great either, but I'd much prefer the other extreme of this being just a thread yield 🤔 But really, why is this not a waitUntilCompleted?
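
To make the trade-off concrete, here is a minimal, std-only sketch of the two waiting strategies being compared. It is not wgpu's actual code: `Backoff`, `wait_until`, `done_on_check`, and `time_wait` are hypothetical names, and the closure passed to `wait_until` merely stands in for "has the GPU work finished yet?". The point it illustrates is that whenever the first check fails, a sleep-based loop pays at least one full sleep interval, no matter how soon the work actually completes.

```rust
use std::thread;
use std::time::{Duration, Instant};

/// How the waiting loop backs off between completion checks.
#[derive(Clone, Copy)]
enum Backoff {
    /// Sleep for a fixed duration between checks.
    Sleep(Duration),
    /// Give up the rest of the time slice, then check again.
    Yield,
}

/// Polls `is_done` until it returns true and returns the number of checks made.
/// `is_done` is a hypothetical stand-in for a GPU completion query.
fn wait_until(mut is_done: impl FnMut() -> bool, backoff: Backoff) -> u32 {
    let mut checks = 1;
    while !is_done() {
        match backoff {
            Backoff::Sleep(d) => thread::sleep(d),
            Backoff::Yield => thread::yield_now(),
        }
        checks += 1;
    }
    checks
}

/// Builds a fake completion check that reports "done" on its `n`-th call.
fn done_on_check(n: u32) -> impl FnMut() -> bool {
    let mut calls = 0;
    move || {
        calls += 1;
        calls >= n
    }
}

/// Times one wait in which the simulated work completes on the second check.
fn time_wait(backoff: Backoff) -> (u32, Duration) {
    let start = Instant::now();
    let checks = wait_until(done_on_check(2), backoff);
    (checks, start.elapsed())
}

fn main() {
    for (name, backoff) in [
        ("sleep(1ms)", Backoff::Sleep(Duration::from_millis(1))),
        ("yield_now", Backoff::Yield),
    ] {
        let (checks, elapsed) = time_wait(backoff);
        println!("{name}: {checks} checks, {:.3} ms", elapsed.as_secs_f64() * 1e3);
    }
}
```

Both variants perform the same two checks here; only the sleep variant is guaranteed to take at least 1 ms, because `thread::sleep` never returns early. The yield variant trades that floor for extra CPU use while it spins.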

@yongqli Can you confirm that it's that sleep causing the delay for you?

> considering that GPU latency is supposed to be < 10 microseconds.

What exactly is this number referring to, and where does it come from? Latency between what, precisely?

Wumpf avatar May 21 '25 20:05 Wumpf

Yes, changing it to thread::yield_now(); reduces the latency to about 48 microseconds.

Now the output is:

Polling took 0.749 ms
Polling took 0.048 ms
Polling took 0.045 ms
Polling took 0.027 ms
Polling took 0.048 ms

yongqli avatar May 21 '25 20:05 yongqli

Closing in favor of #8119

cwfitzgerald avatar Aug 20 '25 21:08 cwfitzgerald