
Add `create_*_pipeline_async()`

Open kpreid opened this issue 1 year ago • 16 comments

The WebGPU specification includes `createComputePipelineAsync()` and `createRenderPipelineAsync()`, and their use is recommended to avoid blocking on pipeline creation. There are currently no corresponding functions in wgpu; presumably there should be.

kpreid avatar May 22 '23 04:05 kpreid

How would we go about implementing this on vk/dx12/metal backends? A dedicated OS-thread for each pipeline?

JMS55 avatar May 27 '23 02:05 JMS55

The best solution would probably require wgpu to use a thread pool, since spawning OS threads for individual jobs like that might have some decent overhead.

If so I'd definitely prefer if wgpu had a way to disable its own thread pool. My understanding is that this is an API that is mostly useful for browser usage, since WebGPU in a browser doesn't yet have any way to do multithreading. In a native context I should just be able to call create_*_pipeline() myself from my own thread pool.
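That userspace approach can be sketched with nothing but the standard library: a tiny job-queue "thread pool" whose workers run boxed closures, with the finished pipeline sent back over a channel. `Pool`, `create_render_pipeline`, and the string result are all stand-ins invented for this sketch, not wgpu API.

```rust
use std::sync::{mpsc, Arc, Mutex};
use std::thread;

// A minimal fixed-size "thread pool": worker threads pulling boxed jobs
// off a shared channel. A stand-in for rayon/tokio in a real application.
type Job = Box<dyn FnOnce() + Send>;

struct Pool {
    tx: mpsc::Sender<Job>,
}

impl Pool {
    fn new(workers: usize) -> Self {
        let (tx, rx) = mpsc::channel::<Job>();
        let rx = Arc::new(Mutex::new(rx));
        for _ in 0..workers {
            let rx = rx.clone();
            thread::spawn(move || loop {
                // The lock guard is dropped before the job runs, so workers
                // only serialize on *receiving* jobs, not on executing them.
                let job = match rx.lock().unwrap().recv() {
                    Ok(job) => job,
                    Err(_) => break, // channel closed: shut down the worker
                };
                job();
            });
        }
        Pool { tx }
    }

    fn spawn(&self, job: Job) {
        self.tx.send(job).unwrap();
    }
}

// Placeholder for the real blocking wgpu call; it returns a string so the
// sketch stays self-contained.
fn create_render_pipeline(desc: &str) -> String {
    format!("pipeline({desc})")
}

fn main() {
    let pool = Pool::new(2);
    let (tx, rx) = mpsc::channel();
    pool.spawn(Box::new(move || {
        tx.send(create_render_pipeline("shadow-pass")).unwrap();
    }));
    // ... do other work while the "compilation" runs ...
    let pipeline = rx.recv().unwrap();
    println!("{pipeline}"); // → pipeline(shadow-pass)
}
```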

PJB3005 avatar May 28 '23 13:05 PJB3005

How is this function implemented in the browser? Do they use thread pools for this?

notgull avatar May 28 '23 14:05 notgull

I wrote the original issue simply as “here is a discrepancy with the spec” without further thought, but here are some further thoughts:

My use of wgpu targets web and desktop (as is probably desirable for many games). Therefore, I want to be able to have good performance (no event loop hiccups) on web, which means that I must use async. Conveniently, the web environment already “has an async executor”, inherently, and wgpu-on-web targets that executor (unavoidably, insofar as it has anything async at all).

This together with @PJB3005's point suggests that the solution here and for similar situations might be for wgpu to allow plugging in an executor in some fashion; to be handed a “spawn_blocking()” function that it can use when some operation is supposed to be async but this is not natively supported. That way, wgpu would not need to manage its own thread pool.

Certainly this could be done by a layer on top of wgpu, but it is annoying to have to design and maintain additional code just to get what WebGPU specifies to be possible. Also, in particular, if wgpu had an executor (or thread pool) available to it, then it could make map_async() much more straightforward to use — right now it is awkward to use in a truly async fashion without potentially missing wakeups.
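A rough std-only sketch of the pluggable `spawn_blocking()` hook idea (all names here are hypothetical, not real wgpu API): the library funnels blocking work through a user-supplied callback when one exists, and otherwise degrades to doing the work inline on the calling thread.

```rust
use std::sync::mpsc;
use std::thread;

// Hypothetical hook type: the application hands the library a function it
// can use whenever an operation is supposed to be async but the backend
// only has a blocking implementation.
type SpawnBlocking = Box<dyn Fn(Box<dyn FnOnce() + Send>) + Send + Sync>;

// Stand-in for the blocking backend work of compiling a pipeline.
fn create_pipeline_blocking(desc: String) -> String {
    format!("pipeline({desc})")
}

// Run the blocking work through the hook if one was supplied; otherwise
// fall back to doing it inline.
fn create_pipeline_async(hook: Option<&SpawnBlocking>, desc: String) -> mpsc::Receiver<String> {
    let (tx, rx) = mpsc::channel();
    let work: Box<dyn FnOnce() + Send> = Box::new(move || {
        tx.send(create_pipeline_blocking(desc)).unwrap();
    });
    match hook {
        Some(spawn) => spawn(work),
        None => work(), // no executor plugged in: blocks the caller
    }
    rx
}

fn main() {
    // Plug in a trivial "executor": one OS thread per task.
    let hook: SpawnBlocking = Box::new(|task| {
        thread::spawn(task);
    });
    let rx = create_pipeline_async(Some(&hook), "hud".into());
    println!("{}", rx.recv().unwrap()); // → pipeline(hud)
}
```

A real version would return a `Future` rather than a raw receiver, but the shape of the hook is the same.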

kpreid avatar May 28 '23 16:05 kpreid

These would be very useful for us; we bulk-create a lot of pipelines, and it's a significant cost on some machines to do that concurrently - especially with the WebGL backend. I've been tempted to try to write a hypothetical "DeviceExt::create_render_pipeline_bulk()" method, but async would solve it much better.

Dinnerbone avatar May 31 '23 19:05 Dinnerbone

We definitely don't want to spawn or manage threads on our own. In line with the other async functions, a simple native implementation of this API would call the standard create_*_pipeline and then return an already-resolved future.

This is a hard decision to make, as it's very hard to paper over the differences between native and web.

cwfitzgerald avatar Jun 01 '23 05:06 cwfitzgerald

So we've been musing about this same problem in the webgpu.h standardization meetings, and we have come up with a possible solution that we're asking for feedback on. The rough C solution is here, but I will translate it to the Rust API:

type Task = Box<dyn FnOnce() + Send>;
type TaskCallback = Box<dyn Fn(Task) + Send + Sync>;

// Maybe actually the device
struct InstanceDescriptor {
    ...
    // Callback which will be called when the implementation wants to do work on another thread.
    // If this is not provided, the implementation will not do any work on any threads.
    //
    // The callback will be called with the task that the runtime wants to do on a thread.
    // This task should be spawned onto a threadpool, immediately invoked inside the callback, or otherwise
    // made to execute. 
    //
    // It should be assumed that work spawned via this callback will take substantial time (1ms+) and be pure compute.
    task_executor: Option<TaskCallback>,
    ...
}

impl Device {
    // On webgpu will call createRenderPipeline.
    // 
    // On native will:
    // - If `allow_async` is false, the render pipeline is created inside the call.
    // - If `allow_async` is true, the implementation is allowed (but not required) to spawn a
    //   job on the task callback (if one exists) to do the compilation work. This leads to
    //   less predictable latency but better overall performance, as compilation is parallelized.
    fn create_render_pipeline(&self, desc: RenderPipelineDescriptor, allow_async: bool) -> RenderPipeline;

    // On webgpu will call createRenderPipelineAsync.
    // 
    // On native will:
    // - Spawn a job on the instance's `task_executor` if it exists to generate the pipeline. Otherwise:
    // - Create the render pipeline inside the call.
    async fn create_render_pipeline_async(&self, desc: RenderPipelineDescriptor) -> RenderPipeline;
}

This api should allow people to use arbitrary integrations:

let desc = InstanceDescriptor {
    ...
    task_executor: Some(Box::new(|task| { tokio::task::spawn_blocking(task); })),
};
let desc = InstanceDescriptor {
    ...
    task_executor: Some(Box::new(|task| rayon::spawn(task))),
};
let my_fancy_threadpool_spawner: Arc<T> = ...;
let desc = InstanceDescriptor {
    ...
    task_executor: Some(Box::new(move |task| my_fancy_threadpool_spawner.spawn(task))),
};

Looking forward to people's thoughts on this. This kind of design will also open the door to other possible optimizations like this.

cwfitzgerald avatar Jun 29 '23 21:06 cwfitzgerald

Why do we have to worry about this at all? Why can't the user just

  • put their Device in an Arc or use scoped threads or whatever
  • create their own thread or get one from a pool however they like
  • call Device::create_compute_pipeline there, and
  • send it back to the thread that needs it when it's done?
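That recipe can be sketched with the standard library alone. `Device` here is a placeholder for `wgpu::Device` (which is `Send + Sync`, so the real type can be shared through an `Arc` the same way), and the string result stands in for an actual pipeline:

```rust
use std::sync::{mpsc, Arc};
use std::thread;

// Placeholder standing in for wgpu::Device.
struct Device;

impl Device {
    fn create_compute_pipeline(&self, desc: &str) -> String {
        format!("pipeline({desc})") // stand-in for the blocking creation
    }
}

fn main() {
    // Step 1: put the device in an Arc so a worker thread can share it.
    let device = Arc::new(Device);
    let (tx, rx) = mpsc::channel();

    // Steps 2-3: hand the Arc'd device to a worker thread and create there.
    let dev = Arc::clone(&device);
    thread::spawn(move || {
        tx.send(dev.create_compute_pipeline("particles")).unwrap();
    });

    // Step 4: receive the finished pipeline back on the original thread.
    let pipeline = rx.recv().unwrap();
    println!("{pipeline}"); // → pipeline(particles)
}
```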

jimblandy avatar Jun 29 '23 21:06 jimblandy

In other words - we have this thread-safe API, so the whole point is that the user can deal with that stuff themselves. @kpreid, can't you just write your own create_compute_pipeline_async? It's not like wgpu is any better at threading than you are.

jimblandy avatar Jun 29 '23 21:06 jimblandy

There's a couple considerations here that are pushing towards having internal handling:

  • Unifying the behavior of create_*_pipeline_async on both wasm and native. The async version is the preferred one on the web, and this would make it consistent on native.
  • Dawn really wants to make the non-async version of pipeline creation do as much work in parallel as possible. We want to have a C backend on top of webgpu.h, and having this exposed means that you can hook it into your own thread pool.

The question of "why do we care at all" is still a good one.

cwfitzgerald avatar Jun 29 '23 22:06 cwfitzgerald

It pretty much boils down to the WASM implementations. For Jim's userspace solution to work on WASM:

  • There needs to be a multithreading Web API (or the WASM binding implementation needs to proxy the pipeline-creation call back to the device's thread)
  • To actually get any benefit out of it, the browser must implement a multithreading API and also have multiple GPU-process threads, either 1:1 with JS threads or with its own thread pool and scheduler, such that the pipeline creation doesn't block other work

Additionally, on WASM, the extra thread needed to initiate that pipeline creation is useless - the actual parallelization is happening in the GPU process, so an extra JS thread is wasted overhead. And JS threads are very expensive compared to native threads. So it works fine if you have that thread already, but it's detrimental if you didn't actually need it.

Hence I think it's best for the default experience in native to match the experience in JS (or other remoting implementations) closely where possible.

kainino0x avatar Jun 29 '23 22:06 kainino0x

Okay - I understand what I wasn't getting before.

  • The WebGPU create*PipelineAsync functions expose parallelism not otherwise available to web content: at least in today's browsers, content cannot just fork off another thread, call GPUDevice.create*Pipeline (sans Async), and send the pipeline back to the caller once it's usable - but create*PipelineAsync essentially gets you the same effect.

  • That means that this new parallelism can only be offered to programs using wgpu on the web if wgpu's web backend calls those functions.

  • That means that wgpu's own API must include those functions.

  • Since wgpu wants to provide a common API for web and native, wgpu's native API must now support these functions (even though native-only code should always prefer to just take care of the matter itself, as I suggested above).

  • Connor's proposal assumes that native wgpu is just going to fork off a thread, and lets wgpu's user set the policy for doing so, in a way that should only impact setup code.

jimblandy avatar Jun 30 '23 00:06 jimblandy

Additionally, on WASM, the extra thread needed to initiate that pipeline creation is useless - the actual parallelization is happening in the GPU process, so an extra JS thread is wasted overhead.

Right - a separate thread in the content process invites the use of a separate thread in the GPU process, but doesn't require it, so it's useless.

jimblandy avatar Jun 30 '23 00:06 jimblandy

I like the general idea. How would a user of the API know when a task is done?

nical avatar Jul 06 '23 14:07 nical

The user would need to have their own signalling as part of the function provided to the hook.
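A small sketch of that user-side signalling, with all names invented for illustration: each task handed to the executor hook is wrapped so it bumps an atomic counter when it finishes (a channel or Condvar would work equally well). Here plain `thread::spawn` stands in for the hook.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;

// Completion signalling lives in the user's wrapper around each task handed
// to the executor hook, not in the library itself.
fn run_tasks(total: usize) -> usize {
    let done = Arc::new(AtomicUsize::new(0));
    let handles: Vec<_> = (0..total)
        .map(|_i| {
            let done = done.clone();
            // The closure the user would pass into the executor hook:
            // do the (stand-in) compilation work, then signal completion.
            thread::spawn(move || {
                done.fetch_add(1, Ordering::SeqCst);
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    done.load(Ordering::SeqCst)
}

fn main() {
    println!("{} of 3 pipelines compiled", run_tasks(3)); // → 3 of 3 pipelines compiled
}
```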

cwfitzgerald avatar Sep 19 '23 18:09 cwfitzgerald

Just to add a reason why this is needed on native: Metal is weird and will block on pipeline creation unless you pass in a callback at creation time. (Bevy's async pipeline compilation ran into this with wgpu's existing create_*_pipeline(), and it ended up being quite a bit slower to try to create Metal pipelines asynchronously/in parallel.)

Elabajaba avatar Feb 20 '24 16:02 Elabajaba

VK_KHR_deferred_host_operations may be interesting as API design prior art and/or a backend on which this might be implemented.

Ralith avatar Mar 05 '24 23:03 Ralith