Add initial support for WebGPU
This adds the core of a new WebGPU backend. There are still many holes in the functionality and rough edges around how this interacts with Emscripten, native WebGPU implementations, and testing. Nevertheless, I'm opening this after offline discussion with @steven-johnson to get some initial feedback on this backend (and to raise awareness that this is happening).
While a lot of the implementation still remains to be done, this PR provides enough to run a 32-bit version of `apps/blur`, the `gpu_only` AOT generator test, and the `correctness_bounds` JIT test. I've also got the HelloWasm app running with the `render` pipeline targeting the GPU (and the other two pipelines still using WASM CPU).
For testing, the AOT generator tests can be made to work for `wasm-32-wasmrt-webgpu` when using Node bindings for Dawn (Chromium's native implementation of WebGPU). Both the AOT tests and the JIT tests can also work for `host-webgpu` when using Dawn. JIT is not currently supported for `wasm-32-wasmrt-webgpu`.
Unlike the other GPU backends, I'm not employing the `dlopen`/`dlsym` approach in the runtime for getting the API functions. I'm not sure how to make this work when using Emscripten, and since using Dawn directly is only really needed for testing purposes it doesn't seem *too* onerous to require direct linking, but I'm open to opinions and suggestions here.
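For context, here is a minimal sketch of the two approaches being contrasted, assuming the standard `webgpu.h` C API and POSIX `dlopen`/`dlsym`; the library name and the `WgpuApi` struct are illustrative only and are not how this PR (or the existing backends) actually structures the code:

```cpp
// Runtime lookup, roughly what the other GPU backends do: resolve entry
// points lazily at runtime. (Illustrative only.)
#include <dlfcn.h>
#include <webgpu/webgpu.h>

struct WgpuApi {
    decltype(&wgpuCreateInstance) create_instance = nullptr;
};

bool load_wgpu_api(WgpuApi *api) {
    // Hypothetical library name, for illustration.
    void *lib = dlopen("libwebgpu_dawn.so", RTLD_LAZY);
    if (!lib) return false;
    api->create_instance =
        reinterpret_cast<decltype(&wgpuCreateInstance)>(dlsym(lib, "wgpuCreateInstance"));
    return api->create_instance != nullptr;
}

// Direct linking, as this PR does: call the symbol directly and let the
// linker (or Emscripten's JS glue) provide the implementation.
WGPUInstance make_instance() {
    WGPUInstanceDescriptor desc = {};
    return wgpuCreateInstance(&desc);
}
```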
Another pain point right now is that the C++ API for WebGPU is not currently stable between different implementations, so there is a build-time switch to toggle between targeting Emscripten vs Dawn. I'm optimistic that this requirement will eventually go away.
There are still some patches that need to land in both Emscripten and Dawn before this backend will work for anyone else, so I'll leave this PR as a draft until those are resolved.
All feedback is very welcome! I've joined the Gitter room as well.
(I took the liberty of pacifying clang-tidy and bringing up to date with top-of-tree; please let me know if you'd prefer I avoid doing changes of this sort here in the future, but these seemed harmless)
Maybe I missed it here, but could you add something (either in this PR or in a README, or both) about what is needed to configure this for local testing? e.g. how do I install and configure Dawn for use, what test(s) are expected to work or not work, etc? I'd like to pull this locally and try it out but not sure about the details.
Sorry for the delay - I've added a `README_webgpu.md` file that should cover this. These are the tests that are currently expected to work:
- correctness_bounds
- correctness_func_lifetime
- correctness_func_lifetime_2
- correctness_gpu_give_input_buffers_device_allocations
- correctness_gpu_jit_explicit_copy_to_device
- correctness_hello_gpu
- correctness_loop_invariant_extern_calls
- correctness_lots_of_loop_invariants
- correctness_parallel_gpu_nested
- generator_aot_gpu_only
It's a pretty small set, but I expect that many more will pass without too much additional effort. If you'd like a larger set to pass before landing this PR, just let me know.
Any updates on this? How close are we getting to complete-enough functionality to do real testing?
This PR is now passing 60/91 of the GPU-enabled correctness tests, and there are 9 that I think are n/a for WebGPU. So 22 actual test failures remaining.
One of the main remaining pieces of work here is to handle buffer cropping/slicing and non-contiguous copies. Dynamically sized GPU tiles and shared memory regions are another source of failures.
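To make concrete what a non-contiguous copy involves, here is a minimal sketch (not code from this PR) of copying a cropped 2-D region row by row: the rows of a crop are generally not adjacent in memory, so a single flat buffer-to-buffer copy is not enough.

```cpp
#include <cstdint>
#include <cstring>

// Copy a width x height crop out of a larger row-major image whose full row
// length is src_stride elements. Each row of the crop is contiguous, but the
// rows themselves are not, so the copy has to be done per row (or per
// contiguous chunk) rather than as one memcpy / one GPU buffer copy.
void copy_cropped_rows(const int32_t *src, int64_t src_stride,
                       int32_t *dst, int64_t dst_stride,
                       int64_t width, int64_t height) {
    for (int64_t y = 0; y < height; y++) {
        std::memcpy(dst + y * dst_stride,
                    src + y * src_stride,
                    width * sizeof(int32_t));
    }
}
```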
I've been working with my colleagues to try to figure out a way to expose functionality that would allow Halide to implement 16-bit and 8-bit integers more efficiently. If the slow 8/16-bit performance isn't considered a blocker for landing this PR though, I can get back to knocking out the remaining test failures in the meantime.
implement 16-bit and 8-bit integers more efficiently
IIRC, @shoaibkamil / @slomp did something like that for the D3D12 backend, but I don't know the details (or am misremembering). Not sure if it's something that could be recycled and/or shared. Pinging them here for response :-)
Yes, we've been looking closely at how the D3D12 backend handles this stuff (via typed UAV loads/stores). The issue right now is that WebGPU does not currently expose an equivalent feature, so we either need to add such functionality to WebGPU itself, or do something different. Hoping to have a path forward soon though.
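For readers unfamiliar with the issue: without typed small-integer loads/stores, sub-32-bit values have to be packed into and unpacked from 32-bit words by hand. A rough C++ illustration of the bit manipulation a generated shader would need (not code from this PR, and not WGSL):

```cpp
#include <cstddef>
#include <cstdint>

// Read the i-th 16-bit element from storage that is only addressable as
// 32-bit words: load the containing word, then shift and mask.
uint16_t load_u16(const uint32_t *words, size_t i) {
    uint32_t word = words[i / 2];
    return static_cast<uint16_t>((i % 2) ? (word >> 16) : (word & 0xffffu));
}

// Writing is worse: it becomes a read-modify-write of the containing word,
// which is also a hazard if two lanes touch the same word concurrently.
void store_u16(uint32_t *words, size_t i, uint16_t value) {
    uint32_t word = words[i / 2];
    if (i % 2) {
        word = (word & 0x0000ffffu) | (uint32_t(value) << 16);
    } else {
        word = (word & 0xffff0000u) | uint32_t(value);
    }
    words[i / 2] = word;
}
```

Besides the extra ALU work, that read-modify-write on stores is the main reason a proper typed-access feature in WebGPU would help here.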
Any update on this?
Nothing lately, but I expect to be back working on this again soon.
What's the status on this PR -- is activity likely to be resumed anytime soon?
This PR should be ready for review and further testing by others, as it implements the baseline functionality that we had previously agreed on in a call.
On my macOS machine using Dawn native, all correctness tests are passing except:
- correctness_gpu_dynamic_shared
- correctness_gpu_non_monotonic_shared_mem_size
- correctness_gpu_param_allocation
- correctness_gpu_reuse_shared_memory
- correctness_gpu_specialize
- correctness_interpreter
- correctness_isnan
These failures are caused by dynamically sized shared memory (supported by WebGPU but not yet implemented in Dawn/Chrome), lack of NaN support (discussed previously), and exceeding maximum allocation sizes for device memory and shared memory.
I'm currently on leave until the end of October, and will be able to address merge conflicts / review comments when I return. I'll also then start looking at ways to improve performance for small integer types.
The WebGPU backend now passes all correctness tests except `correctness_isnan` on my macOS machine using Dawn native on an Intel GPU and an AMD GPU. ~~There are a few failures on AMD due to a Metal compiler bug.~~
(EDIT: Resolved the AMD issues)
This PR should now be ready for review and further testing by others. Everything that was previously agreed for this backend's MVP has now been implemented.
Thanks for the review! I'll address the comments as soon as I can.
What OS(s) should we target for testing? I'm assuming maybe OSX and Linux-x64?
I've been testing locally on macOS (AMD + Intel). I've just given it a whirl on Linux, but I'm getting some test failures. It looks like we have at least one bug in Dawn's Vulkan backend that is causing issues, which I'll try and get sorted.
Is macOS enough coverage to get started with? If not I'll try and get everything working on Linux as soon as I can.
Nope, OSX should be fine. I assume that both x86 and arm variants should work?
I assume that both x86 and arm variants should work?
I haven't got an Arm-based macbook to test on, but I have no reason to believe that it wouldn't work (i.e. Dawn has been tested by others on M1 macbooks and works).
Hi there -- I'm (finally) looking at getting testing in place, so we can land this, and some of the comments from your original post may or may not be out of date since late 2021:
For testing, the AOT generator tests can be made to work for wasm-32-wasmrt-webgpu when using Node bindings for Dawn. Both the AOT tests and JIT tests can also work for host-webgpu when using Dawn.
JIT is not currently supported for wasm-32-wasmrt-webgpu.
I assume this is still the case?
Unlike the other GPU backends, I'm not employing the dlopen/dlsym approach in the runtime for getting the API functions. I'm not sure how to make this work when using Emscripten, and since using Dawn directly is only really needed for testing purposes it doesn't seem /too/ onerous to require direct linking, but I'm open to opinions and suggestions here.
IIUC, Dawn is both a native library (for the C++ API) and available integrated into Node.js. Testing with just a native library is easier in Halide's world (and also the only way to test JIT stuff)... I presume we are probably going to need to do at least some testing in the Emscripten/Node world too (unless you say otherwise).
Direct linking to Dawn is indeed something that may be painful to do, for various reasons, but let me actually try it out before I worry any more.
Another pain point right now is that the C++ API for WebGPU is not currently stable between different implementations, so there is a build-time switch to toggle between targeting Emscripten vs Dawn. I'm optimistic that this requirement will eventually go away.
Over a year later, is this still accurate, or has the C++ API settled down?
There's still some patches that need to land in both Emscripten and Dawn before this backend will work for anyone else, so I'll leave this PR as a draft until those are resolved.
Any idea if this is still the case?
When invoking `emcc` to link Halide-generated objects, include these flags: `-s USE_WEBGPU=1 -s ASYNCIFY`.

Is `ASYNCIFY` still necessary?
Building Dawn's Node.js bindings currently requires using CMake.
It looks like both Dawn and Dawn-with-Node require building from source -- i.e., there aren't any prebuilts available, either via download or via a package manager (e.g. Homebrew). Is this still the case?
FYI: I took the liberty of adding 'support' for isinf(), isnan(), and isfinite(), so that Halide code that uses these won't fail to compile outright. (I realize the spec explicitly says that nan/inf may or may not be supported, and AFAICT the current Dawn implementation on Mac definitely does not -- even passing in a buffer prepopulated with NaN values will get normalized into zeros -- but use of these functions is caveat emptor already, in that the caller must know that they are on a target that supports them). LMK your thoughts.
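For reference, a minimal sketch of the kind of Halide code this affects; the use of `Halide::is_nan` and the `Target::WebGPU` feature name here are my assumptions about the front end rather than details from this thread, and whether the result is meaningful depends on the target actually preserving NaNs:

```cpp
#include "Halide.h"
using namespace Halide;

int main() {
    // Mark NaN outputs with a sentinel value. is_nan() only gives useful
    // answers on targets that preserve NaNs; per the WebGPU spec (and the
    // current Dawn behaviour on Mac described above) that is not guaranteed.
    Func f("f");
    Var x("x"), xo("xo"), xi("xi");
    Expr v = sqrt(cast<float>(x) - 4.0f);  // NaN for x < 4 on NaN-preserving targets
    f(x) = select(is_nan(v), -1.0f, v);

    // Assumed feature name for the new backend: Target::WebGPU.
    Target t = get_host_target().with_feature(Target::WebGPU);
    f.gpu_tile(x, xo, xi, 8);

    Buffer<float> out = f.realize({16}, t);
    return 0;
}
```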
As of now, running the tests with Dawn-native (i.e. `host-webgpu`), the only tests failing are:
234 - correctness_multi_way_select (Subprocess aborted)
518 - performance_async_gpu (Failed)
520 - performance_boundary_conditions (Failed)
588 - generator_aot_gpu_multi_context_threaded (SEGFAULT)
589 - generator_aotcpp_gpu_multi_context_threaded (Subprocess aborted)
674 - python_tutorial_lesson_10_aot_compilation_run (Failed)
I'm going to take a quick look to see if any of these are things that have obvious fixes, then move on to testing with the Emscripten setup.
JIT is not currently supported for wasm-32-wasmrt-webgpu.
I assume this is still the case?
Correct.
I presume we are probably going to need to do at least some testing in the Emscripten/Node world too (unless you say otherwise).
I would imagine so. The Dawn Node bindings can be used when targeting `wasm-32-wasmrt-webgpu` with AOT compilation. Do you think that is sufficient? That said, I'm struggling to get the `generator_aot_gpu_only` test to pass with WASM + dawn.node right now - will investigate...
Another pain point right now is that the C++ API for WebGPU is not currently stable between different implementations...
Over a year later, is this still accurate, or has the C++ API settled down?
The API isn't changing much, but there's still a discrepancy between Dawn and Emscripten (see comment in #7248).
There's still some patches that need to land in both Emscripten and Dawn before this backend will work
Any idea if this is still the case?
This should "just work" with ToT Dawn and Emscripten now.
When invoking `emcc` to link Halide-generated objects, include these flags: `-s USE_WEBGPU=1 -s ASYNCIFY`. Is `ASYNCIFY` still necessary?
Yes, for now. If we want to remove this dependency I suspect we may need to expose some asynchronous runtime functions from Halide for host<->device transfers and device init, and then leave the details of yielding to the browser up to the application code.
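To illustrate why `ASYNCIFY` comes into play, here is a rough sketch (not code from this PR) of blocking on an asynchronous WebGPU buffer map under Emscripten; `emscripten_sleep` yields back to the browser's event loop and only works when the module is built with `-s ASYNCIFY`:

```cpp
#include <emscripten/emscripten.h>
#include <webgpu/webgpu.h>

// Block until an async buffer map completes. In the browser the map only
// finishes once control returns to the event loop, so the busy-wait below
// relies on emscripten_sleep(), which requires building with -s ASYNCIFY.
bool map_for_read_blocking(WGPUBuffer buffer, size_t size) {
    struct MapState {
        bool done = false;
        bool ok = false;
    } state;

    wgpuBufferMapAsync(
        buffer, WGPUMapMode_Read, /*offset=*/0, size,
        [](WGPUBufferMapAsyncStatus status, void *userdata) {
            auto *s = static_cast<MapState *>(userdata);
            s->ok = (status == WGPUBufferMapAsyncStatus_Success);
            s->done = true;
        },
        &state);

    while (!state.done) {
        emscripten_sleep(1);  // yield to the browser so the callback can fire
    }
    return state.ok;
}
```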
Building Dawn's Node.js bindings currently requires using CMake.
It looks like both Dawn and Dawn-with-Node require building from source -- i.e., there aren't any prebuilts available, either via download or via a package manager (e.g. Homebrew). Is this still the case?
Correct. I'm not aware of any plans to make prebuilts available, but maybe this is something we can consider once we've shipped V1 in Chrome.
FYI: I took the liberty of adding 'support' for isinf(), isnan(), and isfinite() ... LMK your thoughts.
Thanks! I didn't add them myself since they weren't sufficient to pass the `isnan` correctness test, but what you've done is fine with me.
~~So it looks to me like this backend isn't currently safe to use from multiple threads -- we store device (etc) in a global, but competing threads can overwrite this value. Am I missing something, or is this just an oversight for a first draft?~~
Nevermind -- the test I was looking at (`gpu_multi_context_threaded_aottest`) does a LOT of special-case hackery with the GPU backends for its own purposes, and needs similar attention for WebGPU -- working on a fix.
Please take a look at https://github.com/jrprice/Halide/pull/1 at your convenience.
What's the technical reason that we can't support WebGPU under the JIT?
EDIT: I assume that at least part of the reason is that we'd need to add callback bindings for the wgpu API in WasmExecutor, which would be kind of a pain but theoretically doable... are there additional reasons this might be infeasible?
What's the technical reason that we can't support WebGPU under the JIT?
The WebGPU runtime code requires Emscripten when targeting WASM in order to translate calls to the native WebGPU runtime API into the Javascript API. I think this is a similar requirement to `WasmThreads`.
The WebGPU runtime code requires Emscripten when targeting WASM in order to translate calls to the native WebGPU runtime API into the Javascript API.
Wait, so the WebGPU runtime code relies on Emscripten? Does this mean that (e.g.) Chrome has to have some of Emscripten baked into it to make this work?
I think this is a similar requirement to `WasmThreads`.
The WasmThreads stuff is because Wasm didn't have any real threading support at all, but Emscripten added pthreads wrappers to make it work. Halide didn't want to reinvent that wheel for JIT testing.
Wait, so the WebGPU runtime code relies on Emscripten? Does this mean that (e.g.) Chrome has to have some of Emscripten baked into it to make this work?
No, I mean that the WASM object that Halide generates needs to be compiled with Emscripten, to link the native WebGPU API calls against Emscripten's implementation of WebGPU, which forwards those calls to the Javascript API. Sorry if I'm not explaining this right, I'm not hugely familiar with all this WASM/Emscripten stuff in general.
The WasmThreads stuff is because Wasm didn't have any real threading support at all, but Emscripten added pthreads wrappers to make it work.
I think this is similar to what I'm describing above? You need something to provide an implementation of the WebGPU APIs, and Emscripten is that thing when we target WASM. In theory you could do it manually, as you suggest, though you'd also need to yield control back to the browser for the async stuff (which is what we currently use Emscripten's ASYNCIFY for).
~~OK, now I'm testing the AOT tests using Emscripten/Node (building with HL_TARGET=wasm-32-wasmrt-webgpu and WEBGPU_NODE_BINDINGS set to the right path), and some (but not all) of the tests fail with `ReferenceError: navigator is not defined`. Investigating, but if you've seen this before...~~
EDIT: I am an idiot and wasn't actually setting WEBGPU_NODE_BINDINGS properly. Now I'm failing in other ways :-)
So, running the generator tests under Node, most of them pass, but these three fail:
548 - generator_aot_acquire_release (Failed)
549 - generator_aot_alias (Failed)
565 - generator_aot_gpu_only (Failed)
In all cases, we're getting what looks like garbage in the output buffer -- my first guess would be that something in copy_to_host is broken, but not sure how or why. Thoughts or suggestions welcome.
Right, I'm seeing the same thing for `generator_aot_gpu_only` on my machine (didn't test the other two). This definitely used to pass. I will set aside some time to investigate this early next week (nothing obvious jumps out at me right now).
The `copy_to_host` bug was in Dawn's Node bindings. I've just landed a fix in Dawn here:
https://dawn-review.googlesource.com/c/dawn/+/121820

With this change, `generator_aot_gpu_only` passes again for me, along with `generator_aot_acquire_release` and `generator_aot_alias`.
The `copy_to_host` bug was in Dawn's Node bindings. I've just landed a fix in Dawn here: https://dawn-review.googlesource.com/c/dawn/+/121820 With this change, `generator_aot_gpu_only` passes again for me, along with `generator_aot_acquire_release` and `generator_aot_alias`.
So if I rebuild Dawn at top-of-tree, this should work?
(Related: does Dawn have release version(s) at this point? It would be nice if the README here could say "works as of release tag X")