Add initial support for WebGPU

Open jrprice opened this issue 3 years ago • 11 comments

This adds the core of a new WebGPU backend. There are still many holes in the functionality and rough edges around how this interacts with Emscripten, native WebGPU implementations, and testing. Nevertheless, I'm opening this after offline discussion with @steven-johnson to get some initial feedback on this backend (and to raise awareness that this is happening).

While a lot of the implementation still remains to be done, this PR provides enough to run a 32-bit version of apps/blur, the gpu_only AOT generator test, and the correctness_bounds JIT test. I've also got the HelloWasm app running with the render pipeline targeting the GPU (and the other two pipelines still using WASM CPU).

For testing, the AOT generator tests can be made to work for wasm-32-wasmrt-webgpu when using Node bindings for Dawn (Chromium's native implementation of WebGPU). Both the AOT tests and JIT tests can also work for host-webgpu when using Dawn. JIT is not currently supported for wasm-32-wasmrt-webgpu.

Unlike the other GPU backends, I'm not employing the dlopen/dlsym approach in the runtime for getting the API functions. I'm not sure how to make this work when using Emscripten, and since using Dawn directly is only really needed for testing purposes it doesn't seem /too/ onerous to require direct linking, but I'm open to opinions and suggestions here.

Another pain point right now is that the C++ API for WebGPU is not currently stable between different implementations, so there is a build-time switch to toggle between targeting Emscripten vs Dawn. I'm optimistic that this requirement will eventually go away.
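For readers unfamiliar with this kind of switch, here is a minimal sketch of a preprocessor-based toggle. It is an illustration only: the macro `SKETCH_WEBGPU_EMSCRIPTEN` and the function `webgpu_flavor` are hypothetical names, not the ones used in this PR.

```cpp
#include <string>

// Hypothetical switch (not the PR's actual macro): pass
// -DSKETCH_WEBGPU_EMSCRIPTEN at build time to select the Emscripten
// flavor of the API. Without it, the Dawn flavor is selected.
#if defined(SKETCH_WEBGPU_EMSCRIPTEN)
static const char *const kWebGPUFlavor = "emscripten";
#else
static const char *const kWebGPUFlavor = "dawn";
#endif

// Reports which flavor of the WebGPU C++ API this translation unit
// was built against.
std::string webgpu_flavor() {
    return kWebGPUFlavor;
}
```

In a real runtime the two branches would wrap the API calls that differ between implementations, rather than a string.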

There are still some patches that need to land in both Emscripten and Dawn before this backend will work for anyone else, so I'll leave this PR as a draft until those are resolved.

All feedback is very welcome! I've joined the Gitter room as well.

jrprice avatar Dec 10 '21 19:12 jrprice

(I took the liberty of pacifying clang-tidy and bringing up to date with top-of-tree; please let me know if you'd prefer I avoid doing changes of this sort here in the future, but these seemed harmless)

steven-johnson avatar Dec 15 '21 00:12 steven-johnson

Maybe I missed it here, but could you add something (either in this PR or in a README, or both) about what is needed to configure this for local testing? e.g. how do I install and configure Dawn for use, what test(s) are expected to work or not work, etc? I'd like to pull this locally and try it out but not sure about the details.

steven-johnson avatar Dec 15 '21 19:12 steven-johnson

Maybe I missed it here, but could you add something (either in this PR or in a README, or both) about what is needed to configure this for local testing? e.g. how do I install and configure Dawn for use, what test(s) are expected to work or not work, etc? I'd like to pull this locally and try it out but not sure about the details.

Sorry for the delay - I've added a README_webgpu.md file that should cover this. These are the tests that are currently expected to work:

correctness_bounds
correctness_func_lifetime
correctness_func_lifetime_2
correctness_gpu_give_input_buffers_device_allocations
correctness_gpu_jit_explicit_copy_to_device
correctness_hello_gpu
correctness_loop_invariant_extern_calls
correctness_lots_of_loop_invariants
correctness_parallel_gpu_nested

generator_aot_gpu_only

It's a pretty small set, but I expect that many more will pass without too much additional effort. If you'd like a larger set to pass before landing this PR just let me know.

jrprice avatar Jan 21 '22 01:01 jrprice

Any updates on this? How close are we getting to complete-enough functionality to do real testing?

steven-johnson avatar Mar 31 '22 17:03 steven-johnson

Any updates on this? How close are we getting to complete-enough functionality to do real testing?

This PR is now passing 60/91 of the GPU-enabled correctness tests, and there are 9 that I think are n/a for WebGPU. So 22 actual test failures remain.

One of the main remaining pieces of work here is to handle buffer cropping/slicing and non-contiguous copies. Dynamically-sized GPU tiles and shared memory regions are another source of failures.
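To illustrate why a cropped buffer leads to a non-contiguous copy, here is a minimal host-side sketch. It is not taken from the PR: `Crop2D` and `copy_crop` are hypothetical names, and a real backend would issue device copies rather than `memcpy`.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical description of a 2D crop inside a larger row-major buffer.
struct Crop2D {
    size_t x0, y0;         // top-left corner of the crop, in elements
    size_t width, height;  // extent of the crop, in elements
};

// Copies a crop out of `src` (row stride `src_stride`, in elements) into a
// densely packed destination. Each row of the crop is contiguous, but
// successive rows are not, so the copy is issued one row at a time.
std::vector<uint32_t> copy_crop(const std::vector<uint32_t> &src,
                                size_t src_stride, const Crop2D &c) {
    std::vector<uint32_t> dst(c.width * c.height);
    if (dst.empty()) return dst;  // nothing to copy for an empty crop
    for (size_t y = 0; y < c.height; y++) {
        const uint32_t *row = src.data() + (c.y0 + y) * src_stride + c.x0;
        std::memcpy(dst.data() + y * c.width, row, c.width * sizeof(uint32_t));
    }
    return dst;
}
```

For a 4x3 source holding the values 0..11, cropping a 2x2 region at (1, 1) yields the elements 5, 6, 9, 10.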

I've been working with my colleagues to try to figure out a way to expose functionality that would allow Halide to implement 16-bit and 8-bit integers more efficiently. If the slow 8/16-bit performance isn't considered a blocker for landing this PR though, I can get back to knocking out the remaining test failures in the meantime.

jrprice avatar Mar 31 '22 18:03 jrprice

implement 16-bit and 8-bit integers more efficiently

IIRC, @shoaibkamil / @slomp did something like that for the D3D12 backend, but I don't know the details (or am misremembering). Not sure if it's something that could be recycled and/or shared. Pinging them here for response :-)

steven-johnson avatar Mar 31 '22 18:03 steven-johnson

IIRC, @shoaibkamil / @slomp did something like that for the D3D12 backend, but I don't know the details (or am misremembering). Not sure if it's something that could be recycled and/or shared. Pinging them here for response :-)

Yes, we've been looking closely at how the D3D12 backend handles this stuff (via typed UAV loads/stores). The issue right now is that WebGPU does not currently expose an equivalent feature, so we either need to add such functionality to WebGPU itself, or do something different. Hoping to have a path forward soon though.
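For context on why narrow integers are costly when storage is only addressable as 32-bit words, here is a minimal sketch of the usual pack-and-mask emulation. This is an assumption about the general technique, not a description of what this PR does; `load_u8` and `store_u8` are hypothetical helpers written as host C++ rather than shader code.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Reads the i-th 8-bit element from storage that is only addressable
// as 32-bit words (four elements per word, lowest byte first).
uint8_t load_u8(const std::vector<uint32_t> &words, size_t i) {
    const uint32_t shift = static_cast<uint32_t>(i % 4) * 8;  // byte position in the word
    return static_cast<uint8_t>((words[i / 4] >> shift) & 0xFFu);
}

// Writes the i-th 8-bit element with a read-modify-write of the
// containing word. On a GPU, neighbouring threads would share a word,
// so this would need an atomic operation or a one-thread-per-word
// layout; that cost is ignored in this sketch.
void store_u8(std::vector<uint32_t> &words, size_t i, uint8_t value) {
    const uint32_t shift = static_cast<uint32_t>(i % 4) * 8;
    uint32_t &w = words[i / 4];
    w = (w & ~(0xFFu << shift)) | (static_cast<uint32_t>(value) << shift);
}
```

Every load pays for a shift and a mask, and every store pays for a read-modify-write, which is the overhead that a typed load/store feature would avoid.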

jrprice avatar Mar 31 '22 18:03 jrprice

Any update on this?

steven-johnson avatar Apr 27 '22 16:04 steven-johnson

Any update on this?

Nothing lately, but I expect to be back working on this again soon.

jrprice avatar Apr 29 '22 00:04 jrprice

What's the status on this PR -- is activity likely to be resumed anytime soon?

steven-johnson avatar Sep 13 '22 21:09 steven-johnson

What's the status on this PR -- is activity likely to be resumed anytime soon?

This PR should be ready for review and further testing by others, as it implements the baseline functionality that we had previously agreed on in a call.

On my macOS machine using Dawn native, all correctness tests are passing except:

  • correctness_gpu_dynamic_shared
  • correctness_gpu_non_monotonic_shared_mem_size
  • correctness_gpu_param_allocation
  • correctness_gpu_reuse_shared_memory
  • correctness_gpu_specialize
  • correctness_interpreter
  • correctness_isnan

These failures are caused by dynamically sized shared memory (supported by WebGPU but not yet implemented in Dawn/Chrome), lack of NaN support (discussed previously), and exceeding maximum allocation sizes for device memory and shared memory.

I'm currently on leave until the end of October, and will be able to address merge conflicts / review comments when I return. I'll also then start looking at ways to improve performance for small integer types.

jrprice avatar Sep 15 '22 16:09 jrprice

The WebGPU backend now passes all correctness tests except correctness_isnan on my macOS machine using Dawn native on an Intel GPU and an AMD GPU. ~There are a few failures on AMD due to a Metal compiler bug.~ (EDIT: Resolved the AMD issues)

This PR should now be ready for review and further testing by others. Everything that was previously agreed for this backend's MVP has now been implemented.

jrprice avatar Nov 21 '22 19:11 jrprice

Thanks for the review! I'll address the comments as soon as I can.

What OS(s) should we target for testing? I'm assuming maybe OSX and Linux-x64?

I've been testing locally on macOS (AMD + Intel). I've just given it a whirl on Linux, but I'm getting some test failures. It looks like we have at least one bug in Dawn's Vulkan backend that is causing issues, which I'll try and get sorted.

Is macOS enough coverage to get started with? If not I'll try and get everything working on Linux as soon as I can.

jrprice avatar Dec 07 '22 21:12 jrprice

Nope, OSX should be fine. I assume that both x86 and arm variants should work?

steven-johnson avatar Dec 07 '22 21:12 steven-johnson

I assume that both x86 and arm variants should work?

I haven't got an Arm-based macbook to test on, but I have no reason to believe that it wouldn't work (i.e. Dawn has been tested by others on M1 macbooks and works).

jrprice avatar Dec 08 '22 14:12 jrprice

Hi there -- I'm (finally) looking at getting testing in place, so we can land this, and some of the comments from your original post may or may not be out of date since late 2021:

For testing, the AOT generator tests can be made to work for wasm-32-wasmrt-webgpu when using Node bindings for Dawn. Both the AOT tests and JIT tests can also work for host-webgpu when using Dawn.

JIT is not currently supported for wasm-32-wasmrt-webgpu.

I assume this is still the case?

Unlike the other GPU backends, I'm not employing the dlopen/dlsym approach in the runtime for getting the API functions. I'm not sure how to make this work when using Emscripten, and since using Dawn directly is only really needed for testing purposes it doesn't seem /too/ onerous to require direct linking, but I'm open to opinions and suggestions here.

IIUC, Dawn is both a native library (for the C++ API) but also available integrated into Node.js. Testing with just a native library is easier in Halide's world (and also the only way to test JIT stuff)... I presume we are probably going to need to do at least some testing in the Emscripten/Node world too (unless you say otherwise).

Direct linking to Dawn is indeed something that may be painful to do, for various reasons, but let me actually try it out before I worry any more.

Another pain point right now is that the C++ API for WebGPU is not currently stable between different implementations, so there is a build-time switch to toggle between targeting Emscripten vs Dawn. I'm optimistic that this requirement will eventually go away.

Over a year later, is this still accurate, or has the C++ API settled down?

There's still some patches that need to land in both Emscripten and Dawn before this backend will work for anyone else, so I'll leave this PR as a draft until those are resolved.

Any idea if this is still the case?

When invoking emcc to link Halide-generated objects, include these flags: -s USE_WEBGPU=1 -s ASYNCIFY.

Is ASYNCIFY still necessary?

Building Dawn's Node.js bindings currently requires using CMake.

It looks like both Dawn and Dawn-with-Node require building from source -- i.e., there aren't any prebuilts available, either via download or via a package manager (e.g. Homebrew). Is this still the case?

steven-johnson avatar Feb 22 '23 20:02 steven-johnson

FYI: I took the liberty of adding 'support' for isinf(), isnan(), and isfinite(), so that Halide code that uses these won't fail to compile outright. (I realize the spec explicitly says that nan/inf may or may not be supported, and AFAICT the current Dawn implementation on Mac definitely does not -- even passing in a buffer prepopulated with NaN values will get normalized into zeros -- but use of these functions is caveat emptor already, in that the caller must know that they are on a target that supports them). LMK your thoughts.

steven-johnson avatar Feb 23 '23 18:02 steven-johnson

As of now, running the tests with Dawn-native (ie host-webgpu), the only tests failing are:

	234 - correctness_multi_way_select (Subprocess aborted)
	518 - performance_async_gpu (Failed)
	520 - performance_boundary_conditions (Failed)
	588 - generator_aot_gpu_multi_context_threaded (SEGFAULT)
	589 - generator_aotcpp_gpu_multi_context_threaded (Subprocess aborted)
	674 - python_tutorial_lesson_10_aot_compilation_run (Failed)

I'm going to take a quick look to see if any of these are things that have obvious fixes, then move on to testing with the Emscripten setup.

steven-johnson avatar Feb 23 '23 19:02 steven-johnson

JIT is not currently supported for wasm-32-wasmrt-webgpu.

I assume this is still the case?

Correct.

I presume we are probably going to need to do at least some testing in the Emscripten/Node world too (unless you say otherwise).

I would imagine so. The Dawn Node bindings can be used when targeting wasm-32-wasmrt-webgpu with AOT compilation. Do you think that is sufficient? That said, I'm struggling to get the generator_aot_gpu_only test to pass with WASM + dawn.node right now - will investigate...

Another pain point right now is that the C++ API for WebGPU is not currently stable between different implementations...

Over a year later, is this still accurate, or has the C++ API settled down?

The API isn't changing much, but there's still a discrepancy between Dawn and Emscripten (see comment in #7248).

There's still some patches that need to land in both Emscripten and Dawn before this backend will work

Any idea if this is still the case?

This should "just work" with ToT Dawn and Emscripten now.

When invoking emcc to link Halide-generated objects, include these flags: -s USE_WEBGPU=1 -s ASYNCIFY.

Is ASYNCIFY still necessary?

Yes, for now. If we want to remove this dependency I suspect we may need to expose some asynchronous runtime functions from Halide for host<->device transfers and device init, and then leave the details of yielding to the browser up to the application code.
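To make that idea concrete, here is a rough sketch of what a callback-based transfer could look like. Everything in it is hypothetical: `EventLoop` stands in for the browser's event loop, and `copy_to_host_async` is not an existing Halide runtime function.

```cpp
#include <functional>
#include <queue>
#include <utility>

// Hypothetical stand-in for the browser's event loop: work is queued and
// only runs when the application yields by calling run_pending().
class EventLoop {
public:
    void post(std::function<void()> task) { tasks_.push(std::move(task)); }

    void run_pending() {
        while (!tasks_.empty()) {
            std::function<void()> task = std::move(tasks_.front());
            tasks_.pop();
            task();
        }
    }

private:
    std::queue<std::function<void()>> tasks_;
};

// Hypothetical asynchronous copy: returns immediately and reports
// completion through a callback instead of blocking the caller.
void copy_to_host_async(EventLoop &loop, const int *device, int *host, int n,
                        std::function<void()> on_done) {
    loop.post([=]() {
        for (int i = 0; i < n; i++) host[i] = device[i];
        on_done();
    });
}
```

With an interface of this shape, the application decides when to yield, so the runtime itself would no longer need to suspend and resume its call stack.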

Building Dawn's Node.js bindings currently requires using CMake.

It looks like both Dawn and Dawn-with-Node require building from source -- i.e., there aren't any prebuilts available, either via download or via a package manager (e.g. Homebrew). Is this still the case?

Correct. I'm not aware of any plans to make prebuilts available, but maybe this is something we can consider once we've shipped V1 in Chrome.

FYI: I took the liberty of adding 'support' for isinf(), isnan(), and isfinite() ... LMK your thoughts.

Thanks! I didn't add them myself since they weren't sufficient to pass the isnan correctness test, but what you've done is fine with me.

jrprice avatar Feb 23 '23 20:02 jrprice

~~So it looks to me like this backend isn't currently safe to use from multiple threads -- we store device (etc) in a global, but competing threads can overwrite this value. Am I missing something, or is this just an oversight for a first draft?~~

Never mind -- the test I was looking at (gpu_multi_context_threaded_aottest) does a LOT of special-case hackery with the GPU backends for its own purposes, and needs similar attention for WebGPU -- working on a fix.

steven-johnson avatar Feb 23 '23 22:02 steven-johnson

Please take a look at https://github.com/jrprice/Halide/pull/1 at your convenience.

steven-johnson avatar Feb 24 '23 01:02 steven-johnson

What's the technical reason that we can't support WebGPU under the JIT?

EDIT: I assume that at least part of the reason is that we'd need to add callback bindings for the wgpu API in WasmExecutor, which would be kind of a pain but theoretically doable... are there additional reasons this might be infeasible?

steven-johnson avatar Feb 24 '23 18:02 steven-johnson

What's the technical reason that we can't support WebGPU under the JIT?

The WebGPU runtime code requires Emscripten when targeting WASM in order to translate calls to the native WebGPU runtime API into the JavaScript API. I think this is a similar requirement to WasmThreads.

jrprice avatar Feb 24 '23 19:02 jrprice

The WebGPU runtime code requires Emscripten when targeting WASM in order to translate calls to the native WebGPU runtime API into the Javascript API.

Wait, so the WebGPU runtime code relies on Emscripten? Does this mean that (e.g.) Chrome has to have some of Emscripten baked into it to make this work?

I think this is a similar requirement to WasmThreads.

The WasmThreads stuff is because Wasm didn't have any real threading support at all, but Emscripten added pthreads wrappers to make it work. Halide didn't want to reinvent that wheel for JIT testing.

steven-johnson avatar Feb 24 '23 19:02 steven-johnson

Wait, so the WebGPU runtime code relies on Emscripten? Does this mean that (e.g.) Chrome has to have some of Emscripten baked into it to make this work?

No, I mean that the WASM object that Halide generates needs to be compiled with Emscripten, to link the native WebGPU API calls against Emscripten's implementation of WebGPU, which forwards those calls to the JavaScript API. Sorry if I'm not explaining this right, I'm not hugely familiar with all this WASM/Emscripten stuff in general.

The WasmThreads stuff is because Wasm didn't have any real threading support at all, but Emscripten added pthreads wrappers to make it work.

I think this is similar to what I'm describing above? You need something to provide an implementation of the WebGPU APIs, and Emscripten is that thing when we target WASM. In theory you could do it manually, as you suggest, though you'd also need to yield control back to the browser for the async stuff (which is what we currently use Emscripten's ASYNCIFY for).

jrprice avatar Feb 24 '23 19:02 jrprice

~~OK, now I'm testing the AOT tests using Emscripten/Node (building with HL_TARGET=wasm-32-wasmrt-webgpu and WEBGPU_NODE_BINDINGS set to the right path), and some (but not all) of the tests fail with ReferenceError: navigator is not defined. Investigating, but if you've seen this before...~~

EDIT: I am an idiot and wasn't actually setting WEBGPU_NODE_BINDINGS properly. Now I'm failing in other ways :-)

steven-johnson avatar Feb 24 '23 21:02 steven-johnson

So, running the generator tests under Node, most of them pass, but these three fail:

	548 - generator_aot_acquire_release (Failed)
	549 - generator_aot_alias (Failed)
	565 - generator_aot_gpu_only (Failed)

In all cases, we're getting what looks like garbage in the output buffer -- my first guess would be that something in copy_to_host is broken, but not sure how or why. Thoughts or suggestions welcome.

steven-johnson avatar Feb 24 '23 22:02 steven-johnson

Right, I'm seeing the same thing for generator_aot_gpu_only on my machine (didn't test the other two). This definitely used to pass. I will set aside some time to investigate this early next week (nothing obvious jumps out at me right now).

jrprice avatar Feb 24 '23 23:02 jrprice

The copy_to_host bug was in Dawn's Node bindings. I've just landed a fix in Dawn here: https://dawn-review.googlesource.com/c/dawn/+/121820

With this change, generator_aot_gpu_only passes again for me, along with generator_aot_acquire_release and generator_aot_alias.

jrprice avatar Feb 27 '23 21:02 jrprice

The copy_to_host bug was in Dawn's Node bindings. I've just landed a fix in Dawn here: https://dawn-review.googlesource.com/c/dawn/+/121820

With this change, generator_aot_gpu_only passes again for me, along with generator_aot_acquire_release and generator_aot_alias.

So if I rebuild Dawn at top-of-tree, this should work?

(Related: does Dawn have release version(s) at this point? It would be nice if the README here could say "works as of release tag X")

steven-johnson avatar Feb 28 '23 17:02 steven-johnson