[Experiment] WebGPU backend
This PR adds an experimental WebGPU backend that only supports binary ops. It is not aimed to be merged; it is only meant to show that a WebGPU backend is possible.
The actual shaders and WebGPU API calls are put in a separate project: https://github.com/frost-beta/betann.
To build:
```
cmake . -Bbuild -DMLX_BUILD_WEBGPU=ON -DMLX_BUILD_EXAMPLES=ON
cmake --build build -j 16
```
Run the example:
```
$ ./build/examples/cpp/tutorial
array([5, 7, 9], dtype=float32)
```
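For reference, the output above corresponds to a binary op along these lines (a minimal sketch using the `mlx::core` C++ API, not the actual `examples/cpp/tutorial.cpp` source):

```cpp
// Sketch only: a binary op of the kind this backend supports.
#include <iostream>
#include "mlx/mlx.h"

using namespace mlx::core;

int main() {
  array a({1.0f, 2.0f, 3.0f});
  array b({4.0f, 5.0f, 6.0f});
  auto c = add(a, b); // binary op handled by the WebGPU backend
  std::cout << c << std::endl; // array([5, 7, 9], dtype=float32)
  return 0;
}
```

With MLX's lazy evaluation, printing `c` is what actually triggers the dispatch to the backend.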
@zcbenz that's extremely cool. I'm supportive of exploring the addition of WebGPU as a back-end for MLX.
One initial comment: it would be nice to avoid breaking the "unified memory" programming model. So instead of changing the array API, it might make more sense to change the WebGPU-specific allocator (and maybe kernels) to have a `Buffer` which can (but doesn't have to) hold both a CPU buffer and a GPU buffer. Then the buffer can manage if/when it needs to make a copy based on whether you request the CPU or GPU pointer.
This actually fits pretty well with our notion of `Buffer` already, which has a `raw_ptr()` method to get the CPU pointer (that could do the copy if needed). Or we could modify that API a little to make it more explicit.
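Roughly the shape I am imagining (a sketch only; `DualBuffer` and the copy steps are made-up names, not existing MLX or WebGPU API):

```cpp
// Sketch of a WebGPU-side buffer that lazily mirrors data between host and
// device. None of these names exist in MLX; this only illustrates the idea
// that the allocator/buffer decides when to copy, not the array API.
#include <cstddef>
#include <cstdlib>

struct DualBuffer {
  void* cpu_data = nullptr;   // host copy, allocated on demand
  void* gpu_buffer = nullptr; // would be a wgpu::Buffer handle in practice
  bool cpu_valid = false;
  bool gpu_valid = false;
  std::size_t size = 0;

  // CPU pointer; refreshes the host copy from the GPU if it is stale.
  void* raw_ptr() {
    if (!cpu_valid) {
      if (!cpu_data) cpu_data = std::malloc(size);
      // GPU -> CPU readback would go here (elided in this sketch).
      cpu_valid = true;
    }
    return cpu_data;
  }

  // GPU handle; uploads the host copy to the device if it is stale.
  void* gpu_ptr() {
    if (!gpu_valid) {
      // CPU -> GPU upload would go here (elided in this sketch).
      gpu_valid = true;
    }
    return gpu_buffer;
  }
};
```

The point is that the copy policy lives entirely inside the allocator/buffer, so `array` and the rest of the core API stay unchanged.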
I'm also very curious to know if there are any other major internal API changes needed, or if it mostly just plugs in without much difficulty.
Thanks for your support!
What do you think about adding a "null" backend upstream that mimics a general GPU backend by copying data to a separate buffer and then just calling `eval_cpu`? That way we can explore what changes are actually needed in the internal APIs without checking in any WebGPU code, and I can incrementally make changes while getting more familiar with WebGPU.
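As a sketch of what I mean (all names here are made up, and the `eval_cpu(inputs, out)` call assumes the single-output `UnaryPrimitive` signature):

```cpp
// Sketch only: how a "null" backend could evaluate a primitive by staging
// data into its own buffers and falling back to the CPU implementation.
#include "mlx/allocator.h"
#include "mlx/primitives.h"

namespace mlx::core {

// Hypothetical helpers standing in for real upload/download calls, e.g. a
// memcpy into a buffer owned by the null backend.
void stage_to_device(const array& a);
void stage_to_host(array& a);

void null_eval(
    UnaryPrimitive& prim,
    const std::vector<array>& inputs,
    array& out) {
  // 1. Pretend to upload the inputs to the "device".
  for (const auto& in : inputs) {
    stage_to_device(in);
  }
  // 2. Allocate the output through the allocator, exercising the same
  //    internal APIs a real GPU backend would need.
  out.set_data(allocator::malloc_or_wait(out.nbytes()));
  // 3. Do the actual math with the existing CPU kernel.
  prim.eval_cpu(inputs, out);
  // 4. Pretend to download the result so callers see valid host data.
  stage_to_host(out);
}

} // namespace mlx::core
```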
> What do you think about adding a "null" backend upstream that mimics a general GPU backend by copying data to a separate buffer and then just calling `eval_cpu`?
If you share a PR for what you mean, it might be easier to say whether this is something we could include upstream. I'm not sure we necessarily need to merge it though; it seems like it could be OK to just have a fork/branch of this for now until we converge a bit on what's useful there.
@awni I have updated the code to use a custom allocator that creates data holding both CPU and GPU buffers. Can you do a simple review?
There are also two allocator design decisions that I need your help with:

- Once the data has been copied from GPU to CPU, the array's `data_ptr` needs to be updated to point to the CPU data. I added a simple API to reset it; is there a better way to do that?
- Most existing code uses `malloc_or_wait` to allocate memory for kernels. In the WebGPU backend we need to explicitly specify the device on which to allocate the memory, which means we cannot reuse existing utilities like `set_binary_op_output_data` or `broadcast`. Can we add a `device` parameter to `malloc_or_wait`? There is no performance penalty, and otherwise we have to duplicate a lot of code just to replace `malloc_or_wait` with `gpu_malloc`, like the `set_binary_op_output_gpu_data` function in this PR. A rough sketch of what I mean follows below.
> Can you do a simple review?
Sorry for the delay. I will take a look early this week and think about your other comments as well.