[Experiment] WebGPU backend
This PR adds an experimental WebGPU backend that only supports binary ops. It is not aimed to be merged; it is only meant to show that a WebGPU backend is possible.
The actual shaders and WebGPU API calls are put in a separate project: https://github.com/frost-beta/betann.
To build:
```
cmake . -Bbuild -DMLX_BUILD_WEBGPU=ON -DMLX_BUILD_EXAMPLES=ON
cmake --build build -j 16
```
Run the example:
```
$ ./build/examples/cpp/tutorial
array([5, 7, 9], dtype=float32)
```
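For reference, the output above corresponds to a binary op along these lines (a minimal sketch using the `mlx::core` C++ API, not the actual `examples/cpp/tutorial.cpp` source):

```cpp
// Sketch only: a binary op of the kind this backend supports.
#include <iostream>
#include "mlx/mlx.h"

using namespace mlx::core;

int main() {
  array a({1.0f, 2.0f, 3.0f});
  array b({4.0f, 5.0f, 6.0f});
  auto c = add(a, b); // binary op handled by the WebGPU backend
  std::cout << c << std::endl; // array([5, 7, 9], dtype=float32)
  return 0;
}
```

With MLX's lazy evaluation, printing `c` is what actually triggers the dispatch to the backend.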
@zcbenz that's extremely cool. I'm supportive of exploring the addition of WebGPU as a back-end for MLX.
One initial comment: it would be nice to avoid breaking the "unified memory" programming model. So instead of changing the array API, it might make more sense to change the WebGPU-specific allocator (and maybe kernels) to have a `Buffer` which can (but doesn't have to) hold both a CPU buffer and a GPU buffer. Then the buffer can manage if/when it needs to make a copy based on whether you request the CPU or GPU pointer.
This actually fits pretty well with our notion of `Buffer` already, which has a `raw_ptr()` method to get the CPU pointer (that could do the copy if needed). Or we could modify that API a little to make it more explicit.
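Roughly the shape I am imagining (a sketch only; `DualBuffer` and the copy steps are made-up names, not existing MLX or WebGPU API):

```cpp
// Sketch of a WebGPU-side buffer that lazily mirrors data between host and
// device. None of these names exist in MLX; this only illustrates the idea
// that the allocator/buffer decides when to copy, not the array API.
#include <cstddef>
#include <cstdlib>

struct DualBuffer {
  void* cpu_data = nullptr;   // host copy, allocated on demand
  void* gpu_buffer = nullptr; // would be a wgpu::Buffer handle in practice
  bool cpu_valid = false;
  bool gpu_valid = false;
  std::size_t size = 0;

  // CPU pointer; refreshes the host copy from the GPU if it is stale.
  void* raw_ptr() {
    if (!cpu_valid) {
      if (!cpu_data) cpu_data = std::malloc(size);
      // GPU -> CPU readback would go here (elided in this sketch).
      cpu_valid = true;
    }
    return cpu_data;
  }

  // GPU handle; uploads the host copy to the device if it is stale.
  void* gpu_ptr() {
    if (!gpu_valid) {
      // CPU -> GPU upload would go here (elided in this sketch).
      gpu_valid = true;
    }
    return gpu_buffer;
  }
};
```

The point is that the copy policy lives entirely inside the allocator/buffer, so `array` and the rest of the core API stay unchanged.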
I'm also very curious to know if there are any other major internal API changes needed, or if it mostly just plugs in without much difficulty.
Thanks for your support!
What do you think about adding a "null" backend upstream that mimics a general GPU backend by copying data to a separate buffer and then just calling `eval_cpu`? That way we can explore what changes are actually needed in the internal APIs without checking in any WebGPU code, and I can incrementally make changes while getting more familiar with WebGPU.
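As a sketch of what I mean (all names here are made up, and the `eval_cpu(inputs, out)` call assumes the single-output `UnaryPrimitive` signature):

```cpp
// Sketch only: how a "null" backend could evaluate a primitive by staging
// data into its own buffers and falling back to the CPU implementation.
#include "mlx/allocator.h"
#include "mlx/primitives.h"

namespace mlx::core {

// Hypothetical helpers standing in for real upload/download calls, e.g. a
// memcpy into a buffer owned by the null backend.
void stage_to_device(const array& a);
void stage_to_host(array& a);

void null_eval(
    UnaryPrimitive& prim,
    const std::vector<array>& inputs,
    array& out) {
  // 1. Pretend to upload the inputs to the "device".
  for (const auto& in : inputs) {
    stage_to_device(in);
  }
  // 2. Allocate the output through the allocator, exercising the same
  //    internal APIs a real GPU backend would need.
  out.set_data(allocator::malloc_or_wait(out.nbytes()));
  // 3. Do the actual math with the existing CPU kernel.
  prim.eval_cpu(inputs, out);
  // 4. Pretend to download the result so callers see valid host data.
  stage_to_host(out);
}

} // namespace mlx::core
```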
> What do you think about adding a "null" backend upstream that mimics a general GPU backend by copying data to a separate buffer and then just calling `eval_cpu`?
If you share a PR for what you mean, it might be easier to say whether this is something we could include upstream. I'm not sure we necessarily need to merge it though; it seems like it could be OK to just have a fork/branch of this for now until we converge a bit on what's useful there.
@awni I have updated the code to use a custom allocator that creates data holding both CPU and GPU buffers. Can you do a simple review?
There are also two allocator design decisions that I need your help with:

- Once the data has been copied from GPU to CPU, the array's `data_ptr` needs to be updated to point to the CPU data. I added a simple API to reset it; is there a better way to do that?
- Most existing code uses `malloc_or_wait` to allocate memory for kernels. In the WebGPU backend we need to explicitly specify the device on which to allocate the memory, which means we cannot reuse existing utilities like `set_binary_op_output_data` or `broadcast`. Can we add a `device` parameter to `malloc_or_wait`? There is no performance penalty, and otherwise we have to duplicate a lot of code just to replace `malloc_or_wait` with `gpu_malloc`, like the `set_binary_op_output_gpu_data` function in this PR. A rough sketch of what I mean follows below.
> Can you do a simple review?
Sorry for the delay. I will take a look early this week and think about your other comments as well.