
[feature request] MLX is Exciting! Will it support writing Metal kernel through Python in the future?

Open Crear12 opened this issue 2 years ago • 5 comments

Hi, it's so exciting that Apple has released its own MLX! The performance boost is impressive. However, my workflow includes writing some CUDA kernels using Numba CUDA, and I'm wondering if MLX will one day support something similar. Low-level operations like user-defined GPU kernel functions are important to researchers. To be honest, Metal is not a popular framework among researchers, and neither is Swift/Xcode. If MLX can bring low-level kernel capabilities to Python, as Numba's CUDA JIT did, then we can do more research on a Mac and can confidently tell others that getting a MacBook Pro doesn't mean you cannot demo research models on your laptop.

I know this may not be a priority for the Apple team yet, but hopefully it can happen one day. Numba CUDA is developed by the Nvidia team, and it allows many non-C++ researchers to accelerate their models using the parallel computing power of the GPU.

Thanks
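
For context, this is roughly the programming model the request is about: a kernel is a Python function parameterized by a thread index. The sketch below is a pure-Python emulation so it runs without a GPU or Numba installed; `vector_add` and the serial `launch` helper are illustrative stand-ins, not Numba's API (with Numba, the kernel would be decorated with `@cuda.jit` and launched as `kernel[blocks, threads](...)`).

```python
# Emulation of the Numba-CUDA-style kernel model: each "thread" runs
# the same function with a different thread index.

def vector_add(tid, x, y, out):
    # Kernel body: each thread handles one element, with a bounds check.
    if tid < len(out):
        out[tid] = x[tid] + y[tid]

def launch(kernel, n_threads, *args):
    # Serial stand-in for a GPU grid launch.
    for tid in range(n_threads):
        kernel(tid, *args)

x = [1.0, 2.0, 3.0]
y = [10.0, 20.0, 30.0]
out = [0.0] * 3
launch(vector_add, 4, x, y, out)  # the extra thread exercises the bounds check
print(out)  # [11.0, 22.0, 33.0]
```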

Crear12 avatar Dec 14 '23 02:12 Crear12

I think that would be quite nice! I'll label it as a feature request.

angeloskath avatar Dec 14 '23 07:12 angeloskath

Some related information:

  1. How to write Numba's GPU Kernel: https://numba.pydata.org/numba-doc/latest/cuda/kernels.html

  2. What blocks Numba from supporting Metal: https://github.com/numba/numba/issues/5706

From my understanding, Numba speeds up Python code through JIT (just-in-time) compilation, using LLVM to generate machine code (with performance comparable to C++). When it allows writing GPU kernels in Python, it calls the corresponding LLVM backend to generate GPU code. It can't support Metal because there is no straightforward way to hook a Metal target into that LLVM pipeline on Apple Silicon.

Crear12 avatar Dec 14 '23 14:12 Crear12

Here's a use case where MLX could potentially outperform CUDA thanks to low-latency unified memory:

I wrote a simulation for my safety-related research using Numba CUDA, running on an Intel 13900KF and an Nvidia 3090. The GPU part takes 181–334 ms to copy the data into GPU memory and only 89–91 ms to process it. If the kernel could be rewritten in MLX, exploiting the low-latency advantage of unified memory, it could achieve up to a 4× performance boost. A difference of 500 ms versus 100 ms is significant in safety-critical work.
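
As a back-of-the-envelope check on those numbers (an idealized sketch that assumes unified memory removes the copy cost entirely and leaves compute time unchanged):

```python
# Rough arithmetic behind the "up to 4x" estimate: on a discrete GPU the
# total latency is copy + compute; with unified memory, ideally only the
# compute portion remains.

copy_ms = (181, 334)    # measured host-to-device copy range
compute_ms = (89, 91)   # measured kernel execution range

total_ms = (copy_ms[0] + compute_ms[0], copy_ms[1] + compute_ms[1])

# Speedup if the copy disappeared entirely (best/worst pairings).
low = total_ms[0] / compute_ms[1]   # 270 / 91
high = total_ms[1] / compute_ms[0]  # 425 / 89
print(total_ms)                       # (270, 425)
print(round(low, 1), round(high, 1))  # 3.0 4.8
```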

Crear12 avatar Jan 08 '24 00:01 Crear12

I'm not a DL-compiler expert, but I think OpenAI's Triton might be an alternative. In short, it's a DSL embedded in Python that lets you write CUDA-level kernels, with performance comparable to CUDA code written using CUTLASS.

Some related information:

  1. Looks like PyTorch is already integrating Triton: https://github.com/pytorch/pytorch/blob/main/torch/_inductor/codegen/triton.py. From my understanding of https://dev-discuss.pytorch.org/t/torchinductor-a-pytorch-native-compiler-with-define-by-run-ir-and-symbolic-shapes/747#torchinductor-design-1, TorchInductor is the new PyTorch backend introduced in PyTorch 2.0, and it supports Nvidia GPUs via Triton.
  2. Triton now uses MLIR, which means that if we want to support a Metal backend, there should be things we can reuse at the higher IR levels; what we'd need to figure out is how to lower to Metal. This might also benefit MLX if we want to support computation-graph tracing & compilation with optimizations, as in the latest PyTorch.
  3. Triton should be able to support GPUs other than Nvidia's in the future, as mentioned in https://github.com/openai/triton/blob/main/docs/meetups/07-18-2023/notes.md. From what I know, Intel and AMD are working on this, so I'd guess Metal might be feasible too.

Nevertheless, I'm not really an expert and my knowledge is limited, but I look forward to any progress on this topic.
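
To illustrate what makes Triton's model different: kernels operate on blocks of elements per program instance, with masked loads/stores at the boundary. The sketch below is a pure-Python emulation so it runs without a GPU; `pid`, `BLOCK_SIZE`, and the masked loop mimic Triton concepts (`tl.program_id`, masked `tl.load`/`tl.store`) but are not Triton's API.

```python
# Emulation of Triton's block-level programming model: each program
# instance processes one contiguous block, masking out-of-bounds lanes.

BLOCK_SIZE = 4

def add_kernel(pid, x, y, out, n):
    # Each program instance handles one block of BLOCK_SIZE elements.
    start = pid * BLOCK_SIZE
    for i in range(start, start + BLOCK_SIZE):
        if i < n:  # mask, as tl.load/tl.store masks do at the boundary
            out[i] = x[i] + y[i]

n = 6
x = list(range(n))          # [0, 1, 2, 3, 4, 5]
y = [10] * n
out = [0] * n
grid = -(-n // BLOCK_SIZE)  # ceil-division: number of program instances
for pid in range(grid):
    add_kernel(pid, x, y, out, n)
print(out)  # [10, 11, 12, 13, 14, 15]
```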

PRESIDENT810 avatar Jan 11 '24 23:01 PRESIDENT810

I'm quite curious whether MLX might be used as a backend for Modular's Mojo🔥 (which will be open-sourced) or their recently announced MAX platform.

altaic avatar Jan 12 '24 00:01 altaic