
Use cblas_cgemm for CPU complex matmul

Open · barronalex opened this issue 8 months ago

This should be faster than the current op-based approach.

https://developer.apple.com/documentation/accelerate/1513288-cblas_cgemm?language=objc
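For reference, here is a minimal standalone sketch of a direct cblas_cgemm call (not MLX code; the matrix values are arbitrary). The complex BLAS variants take alpha, beta, and the matrix data as void* pointers to interleaved (real, imag) float pairs, which is layout-compatible with std::complex<float>:

```cpp
#include <Accelerate/Accelerate.h>
#include <complex>
#include <vector>

int main() {
  const int M = 2, N = 2, K = 2;
  // Row-major M x K and K x N matrices of interleaved single-precision
  // complex values, matching the layout cblas_cgemm expects.
  std::vector<std::complex<float>> A = {{1, 1}, {2, 0}, {0, -1}, {3, 2}};
  std::vector<std::complex<float>> B = {{1, 0}, {0, 1}, {1, 1}, {2, -1}};
  std::vector<std::complex<float>> C(M * N);

  // Unlike the real-valued sgemm, alpha and beta are passed by pointer.
  std::complex<float> alpha{1, 0}, beta{0, 0};
  cblas_cgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, M, N, K,
              &alpha, A.data(), /* lda */ K, B.data(), /* ldb */ N,
              &beta, C.data(), /* ldc */ N);
  return 0;
}
```

(Compile with `clang++ example.cpp -framework Accelerate`.)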

barronalex avatar Apr 14 '25 17:04 barronalex

Hi! I'm interested in working on this enhancement.

From what I understand, the current CPU complex matmul is implemented via a custom op-based approach, and this issue suggests switching to cblas_cgemm from the Accelerate framework to improve performance.

I'll start by identifying where complex matrix multiplication is currently handled in the codebase (likely under mlx/ops or similar). Then I plan to replace the relevant path for CPU execution with a call to cblas_cgemm, ensuring the data layout matches what the function expects.

Does that sound correct? Let me know if there's anything specific I should be aware of before proceeding.

charan-003 avatar Apr 29 '25 13:04 charan-003

I think the problem here is that we want to avoid dispatching differently based on the device. My preference would be to add a complex matmul and gemv for Metal, and then once we have that we can run the complex op inside the primitive itself for both the CPU and the GPU.

If you are interested in working on the implementation of the complex matmul for Metal, that would be a good place to start. But it requires some knowledge of Metal / MLX internals.

awni avatar Apr 29 '25 13:04 awni

Thanks for the clarification; that makes a lot of sense!

I'm very interested in learning how MLX's Metal backend works. While I'm new to Metal and the MLX internals, I'm eager to dive in and would really appreciate your guidance on getting started.

Specifically, it would be helpful to know:

  • Where the current Metal matmul implementation is located
  • How new Metal primitives are typically added or structured in the codebase
  • Any tips or resources for testing and debugging Metal-based ops within MLX

If there are any internal docs or examples I should look at first, I'd be grateful for any pointers. I'm excited to dig into this!

charan-003 avatar Apr 29 '25 13:04 charan-003

Hi @awni

I’ve finished the cblas_cgemm integration and am currently stuck on the Metal GPU part. I implemented a complex64_t-specialized BlockMMA (with four MMAs inside) to make it easy to integrate across GEMM variants, but it’s about 30% slower than the current ops-based dispatch. My guess is that BlockMMA is already highly optimized, and the extra intermediate matrices are creating register/L1 pressure. I’m seeing ~100% L1 cache eviction and only ~40% F32 ALU utilization.
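(Editorial sketch, not from the thread: the four MMAs presumably correspond to the textbook four-real-multiply form of a complex product. In scalar form, a specialization along these lines would accumulate one MMA per term:)

```cpp
#include <complex>

// Textbook four-real-multiply complex product:
//   C_r = A_r*B_r - A_i*B_i
//   C_i = A_r*B_i + A_i*B_r
std::complex<float> cmul4(std::complex<float> a, std::complex<float> b) {
  float cr = a.real() * b.real() - a.imag() * b.imag();
  float ci = a.real() * b.imag() + a.imag() * b.real();
  return {cr, ci};
}
```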

I’m unsure whether I should:

  • implement the higher-level dispatch in Matmul and keep the three-MMA + combination approach, or
  • keep pushing on the complex-specialized BlockMMA and try to optimize it.

Do you have guidance on which direction is preferred for MLX? And if the BlockMMA route makes sense, are there recommended tile shapes, fragment sizes, or load/store patterns for complex data on Metal to reduce register/L1 pressure?

Happy to share profiling traces, kernel configs, and test cases if helpful. Thanks!

CC-Yeh avatar Sep 03 '25 22:09 CC-Yeh

The ops-based dispatch is only three matmuls? Could you do the kernel implementation with just three, like we do in the ops?
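(Editorial sketch: the three-multiplication form referred to here is presumably the classic Gauss/Karatsuba rewrite; whether the ops dispatch uses exactly this variant is an assumption. In scalar form:)

```cpp
#include <complex>

// Gauss/Karatsuba-style rewrite: three real multiplies instead of four,
// at the cost of extra additions (and, at tile scale, extra intermediates).
std::complex<float> cmul3(std::complex<float> a, std::complex<float> b) {
  float t1 = a.real() * b.real();
  float t2 = a.imag() * b.imag();
  float t3 = (a.real() + a.imag()) * (b.real() + b.imag());
  return {t1 - t2, t3 - t1 - t2}; // real = t1 - t2, imag = t3 - t1 - t2
}
```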

> My guess is that BlockMMA is already highly optimized, and the extra intermediate matrices are creating register/L1 pressure.

Seems plausible. And it would probably help to tune the tile sizes etc.

> Do you have guidance on which direction is preferred for MLX?

I would say the preferred approach is the faster / more efficient one, which I would expect to be the low-level implementation. If you are up for seeing whether you can make it work, that would be great!

awni avatar Sep 04 '25 05:09 awni