[BUG] Incorrect results when running mx.fast.metal_kernel with an M1 Max processor, but correct results when running with an M3 Max
Describe the bug
The exact same kernel produces different results with mx.fast.metal_kernel when run on an M1 Max processor than when run on an M3 Max.
Using our old py-metal-compute implementation as the gold standard, only the execution on the M3 Max produces the correct results.
To Reproduce
We created the repository https://github.com/acoreas/VWE_MLX/ to show the issue. It uses a relatively complex kernel that solves the viscoelastic equation.
- Create a conda environment with the provided YAML file.
- Run the Notebook BabelViscoFDTD - MLX vs Metal Compute.ipynb.
When running on an M1 Max processor (we haven't tested other M1 variants or the M2 family), with macOS and Xcode updated to the latest versions, the comparison with py-metal-compute fails, as does the comparison against the results obtained on an M3 Max processor. When running on the M3 Max, the MLX and py-metal-compute results are identical.
Expected behavior
The output of the kernel execution should be the same when running with different Apple Silicon processors.
Desktop (please complete the following information):
- macOS 15.3.1
- MLX 0.22.0 - the issue is also present with the latest version, 0.25.2
- Systems: M1 Max MacBook Pro with 64 GB and M3 Max MacBook Pro with 128 GB
Additional context
We have been struggling to get this kernel to work in MLX for a while. This is the last kernel we need to port from py-metal-compute for our planning software; all the other kernels have been ported successfully. Development has been conducted primarily on an M1 Max system, and we began experiencing issues with this latest kernel. We prepared this repository to highlight the issue, which we first believed would be the same on any Apple Silicon system. However, during a separate test before submitting this bug report, we were surprised to find that the kernel works as expected on an M3 Max system, which makes the bug a bit more serious if it is indeed processor-dependent.
I did a quick scan of the notebook and nothing jumped out. Could you maybe run with the METAL_DEVICE_WRAPPER_TYPE=1 METAL_DEBUG_ERROR_MODE=0 environment variables set to see if an error is reported when run on the M1?
We tried running the notebook with those environment variables set and the kernel started dying, but we didn't get any useful information from it. So we created a Python script version of the notebook (now in the GitHub repo as well) and ran it again. The script also fails (only when those env vars are set), but we get the following message:
This only happens on the M1; the M3 works with or without the flags enabled.
Yep, ok, the grid needs to be a different size for the M1 vs the M3. Basically, this means that you can't run 1024 threads per threadgroup on the M1.
This isn't a bug per se, but I think we should add a check in mx.fast.metal_kernel so that the behavior is friendlier to the user instead of failing silently.
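For reference, here is a minimal sketch of where the threadgroup size enters when calling mx.fast.metal_kernel. The kernel body, names, and sizes are illustrative only (this is not the VWE kernel); the point is just that threadgroup is the value that has to shrink on the M1:

```python
import mlx.core as mx

# Toy elementwise kernel; the real VWE kernel is far larger and uses many more
# registers, which is what lowers its effective per-threadgroup limit.
source = """
    uint elem = thread_position_in_grid.x;
    out[elem] = 2.0f * inp[elem];
"""

kernel = mx.fast.metal_kernel(
    name="scale_by_two",
    input_names=["inp"],
    output_names=["out"],
    source=source,
)

a = mx.arange(4096, dtype=mx.float32)
out = kernel(
    inputs=[a],
    grid=(a.size, 1, 1),
    # 1024 threads per threadgroup worked for the VWE kernel on the M3 Max,
    # but the M1 Max reported an effective limit below that (896), so a
    # smaller value has to be used there.
    threadgroup=(256, 1, 1),
    output_shapes=[a.shape],
    output_dtypes=[a.dtype],
)[0]
mx.eval(out)
```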
Thanks a lot for the help.
Is there a table somewhere that we could use to check these hardware limitations? Or maybe a function that returns the maximum acceptable threadgroup size?
As a follow-up comment, I just noticed that in the official Apple documentation of the Metal features per generation family (https://developer.apple.com/metal/Metal-Feature-Set-Tables.pdf), it indicates that the maximum thread group size is 1024 for all GPUs from Apple7 to Apple9 (which covers from M1 to M4). It is curious, then, that an error was not triggered for the M3 Max. Anyway, on our end, we can be careful not to exceed the 1024 limit, as that covers all GPUs.
What is a bit weird is that the error says the limit should be less than 896 for the M1 Max. So where does that number come from? Does the size of the kernel affect it? If so, how can we calculate that limit?
Actually... now connecting the dots... this is highly related to our previous PR suggestion (#1833) that a flag to calculate the threadgroup size using the formula recommended in the Apple documentation could be beneficial. The conversation in that PR concluded that a limit of 1024 would be enough, but as we can see in this case, it is not. So a flag to calculate the group size would be super useful for very large kernels. This is likely why the py-metal-compute implementation doesn't have this issue, as the group size there is always calculated using the recommended formula.
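For reference, here is that recommended formula written out as a small helper. The two inputs (threadExecutionWidth and maxTotalThreadsPerThreadgroup) come from the compiled compute pipeline state, which MLX does not currently expose; the concrete numbers below are assumed for illustration, using the limits mentioned in this issue:

```python
def recommended_threadgroup(thread_execution_width, max_total_threads_per_threadgroup):
    """Apple's recommended 2D threadgroup shape:
    width  = threadExecutionWidth
    height = maxTotalThreadsPerThreadgroup // threadExecutionWidth
    """
    w = thread_execution_width
    h = max_total_threads_per_threadgroup // w
    return (w, h, 1)

# Assumed example values: a SIMD width of 32 and the per-kernel limits seen in
# this issue (896 on the M1 Max for this kernel, 1024 on the M3 Max).
print(recommended_threadgroup(32, 896))   # (32, 28, 1)
print(recommended_threadgroup(32, 1024))  # (32, 32, 1)
```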
Yes exactly it is the same topic.
As I said before, the grid unfortunately has to be tuned; it is not as simple as setting it to the maximum value. 896 may not be the fastest for the M1, just as 1024 may not be the fastest for the M3.
For this particular case I would most likely set it to a smaller value and check whether there is a slowdown (or speedup). There is probably register pressure, which is why the threadgroup size can't be at the maximum on the M1.
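For example, a quick way to compare candidate sizes is just to time the call for each one. A rough sketch, assuming you already have the compiled kernel and its arguments at hand (all names here are placeholders):

```python
import time
import mlx.core as mx

def time_threadgroup(kernel, inputs, grid, threadgroup,
                     output_shapes, output_dtypes, reps=20):
    """Rough per-call timing for one candidate threadgroup size."""
    def run():
        return kernel(inputs=inputs, grid=grid, threadgroup=threadgroup,
                      output_shapes=output_shapes,
                      output_dtypes=output_dtypes)

    mx.eval(run())  # warm-up: the first call compiles the kernel
    start = time.perf_counter()
    for _ in range(reps):
        mx.eval(run())  # force execution each iteration (MLX is lazy)
    return (time.perf_counter() - start) / reps

# Example usage (kernel/inputs/grid/etc. are whatever you already have):
# for tg in [(256, 1, 1), (512, 1, 1), (896, 1, 1)]:
#     print(tg, time_threadgroup(kernel, inputs, grid, tg,
#                                output_shapes, output_dtypes))
```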
I am reluctant to simply implement an option to set it to the maximum possible, because when it fails you should probably take a look at the threadgroup dims anyway. I think maybe there is an opportunity for a check and an error report though. 🤔
Could MLX trigger an error when the threadgroup size is above the limit? We could at least use that to catch the failure, programmatically reduce the group size, and retry, along the lines of the sketch below.
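This is only a sketch of what we have in mind, assuming the failed launch surfaces as a catchable Python exception; the halving strategy is just one possible policy:

```python
import mlx.core as mx

def launch_with_fallback(kernel, inputs, grid, threadgroup,
                         output_shapes, output_dtypes):
    """Hypothetical helper: try the requested threadgroup; on failure, halve
    its x dimension and retry until the launch succeeds."""
    tg = list(threadgroup)
    while tg[0] >= 1:
        try:
            outs = kernel(inputs=inputs, grid=grid, threadgroup=tuple(tg),
                          output_shapes=output_shapes,
                          output_dtypes=output_dtypes)
            mx.eval(outs)  # force execution so a bad launch fails here
            return outs, tuple(tg)
        except Exception:
            tg[0] //= 2  # e.g. 1024 -> 512 -> 256 -> ...
    raise RuntimeError("Could not find a workable threadgroup size")
```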
That said, I'm not convinced that using the recommended formula fails to deliver decent GPU performance. It may certainly not be optimal, but we can't afford to explore all combinations, as the domain sizes change dramatically between simulation cases, so we need a group-size calculation that works across all possible scenarios.
We tested extensively against CUDA using optimized threadgroup sizes, and that was when we managed to beat an A6000 GPU with the M1 Max. At the time, the A6000 GPU was even more expensive than the 64 GB M1 Max MacBook Pro. It took breaking the large kernel (a 3D operation) into smaller kernels, but in the end we were pretty satisfied. This helped us create a tool that has been adopted by many labs. If this flag is still considered too disruptive (and I can't really see why, because it does not compromise the default operation at all), we can just use our fork with that flag and leave it at that. We should be able to keep our fork in sync with the upstream branch for a while, as the change is very minor.
#2242 changes it so that MLX now throws if the grid couldn't be launched. I am inclined to close this, but if you still think it makes sense to have an automatically calculated threadgroup I will leave it open and take another look this week.
Thanks @angeloskath, feel free to close it. As I mentioned in my previous comment, if we really need the automatic group-size calculation, we can do it in our fork. Right now, @acoreas is trying to make the MLX version as fast as py-metal-compute; we're not there yet, but he is exploring some tweaks here and there. We may ask for advice if we can't find the final tweak.
As a side note, the MLX approach to custom kernels is a bit different from CUDA, OpenCL and py-metal-compute (or Metal+Swift), in the sense that the programmer does not have the same granular control over the parameters passed to the kernel. For example, there is no way to reuse the same buffer for read+write operations as in regular Metal kernels interfaced with Swift. MLX assumes input buffers are distinct from outputs, so we ended up doing some gymnastics to match what we can do in regular Metal. We already had to do some gymnastics before, encapsulating different conceptual variables in the same buffer because Metal caps the number of buffers that can be passed. Not being able to reuse the same buffers for simultaneous read+write adds another layer of complexity, which is why we suspect we can't yet match the performance of py-metal-compute.
Is there any plan to support more traditional Metal kernel coding, where the programmer has full control over how each buffer is operated on?
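For concreteness, this is the kind of ping-pong pattern we mean by "gymnastics": the state has to go in as an input and come back out as a freshly allocated output each step, instead of being updated in place. A minimal toy sketch (not the actual VWE code; the kernel body and names are made up):

```python
import mlx.core as mx

# Toy update step written in the input -> new output style that MLX custom
# kernels require; the new state is swapped in for the next call.
source = """
    uint i = thread_position_in_grid.x;
    state_out[i] = state_in[i] + 0.1f * force[i];
"""

step = mx.fast.metal_kernel(
    name="toy_step",
    input_names=["state_in", "force"],
    output_names=["state_out"],
    source=source,
)

state = mx.zeros((4096,), dtype=mx.float32)
force = mx.ones((4096,), dtype=mx.float32)

for _ in range(10):
    state = step(
        inputs=[state, force],
        grid=(state.size, 1, 1),
        threadgroup=(256, 1, 1),
        output_shapes=[state.shape],
        output_dtypes=[state.dtype],
    )[0]  # the freshly written output becomes the input of the next step
mx.eval(state)
```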
I just wanted to add that we've managed to get our VWE function working with MLX. However, we sometimes need to make multiple calls to this function, and this causes it to fail in both the notebook and the script version of the code. On the M1 Mac, I get the following error:
2025-06-18 09:34:33.073 python[3851:62177] Execution of the command buffer was aborted due to an error during execution. Discarded (victim of GPU error/recovery) (00000005:kIOGPUCommandBufferCallbackErrorInnocentVictim)
libc++abi: terminating due to uncaught exception of type std::runtime_error: [METAL] Command buffer execution failed: Discarded (victim of GPU error/recovery) (00000005:kIOGPUCommandBufferCallbackErrorInnocentVictim)
It runs fine if I run the calls individually, but not consecutively, so I'm guessing it's an issue with memory management? Any help on the matter is greatly appreciated! I've pushed my most recent changes to the same repo. Thanks!
That seems to have fixed the issue. I believe it was trying to use the cached kernel from the first call, but the second call has different parameters and a different number of outputs, so it was getting addressing errors. I tried using mx.clear_cache() between calls, but that resulted in incomplete plots for the second call; I'm not sure why. My workaround was to add more parameter details to the kernel name to avoid the cached version being used.
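In code, the workaround looks roughly like this. It is a hypothetical sketch of the naming trick, not the exact code in the repo; get_vwe_kernel and the name format are made up for illustration:

```python
import mlx.core as mx

_KERNEL_CACHE = {}

def get_vwe_kernel(shape, n_outputs, source, input_names, output_names):
    # Bake call-specific details into the kernel name so that calls with
    # different shapes or numbers of outputs never pick up each other's
    # cached compiled kernel.
    name = f"vwe_{'x'.join(map(str, shape))}_nout{n_outputs}"
    if name not in _KERNEL_CACHE:
        _KERNEL_CACHE[name] = mx.fast.metal_kernel(
            name=name,
            input_names=input_names,
            output_names=output_names,
            source=source,
        )
    return _KERNEL_CACHE[name]
```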
This exact issue was fixed in https://github.com/ml-explore/mlx/pull/2242 - but it hasn't been released yet.
Indeed a good workaround is to use a more verbose name.