Philip Turner

340 comments of Philip Turner

That 1024 is just to fool the compiler. It doesn't allocate any blocks. You can increase the accumulator size by trying 64x64, 80x80, or 96x96 instead of 48x48.

> I'm...
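For context, a minimal host-side sketch of how a block size like 64x64 could be selected at pipeline-creation time through Metal function constants, without touching the shader source. The kernel name `gemm` and the constant indices are hypothetical; match them to whatever the actual shader declares.

```swift
import Metal

// Hypothetical kernel: assumes the shader declares its accumulator block
// edges as [[function_constant(0)]] and [[function_constant(1)]].
func makeGEMMPipeline(device: MTLDevice,
                      blockM: UInt16,
                      blockN: UInt16) throws -> MTLComputePipelineState {
  var m = blockM   // e.g. 64, 80, or 96 instead of 48
  var n = blockN
  let constants = MTLFunctionConstantValues()
  constants.setConstantValue(&m, type: .ushort, index: 0)
  constants.setConstantValue(&n, type: .ushort, index: 1)

  let library = try device.makeDefaultLibrary(bundle: .main)
  let function = try library.makeFunction(name: "gemm",
                                          constantValues: constants)
  return try device.makeComputePipelineState(function: function)
}
```

If the kernel takes its threadgroup block as a runtime-sized argument, the real allocation comes from the encoder's `setThreadgroupMemoryLength(_:index:)` call, not from any size written in the shader.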

Out of curiosity, what is MPS performance for Float16?

Also, if it's just 1 split, you could fine-tune (40x40, 56x56, 72x72). I think I at least know how to get performance equal to MPS, if you use a very...
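A sketch of what that fine-tuning could look like: time one dispatch per candidate using the command buffer's GPU timestamps and keep the fastest. The `encode` closure is an assumption here; the caller supplies it, and it encodes and commits one GEMM at the given block edge.

```swift
import Metal

// Sweep candidate block edges (e.g. 40, 56, 72) and return the fastest.
func fastestBlockSize(candidates: [UInt16],
                      encode: (UInt16) -> MTLCommandBuffer) -> UInt16 {
  precondition(!candidates.isEmpty)
  var best = (size: candidates[0], seconds: Double.infinity)
  for size in candidates {
    let commandBuffer = encode(size)  // caller commits the buffer
    commandBuffer.waitUntilCompleted()
    // GPU-side timestamps exclude CPU encoding overhead.
    let seconds = commandBuffer.gpuEndTime - commandBuffer.gpuStartTime
    if seconds < best.seconds {
      best = (size, seconds)
    }
  }
  return best.size
}
```

In practice you would run each candidate several times and take the minimum, since single-dispatch timings on Apple GPUs are noisy.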

I doubt it. Also, M3 is not that hard, compared to M1. I just haven't had anything motivating me to patch up MFA yet.

FP32 is the first step before delving into more advanced types, such as truncated FP32 (brain float) and half precision (FP16), which gives up most of FP32's dynamic range.

https://gist.github.com/philipturner/3bda14e876a635e73745c42f2eb240c8

Optimal block sizes: 32x32x32...
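To make the two 16-bit formats concrete, a minimal sketch in Swift; `bfloat16Bits` is an illustrative helper, not part of any library. Brain float is literally the top half of an FP32 bit pattern (full 8-bit exponent, 7 mantissa bits), while FP16 keeps 10 mantissa bits but shrinks the exponent to 5 bits.

```swift
// Truncate an FP32 value to bfloat16 by keeping its top 16 bits.
// (Rounding to nearest even is omitted for brevity.)
func bfloat16Bits(_ x: Float) -> UInt16 {
  UInt16(truncatingIfNeeded: x.bitPattern >> 16)
}

let x: Float = 3.14159265
let bf16 = bfloat16Bits(x)                        // brain float bits
let bf16Value = Float(bitPattern: UInt32(bf16) << 16)
let fp16Value = Float16(x)  // half precision; requires Apple silicon on macOS
print(bf16Value, Float(fp16Value))  // both near 3.14, with different error
```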

Closing because this issue is stale. The entire repository has been overhauled; it now targets a very large combinatorial matrix of hardware + problem size + precision + whatever features...

You would have to call into Metal directly using the PyObjC bindings; check out how Tinygrad does this. Or use a combination of Swift and PythonKit. However, H3 seems...
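For the second route, a minimal sketch of the Swift + PythonKit combination, assuming the PythonKit package is a dependency and numpy is installed: the Metal calls stay in Swift, and Python supplies the data.

```swift
import Metal
import PythonKit

// Pull a float32 array out of Python and upload it to a Metal buffer.
let np = Python.import("numpy")
let list = np.random.rand(1024).astype(np.float32).tolist()
let values: [Float] = Array(list)!  // bridge Python list -> Swift array

let device = MTLCreateSystemDefaultDevice()!
let buffer = device.makeBuffer(
  bytes: values,
  length: values.count * MemoryLayout<Float>.stride,
  options: .storageModeShared)!
print("Uploaded \(values.count) floats to \(device.name)")
```

The PyObjC route inverts this: Python owns the program and reaches the same Metal objects through bridged Objective-C selectors, which is the approach the Tinygrad reference above takes.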

> Do you think FlashConv will be faster in Metal compared to CUDA?

There is no direct way to compare "speed" between Metal and CUDA, as they run on...

Some first attempts to remedy this:

1) What OS version are you on? I compiled this successfully on later macOS 13 minor versions and the macOS 14 beta. Are you directing the...