wgpu-mm
sgemm performance
Hey man, very cool repo. I've been working on some ML stuff with WebGPU lately too, so I've been playing around with sgemm kernels. I took a similar path to your repo: I started with the very well explained optimized CUDA kernels from https://github.com/siboehm/SGEMM_CUDA
I took Kernel 9 from there, ported it, vectorized it a bit more, and played with the parameters, and I get very good performance. Results on the 10-core GPU of an M2 Mac mini (3.6 TFLOPS limit, if I'm not mistaken):

- 1024x1024x1024 -> 1.2 TFLOPS
- 2048x2048x2048 -> 2.2 TFLOPS
- 4096x4096x4096 -> 2.4 TFLOPS
- 2048x4096x2048 -> 2.6 TFLOPS
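For reference, a minimal sketch of how TFLOPS figures like these are typically derived, assuming the usual 2*M*N*K FLOP count for a matrix multiply. The function names and the timing value are hypothetical and not taken from any benchmark code in this thread; the 3.6 TFLOPS peak is the unverified figure quoted above.

```python
def gemm_flops(m: int, n: int, k: int) -> int:
    """FLOPs for C[m,n] = A[m,k] @ B[k,n]: one multiply and one add per inner-product term."""
    return 2 * m * n * k


def tflops(m: int, n: int, k: int, seconds: float) -> float:
    """Achieved throughput in TFLOPS for a single matmul that took `seconds`."""
    return gemm_flops(m, n, k) / seconds / 1e12


# Peak quoted in the post above ("if I'm not mistaken"), so treat it as an assumption.
PEAK_TFLOPS = 3.6


def utilization(achieved_tflops: float, peak_tflops: float = PEAK_TFLOPS) -> float:
    """Fraction of the assumed peak that a kernel reaches."""
    return achieved_tflops / peak_tflops


if __name__ == "__main__":
    # Hypothetical per-matmul time, for illustration only (not a measurement).
    elapsed_s = 0.060
    achieved = tflops(4096, 4096, 4096, elapsed_s)
    print(f"{achieved:.2f} TFLOPS, {utilization(achieved):.0%} of assumed peak")
```

In practice the elapsed time would be averaged over many runs after a warm-up, since a single dispatch is dominated by launch and synchronization overhead.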
I think it's mostly about tuning the parameters to get register pressure right. I couldn't reproduce the tinygrad results, but I see that your script runs tinygrad at N=2048 while you're testing WebGPU at N=1024, which might explain some of the discrepancy. I don't think you can saturate the GPU cores at N=1024.
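A rough way to see why problem size matters: with one workgroup per output tile, the number of workgroups grows with the square of N. The sketch below assumes a 128x128 output tile per workgroup, which is a hypothetical choice for illustration and not a parameter taken from either set of kernels.

```python
import math


def workgroup_count(m: int, n: int, tile_m: int = 128, tile_n: int = 128) -> int:
    """Workgroups launched when each one computes a tile_m x tile_n block of the output."""
    return math.ceil(m / tile_m) * math.ceil(n / tile_n)


if __name__ == "__main__":
    for size in (1024, 2048, 4096):
        # Tile size is an assumed value; real kernels tune it per device.
        print(f"N={size}: {workgroup_count(size, size)} workgroups")
```

With fewer workgroups in flight there is less parallel work to spread across the cores and to hide memory latency, so comparing one backend at N=1024 against another at N=2048 is not like for like.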
Let me know if you're interested in the kernels, I'll send you some code.