wgpu-mm

sgemm performance

Open audiovention opened this issue 1 year ago • 2 comments

Hey man, very cool repo. I've been working on some ML stuff with WebGPU lately, so I've been playing around with SGEMM kernels too. I took a similar path to your repo: I started with the very well explained optimized CUDA kernels from https://github.com/siboehm/SGEMM_CUDA

I took Kernel 9 from there, ported it, vectorized it a bit more, and played with the parameters, and I get very good performance. Results on the 10-core GPU of an M2 Mac mini (3.6 TFLOPS limit, if I'm not mistaken):

- 1024x1024x1024 -> 1.2 TFLOPS
- 2048x2048x2048 -> 2.2 TFLOPS
- 4096x4096x4096 -> 2.4 TFLOPS
- 2048x4096x2048 -> 2.6 TFLOPS
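For anyone comparing numbers, here is a rough sketch of how throughput figures like these are usually derived. It assumes the common convention of counting 2·M·N·K floating-point operations per matmul (one multiply and one add per inner-product term); the post does not say which convention or timing method was actually used, and the helper names are made up for illustration.

```python
def sgemm_flops(m: int, n: int, k: int) -> int:
    """FLOPs for an (m x k) @ (k x n) matmul, counting multiply and add separately.

    Assumption: the 2*M*N*K convention. The product is symmetric in the three
    dimensions, so the order in which a size like 2048x4096x2048 is written
    does not change the count.
    """
    return 2 * m * n * k


def tflops(m: int, n: int, k: int, seconds: float) -> float:
    """Achieved throughput in TFLOPS for one matmul that took `seconds`."""
    return sgemm_flops(m, n, k) / seconds / 1e12


def implied_seconds(m: int, n: int, k: int, rate_tflops: float) -> float:
    """Time per matmul implied by a reported throughput."""
    return sgemm_flops(m, n, k) / (rate_tflops * 1e12)


PEAK_TFLOPS = 3.6  # the limit quoted in the post

# Sizes and throughputs exactly as reported in the post.
reported = {
    (1024, 1024, 1024): 1.2,
    (2048, 2048, 2048): 2.2,
    (4096, 4096, 4096): 2.4,
    (2048, 4096, 2048): 2.6,
}

for (m, n, k), rate in reported.items():
    ms = implied_seconds(m, n, k, rate) * 1e3
    print(f"{m}x{n}x{k}: {ms:.2f} ms per matmul, {rate / PEAK_TFLOPS:.0%} of peak")
```

Under that assumption, a 1024x1024x1024 matmul at 1.2 TFLOPS takes under 2 ms, so dispatch and timing overhead can easily skew small-size measurements.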

I think it's mostly about tuning the parameters to get register pressure right. I couldn't reproduce the tinygrad results, but I see that your script runs at N=2048 while you're testing WebGPU at N=1024. This might lead to some discrepancies: I think at N=1024 you can't possibly saturate the GPU cores.
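To make the register-pressure and saturation points concrete, here is a back-of-the-envelope sketch. The tile sizes (`BM`, `BN`, `TM`, `TN`) are hypothetical values in the style of the siboehm blocktiling kernels, not the parameters actually used here, and how many workgroups a core needs to stay busy depends on the hardware, so treat the output as an illustration rather than evidence.

```python
import math


def workgroups(m: int, n: int, bm: int, bn: int) -> int:
    """Number of workgroups when each one computes a BM x BN tile of the output."""
    return math.ceil(m / bm) * math.ceil(n / bn)


def values_per_thread(tm: int, tn: int) -> int:
    """Floats a thread keeps live in a 2D-blocktiled kernel.

    TM * TN accumulators, plus TM + TN operand values staged from shared memory.
    This is a lower bound: indices and addresses need registers as well.
    """
    return tm * tn + tm + tn


# Hypothetical tuning parameters, chosen only for illustration.
BM, BN = 128, 128  # output tile per workgroup
TM, TN = 8, 8      # output sub-tile per thread
GPU_CORES = 10     # core count quoted in the post

print(f"live floats per thread: {values_per_thread(TM, TN)}")
for size in (1024, 2048, 4096):
    wg = workgroups(size, size, BM, BN)
    print(f"N={size}: {wg} workgroups, {wg / GPU_CORES:.1f} per core")
```

With these assumed tiles, N=1024 yields only 64 workgroups across 10 cores, while N=2048 yields 256. Larger thread tiles raise arithmetic intensity but also raise the per-thread register count, which is the trade-off the tuning is balancing.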

Let me know if you're interested in the kernels, I'll send you some code.

audiovention, Nov 20 '23 09:11