GPUArrays.jl Faster (still slow) fallback matrix multiplication

Open christiangnrd opened this issue 6 months ago • 3 comments

Taken from the KernelAbstractions.jl performant matmul example.

I had to make a few changes. The main one is using unsafe_indices: the algorithm does its own bounds checking, and I was getting wrong results until I added it.
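To make "the algorithm itself does the bounds checking" concrete, here is a minimal sketch in plain Julia that emulates the tiled scheme sequentially on the CPU. It is not the kernel from this PR and uses no KernelAbstractions.jl API; `tiled_matmul` and `TILE` are hypothetical names chosen for illustration. The point is only that out-of-range tile entries are loaded as zeros and out-of-range outputs are never stored, so the loop body needs no external index checks.

```julia
# Sequential CPU emulation of a tiled matmul. NOT the PR's GPU kernel.
# `tiled_matmul` and `TILE` are hypothetical names used only for illustration.
function tiled_matmul(A::AbstractMatrix{T}, B::AbstractMatrix{T}; TILE::Int = 16) where {T}
    N, R = size(A)
    R2, M = size(B)
    R == R2 || throw(DimensionMismatch("inner dimensions must match"))
    C = zeros(T, N, M)
    ntiles = cld(R, TILE)            # number of tiles along the inner dimension
    tileA = zeros(T, TILE, TILE)     # stands in for workgroup-local memory
    tileB = zeros(T, TILE, TILE)
    # One (gi, gj) pair plays the role of one workgroup / output tile.
    for gj in 1:cld(M, TILE), gi in 1:cld(N, TILE)
        acc = zeros(T, TILE, TILE)   # one accumulator per emulated thread
        for t in 0:(ntiles - 1)
            # Load phase: each emulated thread (i, j) loads one element of each
            # tile, substituting zero when the global index is out of range.
            for j in 1:TILE, i in 1:TILE
                I = (gi - 1) * TILE + i
                J = (gj - 1) * TILE + j
                tileA[i, j] = (I <= N && t * TILE + j <= R) ? A[I, t * TILE + j] : zero(T)
                tileB[i, j] = (t * TILE + i <= R && J <= M) ? B[t * TILE + i, J] : zero(T)
            end
            # Compute phase: on a GPU this would follow a barrier.
            for j in 1:TILE, i in 1:TILE, k in 1:TILE
                acc[i, j] += tileA[i, k] * tileB[k, j]
            end
        end
        # Store phase: global indices are computed again and guarded,
        # so padded threads never write outside C.
        for j in 1:TILE, i in 1:TILE
            I = (gi - 1) * TILE + i
            J = (gj - 1) * TILE + j
            if I <= N && J <= M
                C[I, J] = acc[i, j]
            end
        end
    end
    return C
end
```

For example, `tiled_matmul(A, B; TILE = 4)` with a 5×7 `A` and a 7×3 `B` exercises partial tiles in every dimension and should agree with `A * B`.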

~I also made it so I and J are only fetched once. Not sure if the old way is outdated or to prevent a bug I didn't encounter.~ Edit: Guess I found out why that was there. Why is it only necessary for some backends, and why does the other way work on nightly?

Finally, I set the tile size to 16 instead of 32, since it cannot be set dynamically and Metal does not always have 1024 (32*32) threads per threadgroup available.
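The arithmetic behind that choice, as a small sketch in plain Julia. The helper names are hypothetical, and the per-group limit is passed in as a parameter rather than queried from any backend; the value 512 below is only an example of a limit lower than 1024, not a figure from this PR.

```julia
# Hypothetical helpers for illustration; the limit is an input, not a backend query.
threads_per_group(tile::Integer) = tile * tile   # a square tile uses one thread per element

fits(tile::Integer, max_threads::Integer) = threads_per_group(tile) <= max_threads

threads_per_group(32)   # 1024
threads_per_group(16)   # 256

fits(32, 1024)          # true:  exactly at the limit
fits(32, 512)           # false: a 32x32 tile needs 1024 threads
fits(16, 512)           # true:  a 16x16 tile needs only 256
```

Because the tile size is fixed rather than chosen at launch, it has to fit the smallest limit the kernel may encounter, which is why the smaller tile is the safer default.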

Apr 13 '25 18:04 christiangnrd
