GPUArrays.jl
Faster (still slow) fallback matrix multiplication
Taken from the KernelAbstractions.jl performant matmul example.
I had to make a few changes. The kernel now uses `unsafe_indices`, since the algorithm does its own bounds checking; I was getting wrong results until I added that.
~I also made it so `I` and `J` are only fetched once. Not sure if the old way is outdated or is there to prevent a bug I didn't encounter.~ Edit: I guess I found out why that was there. Why is it only necessary on some backends, and why does the other way work on nightly?
Finally, I set the tile size to 16 instead of 32, since it cannot be set dynamically and Metal does not always have 1024 (32*32) threads per threadgroup available; a 16*16 tile only needs 256.
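For anyone who wants to follow the tiling logic without a GPU, here is a minimal CPU-only sketch in plain Julia. It is not the kernel from this PR and does not use KernelAbstractions.jl; the function name `tiled_matmul!` and the `tile` keyword are made up for illustration. It only shows the two ideas discussed above: the algorithm does its own bounds checking (out-of-range tile entries are zero-filled and out-of-range outputs are skipped), and each tile corresponds to `tile * tile` work items.

```julia
# Illustrative CPU sketch of tiled matmul; NOT the GPU kernel from this PR.
# `tiled_matmul!` and `tile` are hypothetical names used only here.
function tiled_matmul!(C::AbstractMatrix, A::AbstractMatrix, B::AbstractMatrix; tile::Int = 16)
    N, R = size(A)
    R2, M = size(B)
    @assert R == R2 && size(C) == (N, M)
    T = eltype(C)
    tileA = zeros(T, tile, tile)   # stands in for workgroup-local memory
    tileB = zeros(T, tile, tile)
    acc = zeros(T, tile, tile)     # one accumulator per "thread" in the tile

    # Outer loops play the role of the workgroup (group) index.
    for gj in 1:cld(M, tile), gi in 1:cld(N, tile)
        fill!(acc, zero(T))
        for t in 0:(cld(R, tile) - 1)
            # Load phase: every (i, j) "thread" loads one entry of each tile.
            # The bounds checks live here, inside the algorithm.
            for j in 1:tile, i in 1:tile
                I = (gi - 1) * tile + i      # global row of C
                J = (gj - 1) * tile + j      # global column of C
                ka = t * tile + j            # column of A for this tile
                kb = t * tile + i            # row of B for this tile
                tileA[i, j] = (I <= N && ka <= R) ? A[I, ka] : zero(T)
                tileB[i, j] = (kb <= R && J <= M) ? B[kb, J] : zero(T)
            end
            # Multiply phase: accumulate the partial product of the two tiles.
            for j in 1:tile, i in 1:tile, k in 1:tile
                acc[i, j] += tileA[i, k] * tileB[k, j]
            end
        end
        # Store phase: only write entries that actually exist in C.
        for j in 1:tile, i in 1:tile
            I = (gi - 1) * tile + i
            J = (gj - 1) * tile + j
            if I <= N && J <= M
                C[I, J] = acc[i, j]
            end
        end
    end
    return C
end
```

With `tile = 16` each tile covers 256 work items, versus 1024 for `tile = 32`, which is the limit mentioned above for Metal. The sizes do not need to be multiples of the tile size, since the padding and the guarded store handle the edges.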