(WIP) Optimize latrd

Open EdDAzevedo opened this issue 9 months ago • 0 comments

This is an attempt to optimize latrd by storing the two narrow column panels, "A" and "W", in LDS (shared memory) and using a cooperative kernel launch to synchronize all thread blocks while computing the four GEMV (matrix-vector) products.
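For context, here is a minimal host-side sketch of the four GEMV products in question. It assumes they correspond to the panel updates in the lower-triangular case of reference LAPACK `DLATRD`, applied after the SYMV. It is not the kernel from this PR: the names `Panel`, `gemv_t`, `gemv_n_sub`, and `panel_update` are hypothetical, and the code runs serially on the CPU. Its only purpose is to show why each step needs the full result of the previous one, which is where a grid-wide synchronization would be required on the GPU.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper type: a column-major m-by-k panel,
// where element (r, c) is stored at data[r + c * m].
struct Panel {
    std::size_t m, k;
    std::vector<double> data;
    double at(std::size_t r, std::size_t c) const { return data[r + c * m]; }
};

// Returns y = P^T * x, where x has length m and y has length k.
std::vector<double> gemv_t(const Panel& P, const std::vector<double>& x) {
    std::vector<double> y(P.k, 0.0);
    for (std::size_t c = 0; c < P.k; ++c)
        for (std::size_t r = 0; r < P.m; ++r)
            y[c] += P.at(r, c) * x[r];
    return y;
}

// Computes y -= P * x, where x has length k and y has length m.
void gemv_n_sub(const Panel& P, const std::vector<double>& x,
                std::vector<double>& y) {
    for (std::size_t c = 0; c < P.k; ++c)
        for (std::size_t r = 0; r < P.m; ++r)
            y[r] -= P.at(r, c) * x[c];
}

// Applies the four panel GEMVs to w. Assumption: w already holds the
// SYMV result and v is the current Householder vector.
void panel_update(const Panel& A, const Panel& W,
                  const std::vector<double>& v, std::vector<double>& w) {
    std::vector<double> t = gemv_t(W, v);  // GEMV 1: t = W^T v
    // On a GPU, a grid-wide sync would be needed here: every entry of t
    // must be complete before any block starts GEMV 2.
    gemv_n_sub(A, t, w);                   // GEMV 2: w -= A t
    // Grid-wide sync: GEMV 3 overwrites t, so GEMV 2 must be finished.
    t = gemv_t(A, v);                      // GEMV 3: t = A^T v
    // Grid-wide sync: t must be complete before GEMV 4 reads it.
    gemv_n_sub(W, t, w);                   // GEMV 4: w -= W t
}
```

In this sketch, the panels `A` and `W` are read twice each, which is the reuse that keeping them in LDS is meant to exploit.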

However, this change does not improve performance.

I am creating this PR in case further improvements can be made later.

Mar 14 '25 14:03 EdDAzevedo