rocSOLVER
(WIP) Optimize latrd
This is an attempt to optimize latrd by keeping the two narrow column panels, "A" and "W", in LDS (shared memory) and using a cooperative kernel launch to synchronize all thread blocks while computing the four GEMV (matrix-vector) products.
However, it does not improve performance.
I am opening this PR in case further improvements become possible later.
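For readers unfamiliar with the four GEMV products mentioned above, here is a minimal CPU-only sketch of them, assuming the sequence used by reference LAPACK `latrd` in the lower-triangular case. It is not the kernel in this PR. The names `Panel`, `gemv_t`, `gemv_n_sub`, and `latrd_panel_update` are hypothetical, and `w` is assumed to already hold the product of the trailing symmetric block with the reflector `v`. Each step consumes the full result of the previous one, which is presumably why a fused GPU version needs a grid-wide synchronization between steps.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Column-major panel view: element (r, c) is stored at data[r + c * ld].
// Hypothetical helper type, used only for this illustration.
struct Panel {
    const double* data;
    std::size_t rows, cols, ld;
    double at(std::size_t r, std::size_t c) const { return data[r + c * ld]; }
};

// Returns y = P^T * x, where x has P.rows entries and y has P.cols entries.
std::vector<double> gemv_t(const Panel& p, const std::vector<double>& x) {
    std::vector<double> y(p.cols, 0.0);
    for (std::size_t c = 0; c < p.cols; ++c)
        for (std::size_t r = 0; r < p.rows; ++r)
            y[c] += p.at(r, c) * x[r];
    return y;
}

// Updates y -= P * x, where x has P.cols entries and y has P.rows entries.
void gemv_n_sub(const Panel& p, const std::vector<double>& x,
                std::vector<double>& y) {
    for (std::size_t c = 0; c < p.cols; ++c)
        for (std::size_t r = 0; r < p.rows; ++r)
            y[r] -= p.at(r, c) * x[c];
}

// Applies the four panel GEMVs to w (assumed to hold A22 * v on entry):
//   w -= A_panel * (W_panel^T * v)
//   w -= W_panel * (A_panel^T * v)
void latrd_panel_update(const Panel& a_panel, const Panel& w_panel,
                        const std::vector<double>& v,
                        std::vector<double>& w) {
    std::vector<double> t = gemv_t(w_panel, v);  // GEMV 1: t = W^T v
    gemv_n_sub(a_panel, t, w);                   // GEMV 2: w -= A t
    t = gemv_t(a_panel, v);                      // GEMV 3: t = A^T v
    gemv_n_sub(w_panel, t, w);                   // GEMV 4: w -= W t
}
```

Both panels have only as many columns as the number of columns already reduced in the current block, so they are tall and narrow. That shape is what makes holding them in LDS plausible in the first place.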