Use L3 BLAS in LARFT

Open AGonzales-amd opened this issue 1 year ago • 0 comments

This PR introduces a potential optimization to the LARFT routine. The modification reduces the size of the gemv computations by offloading the block part of the computation to a gemm call. In some cases the modified method performs worse than the original, so the original is dispatched in those cases instead.
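As a rough illustration of the idea (not the PR's actual kernels), the sketch below builds the triangular factor T in plain Python for real data, forward direction and column-wise storage. The names `larft_gemv` and `larft_gemm` are hypothetical, V is assumed to store its unit diagonal and upper zeros explicitly, and the dispatch between the two paths is not modeled.

```python
def larft_gemv(V, tau):
    """Reference path: one gemv over all n rows of V per column of T."""
    n, k = len(V), len(tau)
    T = [[0.0] * k for _ in range(k)]
    for i in range(k):
        T[i][i] = tau[i]
        # gemv: w = V[:, :i]^T * v_i, touching all n rows
        w = [sum(V[r][j] * V[r][i] for r in range(n)) for j in range(i)]
        # trmv: T[:i, i] = -tau_i * T[:i, :i] * w (T[:i, :i] is upper triangular)
        for p in range(i):
            T[p][i] = -tau[i] * sum(T[p][q] * w[q] for q in range(p, i))
    return T


def larft_gemm(V, tau):
    """Sketch of the modified path: rectangular block handled by one gemm."""
    n, k = len(V), len(tau)
    # gemm: G = V2^T * V2, where V2 = V[k:, :] is the rectangular block
    G = [[sum(V[r][j] * V[r][i] for r in range(k, n)) for i in range(k)]
         for j in range(k)]
    T = [[0.0] * k for _ in range(k)]
    for i in range(k):
        T[i][i] = tau[i]
        # small gemv over rows i..k-1 of the triangular block only
        # (V[r][i] == 0 for r < i), plus the precomputed gemm contribution
        w = [G[j][i] + sum(V[r][j] * V[r][i] for r in range(i, k))
             for j in range(i)]
        for p in range(i):
            T[p][i] = -tau[i] * sum(T[p][q] * w[q] for q in range(p, i))
    return T


# Example input: n = 5 rows, k = 3 reflectors, unit lower trapezoidal V
V = [[1.0, 0.0, 0.0],
     [0.5, 1.0, 0.0],
     [-0.25, 0.75, 1.0],
     [0.1, -0.2, 0.3],
     [0.4, 0.6, -0.8]]
tau = [1.2, 0.7, 1.5]
T_ref = larft_gemv(V, tau)
T_new = larft_gemm(V, tau)
```

Both functions compute the same T up to rounding. In the second one, the per-column gemv work shrinks from n rows to at most k rows, while the n-k remaining rows are handled by a single matrix product.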

The following are LARFT performance figures for the configuration k=64, rocblas_forward_direction and rocblas_column_wise (the configuration used by QR factorization).

Single Precision
Double Precision
Complex Single Precision
Complex Double Precision

column_wise and forward_direction has similar performance characteristics, while row_wise configurations show improvements for all data types, e.g.,

Single Precision

This also shows that the gains with the row_wise configuration are more significant (~18x vs ~3x speedup).

Curiously, GEQRF shows mixed performance with this modification. The following figures were generated with square matrices.

Single Precision
Double Precision
Complex Single Precision
Complex Double Precision

Although performance mostly improves, there are some cases where it degrades with the modification.

Aug 20 '24 23:08 AGonzales-amd