Use L3 BLAS in LARFT
This PR introduces a potential optimization to the LARFT routine. The modification reduces the size of the gemv computations and offloads the block part of the computation to a single gemm call. In cases where the modified method performs worse than the original, the original is dispatched instead.
The following figures show LARFT performance with k=64, rocblas_forward_direction, and rocblas_column_wise (the configuration used by QR factorization).
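To illustrate the idea, here is a minimal pure-Python sketch for the forward, column-wise case. It is not the rocSOLVER kernel. It assumes that "the block part" means the rectangular rows of V below the leading k-by-k unit lower triangle, and that V is stored explicitly with its zeros and ones. The names `larft_gemv`, `larft_gemm`, and `trmv_column` are hypothetical.

```python
def trmv_column(T, w, i):
    """Store T[:i, i] = T[:i, :i] @ w, reading only the upper triangle of T."""
    for j in range(i):
        T[j][i] = sum(T[j][l] * w[l] for l in range(j, i))


def larft_gemv(V, tau):
    """Reference path: one gemv over rows i..n-1 per column, then a trmv."""
    n, k = len(V), len(tau)
    T = [[0.0] * k for _ in range(k)]
    for i in range(k):
        # gemv: w = -tau[i] * V[i:, :i]^T V[i:, i]  (length grows with n)
        w = [-tau[i] * sum(V[r][j] * V[r][i] for r in range(i, n))
             for j in range(i)]
        trmv_column(T, w, i)
        T[i][i] = tau[i]
    return T


def larft_gemm(V, tau):
    """Split path (assumed): one gemm on the rectangular block, short gemvs on the triangle."""
    n, k = len(V), len(tau)
    # gemm: G = V2^T V2 with V2 = V[k:, :], computed once for all columns
    G = [[sum(V[r][j] * V[r][i] for r in range(k, n)) for i in range(k)]
         for j in range(k)]
    T = [[0.0] * k for _ in range(k)]
    for i in range(k):
        # gemv restricted to the triangular block V1 = V[:k, :] (at most k rows)
        w = [-tau[i] * (G[j][i] + sum(V[r][j] * V[r][i] for r in range(i, k)))
             for j in range(i)]
        trmv_column(T, w, i)
        T[i][i] = tau[i]
    return T


# Small example: n = 6 rows, k = 3 reflectors, V unit lower trapezoidal.
V = [
    [1.0, 0.0, 0.0],
    [0.5, 1.0, 0.0],
    [-0.25, 0.5, 1.0],
    [0.1, 0.2, 0.3],
    [0.4, -0.5, 0.6],
    [-0.7, 0.8, 0.9],
]
tau = [1.2, 0.8, 1.5]
T_ref = larft_gemv(V, tau)
T_new = larft_gemm(V, tau)
```

Both functions should produce the same upper-triangular T up to rounding. In the split version, the per-column gemv no longer depends on n, which is where a gain would be expected for tall V. The fallback dispatch to the original path is not shown, since its selection criteria are not stated here.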
- Single Precision
- Double Precision
- Complex Single Precision
- Complex Double Precision
column_wise and forward_direction have similar performance characteristics, while row_wise configurations show improvements for all data types, e.g.,
- Single Precision
This also shows that the gains with the row_wise configuration are more significant (~18x vs ~3x speedup).
Curiously, GEQRF shows mixed performance with this modification. The following figures were generated with square matrices.
- Single Precision
- Double Precision
- Complex Single Precision
- Complex Double Precision
Although performance improves in most cases, there are some cases where it degrades with the modification.