Use L3 BLAS in LARFT

Open AGonzales-amd opened this issue 1 year ago • 0 comments

This PR introduces a potential optimization to the LARFT routine. The modification reduces the size of the gemv computations by offloading the block part of the computation to a single gemm call. Because the modified method performs worse than the original in some cases, the original method is dispatched in those cases.
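As a rough illustration of the idea, here is a minimal NumPy sketch for the forward, column-wise case. It is not the rocSOLVER implementation: the function names `larft_gemv` and `larft_gemm`, the split of V into a top k-by-k triangular part and a rectangular block below it, and the explicit storage of the unit diagonal and zeros in V are all assumptions made for the example. The first function follows the usual column-by-column recurrence with one long gemv per column; the second forms the block's contribution with one gemm up front, so each remaining gemv only touches the triangular part.

```python
import numpy as np

def larft_gemv(V, tau):
    """Reference recurrence: one gemv over all remaining rows per column."""
    n, k = V.shape
    T = np.zeros((k, k), dtype=np.result_type(V, tau))
    for i in range(k):
        T[i, i] = tau[i]
        if i > 0:
            # gemv over n - i rows: w = V[i:, :i]^H * v_i
            w = V[i:, :i].conj().T @ V[i:, i]
            # trmv with the already computed leading block of T
            T[:i, i] = -tau[i] * (T[:i, :i] @ w)
    return T

def larft_gemm(V, tau):
    """Variant: the rectangular block is handled by a single gemm."""
    n, k = V.shape
    V1, V2 = V[:k], V[k:]      # assumed split: triangular top, block below
    W = V2.conj().T @ V2       # one gemm producing a k-by-k matrix
    T = np.zeros((k, k), dtype=np.result_type(V, tau))
    for i in range(k):
        T[i, i] = tau[i]
        if i > 0:
            # gemv now spans at most k - i rows; the rest comes from W
            w = V1[i:, :i].conj().T @ V1[i:, i] + W[:i, i]
            T[:i, i] = -tau[i] * (T[:i, :i] @ w)
    return T

def make_reflectors(n, k, seed=0):
    """Hypothetical helper: random unit lower trapezoidal V and scalars tau."""
    rng = np.random.default_rng(seed)
    V = np.tril(rng.standard_normal((n, k)), -1)
    V[np.arange(k), np.arange(k)] = 1.0
    return V, rng.standard_normal(k)
```

Both functions should return the same upper triangular T for the same inputs, for example `V, tau = make_reflectors(12, 4)` followed by `np.allclose(larft_gemv(V, tau), larft_gemm(V, tau))`. Whether the gemm variant is actually faster depends on the problem shape, which matches the PR's note that the original method is still dispatched in some cases.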

The following figures show LARFT performance with k=64, rocblas_forward_direction, and rocblas_column_wise (the configuration used for QR factorization).

  • Single Precision log_compare_slarftcf_m
  • Double Precision log_compare_dlarftcf_m
  • Complex Single Precision log_compare_clarftcf_m
  • Complex Double Precision log_compare_zlarftcf_m

The column_wise and forward_direction configurations have similar performance characteristics, while row_wise configurations show improvements for all data types, e.g.,

  • Single Precision compare_slarftrf_m

This also shows that the gains with the row_wise configuration are more significant (~18x vs. ~3x speedup).

Curiously, GEQRF shows inconsistent performance with this modification. These figures were generated with square matrices.

  • Single Precision compare_sgeqrf_m
  • Double Precision compare_dgeqrf_m
  • Complex Single Precision compare_cgeqrf_m
  • Complex Double Precision compare_zgeqrf_m

Although performance mostly improves, it degrades in some cases with the modification.

AGonzales-amd, Aug 20 '24 23:08