rocSOLVER
Potential improvement to set/restore_diag in GEQR2
This PR aims to reduce the impact of the `set_diag` and `restore_diag` kernels on the runtime of GEQR2, as indicated by profiling. This is achieved by:
- Combining `larfg` and `set_diag` to reduce the number of global memory reads and writes:
  - This is achieved by modifying `larfg` to write both the unit diagonal and non-unit diagonal values, thus eliminating the call to `set_diag`.
- Reduce kernel launch overhead of `set_diag` and `restore_diag`:
  - `set_diag` is explained above. Launch overhead of `restore_diag` is reduced by launching the kernel once to restore all diagonal values at the expense of additional memory footprint.
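To make the two changes concrete, here is a minimal CPU-side sketch in plain C++ for the real case. It is not the rocSOLVER kernel code: the function names `larfg_set_diag` and `restore_diag_all`, the `at` helper, and the separate `diag` buffer are all hypothetical and only illustrate the idea of having the reflector generation write both diagonal values, then restoring every diagonal entry in a single pass.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Column-major element access (hypothetical helper for this sketch).
inline double& at(std::vector<double>& A, int lda, int i, int j)
{
    return A[static_cast<std::size_t>(i) + static_cast<std::size_t>(j) * lda];
}

// Hypothetical fused "larfg + set_diag" for column j of an m-row matrix.
// Generates the Householder reflector for A(j:m-1, j), stores the non-unit
// diagonal value (beta) in diag[j], leaves the unit diagonal (1) in A(j, j),
// and returns tau. Real case only; no overflow/underflow safeguards.
double larfg_set_diag(std::vector<double>& A, int m, int lda, int j,
                      std::vector<double>& diag)
{
    const double alpha = at(A, lda, j, j);

    // Squared norm of the sub-diagonal part of the column.
    double xnorm2 = 0.0;
    for(int i = j + 1; i < m; ++i)
        xnorm2 += at(A, lda, i, j) * at(A, lda, i, j);

    double tau = 0.0;
    double beta = alpha;
    if(xnorm2 != 0.0)
    {
        beta = -std::copysign(std::sqrt(alpha * alpha + xnorm2), alpha);
        tau = (beta - alpha) / beta;
        const double scale = 1.0 / (alpha - beta);
        for(int i = j + 1; i < m; ++i)
            at(A, lda, i, j) *= scale;
    }

    diag[j] = beta;          // non-unit diagonal value, kept in the extra buffer
    at(A, lda, j, j) = 1.0;  // unit diagonal, so no separate set_diag is needed
    return tau;
}

// Hypothetical single-pass restore: writes all k saved diagonal values back.
// On the GPU this would correspond to one kernel launch instead of one per column.
void restore_diag_all(std::vector<double>& A, int lda,
                      const std::vector<double>& diag, int k)
{
    for(int j = 0; j < k; ++j)
        at(A, lda, j, j) = diag[j];
}
```

In this sketch the extra memory footprint is the `diag` buffer, which holds one value per reflector until the final restore.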
The following chart shows the speedup of geqrf with these changes on real single precision square matrices.
Note:
- I tried the suggestion of using `larfb` instead of `larf`, but it performed worse due to increased global memory access. I got similar results with my attempt to modify `larf` to assume an implicit unit diagonal.
- This is my attempt at a solution to this problem, and I am open to trying other suggestions.