add rotc to blas
This PR adds a new routine to the BLAS: rotc.
The routine accepts an m x (n+1) matrix A and two n x k matrices C and S. The matrices C and S store the cosines and sines of Givens rotations, and the routine applies these rotations to A one by one. The amount of computational work this requires makes it a good fit for a level 3 BLAS routine.
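To make the semantics concrete, here is a plain-Python sketch of what a reference implementation might compute. The exact convention (which pair of columns rotation (i, l) touches, and the order in which the k sequences are applied) is my assumption for illustration, not something specified by this PR:

```python
def rotc_ref(A, C, S):
    """Apply k sequences of n Givens rotations to the m x (n+1) matrix A.

    Assumed convention (not from the PR): rotation (i, l) with cosine
    C[i][l] and sine S[i][l] mixes columns i and i+1 of A, and the k
    sequences are applied one after another.
    """
    m = len(A)
    n = len(C)
    k = len(C[0])
    for l in range(k):          # one rotation sequence at a time
        for i in range(n):      # rotation i acts on columns i and i+1
            c, s = C[i][l], S[i][l]
            for r in range(m):
                a0, a1 = A[r][i], A[r][i + 1]
                A[r][i] = c * a0 + s * a1
                A[r][i + 1] = -s * a0 + c * a1
    return A
```

A reference loop like this does O(m·n·k) flops on O(m·n + n·k) data, which is the level-3 ratio the PR text alludes to.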
This routine would be very useful for optimizing eigenvalue algorithms. The QR-SVD algorithm currently uses lasr, which is not really optimized at all. The Hessenberg-triangular reduction and the multishift-QR/QZ implementations also apply chains of rotation sequences, but they accumulate blocks of rotations into larger matrices that can then be applied using gemm. This routine can be faster than that.
This PR just adds a reference implementation. In a separate repository, I implemented a highly optimized version for double precision to prove that this routine actually can be optimized.
At the moment, only the implementation itself is in this PR. I think I will need some help with the other things like testblas, cblas, ... Anyone willing to help?
Some minor things:
- Why BLAS and not LAPACK? Optimizing this routine requires a significant amount of hardware-specific optimization, similar to a matrix-matrix multiplication. This goes against LAPACK's idea of portable performance.
- Why are the output arguments in the middle? There is somewhat of a convention in BLAS/LAPACK to put output arguments at the end. However, I'm trying to match the level 1 BLAS routine `rot` with the current interface.
- What is `dzrotc` and why do we need it? In some cases, you may want to apply real rotations to a complex matrix, for example when computing the SVD of a complex matrix. This variant facilitates that. I think it is important enough to warrant an extra variant, especially since it is not that much extra work to provide an optimized version: it should be able to reuse the real-valued kernels.
- Why .f90 and not .f? Honestly, I just like looking at the free-form files better. There is nothing in the implementation that requires free form or modern Fortran, so I can change it if it causes issues.
- How much extra work is this for optimized BLAS implementations? Using a shuffling kernel as I did in my implementation is a bit of work, but accumulating the rotations and then applying them with `gemm` also gives reasonable performance. That can be a good alternative while the optimized implementation is being made.
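The accumulate-then-`gemm` fallback mentioned above can be sketched as follows. This is an illustrative Python sketch under the same assumed convention as before (rotation i mixes columns i and i+1, applied left to right); the helper names are mine, and the triple-loop `matmul` stands in for a real `gemm` call:

```python
def accumulate_sequence(C, S, l):
    """Fold the l-th sequence of n rotations into one (n+1) x (n+1)
    orthogonal matrix Q, so that applying the sequence to A equals A @ Q.

    Convention assumed (not from the PR): rotation i mixes columns i
    and i+1, sequences applied left to right.
    """
    n = len(C)
    Q = [[float(i == j) for j in range(n + 1)] for i in range(n + 1)]
    for i in range(n):
        c, s = C[i][l], S[i][l]
        for r in range(n + 1):
            q0, q1 = Q[r][i], Q[r][i + 1]
            Q[r][i] = c * q0 + s * q1
            Q[r][i + 1] = -s * q0 + c * q1
    return Q


def matmul(A, B):
    """Plain triple-loop product standing in for an optimized gemm."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]
```

Accumulating costs extra flops and an (n+1) x (n+1) workspace per sequence, but it turns the bulk of the work into a matrix-matrix product that any optimized BLAS already handles well.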