
Add GSVD with QR factorizations, 2-by-1 CS decomposition

christoph-conrads opened this pull request 4 years ago • 84 comments

This pull request adds a generalized singular value decomposition computed by means of a QR decomposition with column pivoting and the 2-by-1 CS decomposition to LAPACK. The PR requires #405.

Given a pair of matrices A, B with appropriate dimensions, the GSVD computes a decomposition

  • A = U1 D1 R Q^*,
  • B = U2 D2 R Q^*.

I would like to have feedback on the computation of the matrix R. Currently it is always computed but this is expensive because of an additional matrix-matrix multiplication followed by an RQ decomposition. Should I make this optional?

The PR is based on #63 but provides all implementations (single-precision real, double-precision real, single-precision complex, and double-precision complex) with tests removed because of the C++ dependency. The tests can be found in fb5dfb3fc87b3707de81dda6ae7ea2043d7cb247.
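For orientation, here is a minimal NumPy sketch of the textbook QR+CSD route. It is a sketch under assumptions: no column pivoting, [A; B] has full column rank, and no cosine equals one; the 2-by-1 CS decomposition is obtained from the SVD of Q1 (the direct SVD+QR route that also comes up near the end of this thread). xGGQRCS additionally handles rank deficiency, matrix scaling, and the optional R factor.

import numpy as np

rng = np.random.default_rng(7)
m, p, n = 6, 5, 4
A = rng.standard_normal((m, n))
B = rng.standard_normal((p, n))

# QR factorization of the stacked pencil: [A; B] = [Q1; Q2] R
Q, R = np.linalg.qr(np.vstack([A, B]))
Q1, Q2 = Q[:m], Q[m:]

# 2-by-1 CS decomposition of the isometry [Q1; Q2] via the SVD of Q1:
# Q1 = U1 C V^T with C = diag(cos(theta_i))
U1, c, Vt = np.linalg.svd(Q1, full_matrices=False)

# Q1^T Q1 + Q2^T Q2 = I implies (Q2 V)^T (Q2 V) = I - C^2 = S^2, so
# normalizing the columns of Q2 V yields U2 and S = diag(sin(theta_i))
T = Q2 @ Vt.T
s = np.linalg.norm(T, axis=0)  # sin(theta_i), assumed nonzero here
U2 = T / s

# GSVD of (A, B) with right-hand factor X = V^T R; this pairing puts
# the cosines with A (sign and ordering conventions differ from xGGQRCS)
X = Vt @ R
assert np.allclose(A, (U1 * c) @ X)
assert np.allclose(B, (U2 * s) @ X)
assert np.allclose(c**2 + s**2, np.ones(n))

In this pairing the generalized singular values are the ratios c_i/s_i.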

christoph-conrads avatar Apr 22 '20 15:04 christoph-conrads

Here is the motivation for the new GSVD solver from #63:

The generalized singular value decomposition can be computed directly with xGGSVD, xGGSVD3 (see LAWN 46) or by computing QR factorizations and a 2-by-1 CS decomposition. In the following, let us call the former approach direct GSVD and the latter QR+CSD. In my master's thesis I used both approaches and it turned out that the QR+CSD was always signifcantly faster than the direct GSVD for pairs of square matrices with up to 3600 columns (see Section 3.5 of my master's thesis).

I am convinced that the speed difference is persistent across platforms because the iterative phase of the solver is much faster for QR+CSD. The direct GSVD reduces both matrices A, B to upper triangular form before the iterative solver xTGSJA is called, whereas the iterative phase of QR+CSD operates on a 2-by-1 isometric matrix Q = [Q1^T, Q2^T]^T, where Q1, Q2 are bidiagonal (A isometric <=> A^T A = I). Since the matrix Q is always well-conditioned, the CSD solver xBBCSD needs only a few iterations to compute a solution. For comparison, the maximum number of iterations per singular value/eigenvalue is

  • 6 for the CSD (see variable MAXITR in xbbcsd.f, line 356),
  • 30 for the symmetric eigenvalue problem with Francis QR (MAXIT, dsteqr.f:154), and
  • 40 for the GSVD (MAXIT, dtgsja.f:402).

christoph-conrads avatar Apr 22 '20 15:04 christoph-conrads

Codecov Report

Merging #406 (2a8c4dd) into master (4b3c7c2) will decrease coverage by 0.43%. The diff coverage is 0.29%.

:exclamation: Current head 2a8c4dd differs from pull request most recent head 8a0bef1. Consider uploading reports for the commit 8a0bef1 to get more accurate results.

@@            Coverage Diff             @@
##           master     #406      +/-   ##
==========================================
- Coverage   82.37%   81.93%   -0.44%     
==========================================
  Files        1894     1904      +10     
  Lines      190681   191698    +1017     
==========================================
+ Hits       157067   157068       +1     
- Misses      33614    34630    +1016     
Impacted Files Coverage Δ
SRC/cggqrcs.f 0.00% <0.00%> (ø)
SRC/clasrtr.f 0.00% <0.00%> (ø)
SRC/dggqrcs.f 0.00% <0.00%> (ø)
SRC/dlasrti.f 0.00% <0.00%> (ø)
SRC/dlasrtr.f 0.00% <0.00%> (ø)
SRC/sggqrcs.f 0.00% <0.00%> (ø)
SRC/slasrti.f 0.00% <0.00%> (ø)
SRC/slasrtr.f 0.00% <0.00%> (ø)
SRC/zggqrcs.f 0.00% <0.00%> (ø)
SRC/zlasrtr.f 0.00% <0.00%> (ø)
... and 11 more


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update 4b3c7c2...8a0bef1.

codecov[bot] avatar Apr 22 '20 15:04 codecov[bot]

I would like to have feedback on the computation of the matrix R. Currently it is always computed but this is expensive because of an additional matrix-matrix multiplication followed by an RQ decomposition. Should I make this optional?

Yes, because then it is possible to compute only the generalized singular values for the cost of one QR decomposition with column pivoting, unitary matrix assembly, and 2-by-1 CS value computation. AFAIK xGGSVD3() must compute the upper-triangular factor R by design and would be considerably slower in this scenario.

TODO:

  • [x] Re-read LAWN 46 and understand the xGGSVD3() algorithm
  • [ ] Benchmark this idea before modifying the code

christoph-conrads avatar Apr 23 '20 14:04 christoph-conrads

TODO:

  • [x] Re-read LAWN 46 and understand the xGGSVD3() algorithm

The whole theory behind xGGSVD3() is based on simultaneous transformations of a pair of upper triangular matrices.

christoph-conrads avatar May 02 '20 19:05 christoph-conrads

The latest xGGQRCS code contains row sorting ({s,d}LASRTR) in an attempt to handle matrix pencils where the input matrices A, B differ significantly in norm. This does not yield satisfactory singular values. The input matrices must be scaled for a backward stable algorithm (backward stable with respect to A and B individually, not with respect to [A, B]).

The remainder of this post will discuss how to compute the backward stable GSVD of (A, B) using matrix scaling and xGGQRCS (QR factorization followed by 2-by-1 CS decomposition).

Set w to the power of 2 nearest to norm(A) / norm(B) and compute the GSVD of [A; wB] with xGGQRCS. This yields orthogonal/unitary matrices U1, U2 and matrices with sine, cosine values S, C such that

  • A = U1 S X,
  • wB = U2 C X,
  • S^* S + C^* C = I,
  • X has full rank.
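For concreteness, the nearest power of two can be chosen by rounding in log2; a minimal sketch (whether xGGQRCS picks w exactly this way is not shown here):

import numpy as np

def scaling_factor(a, b):
    # power of 2 nearest to norm(A) / norm(B) in the exponent; a power
    # of 2 keeps the multiplication by w free of round-off error
    return 2.0 ** np.round(np.log2(np.linalg.norm(a) / np.linalg.norm(b)))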

We now have to adjust X and the singular values to acquire the GSVD of (A, B).

We can immediately fix the singular values: if σ is a generalized singular value of the matrix pencil (A, wB), then wσ is a generalized singular value of (A, B), because ||Ax|| / ||Bx|| = w ||Ax|| / ||wBx|| for every vector x. xGGQRCS does not return singular values. Instead it returns radians values θ such that σ = tan(θ). Thus, the correct radians value is θ' = arctan(w * tan(θ)); the updated matrices Σ, Γ can be computed with the aid of the new radians values (Σ(i,i) = sin(θ'), Γ(i,i) = cos(θ')). This operation should be numerically stable because

  • The arc tangent has maximum derivative 1 over its entire range and is perfectly conditioned.
  • w * tan(θ) will be calculated accurately because it is a floating-point multiplication.
  • tan(θ) is
    • well conditioned for small angles (less than π/4) [i.e., there are vectors x such that ||wBx|| >= ||Ax||],
    • infinite (or already negative) for angles near π/2; then the multiplication by w does not change anything [i.e., there are vectors x such that ||Ax|| > 0, ||wBx|| = 0], and
    • badly conditioned for θ much larger than π/4 (this should not happen because it would indicate the existence of vectors x with ||wBx|| << ||Ax|| although wB has norm comparable to A).

The question is now whether we need to change X and, if so, how. Let

|A|   |V1   | |Σ|
| | = |     | | | Y
|B|   |   V2| |Γ|

be the GSVD of (A, B) such that Σ^* Σ + Γ^* Γ = I. Observe

A^* A = X^* S^* S X = Y^* Σ^* Σ Y.
w^2 B^* B = X^* C^* C X = w^2 Y^* Γ^* Γ Y.

Moreover, S^* S and Σ^* Σ are diagonal, which is extremely helpful. For demonstration purposes assume that

S^* S = |0  |
        |  s|

C^* C = |1  |
        |  c|

Σ^* Σ = |0  |
        |  σ|

Γ^* Γ = |1  |
        |  γ|

Furthermore, we assume the (1,1) matrix entries arise from the matrix dimensions (meaning they are free of round-off error). With simple row scalings we can compute Y from X and the old and new singular values:


                      |0 0| |1   0| |1   0|             |0 0|
X^* S^* S X = X^* S^* |   | |     | |     | X = X^* S^* |   | Y.
                      |0 s| |0 σ/s| |0 s/σ|             |0 σ|

                      |1 0| |1   0| |1   0|             |1 0|
X^* C^* C X = X^* C^* |   | |     | |     | X = X^* C^* |   | wY.
                      |0 c| |0 γ/c| |0 c/γ|             |0 γ|

Thus

Y[0,:] = X[0,:] / w                        # first row of Y
Y[1,:] = X[1,:] * s/σ = X[1,:] * c/γ / w   # second row of Y

Observe that S(1,1) is zero but C(1,1) is not, so there is a unique row scaling for the first row of X (divide by w). Similarly, C(i,i) might be zero but then S(i,i) must be one. Moreover, whenever S(i,i) =/= 0 and C(i,i) =/= 0, there are two equations for each row; both should yield the same result.
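The same recipe in compact form, as a sketch only: it assumes every row of X is paired with one angle θ_i, with θ_i = 0 for the (0,1) pairs and θ_i = π/2 for the (1,0) pairs. This per-row layout is for illustration and is not the PR's actual bookkeeping.

import numpy as np

def fix_scaled_gsvd(theta, x, w):
    # theta: per-row angles from the GSVD of (A, wB); x: right-hand
    # factor of (A, wB); w: the scaling factor that was applied
    theta_fixed = np.arctan(w * np.tan(theta))
    with np.errstate(divide='ignore', invalid='ignore'):
        scale = np.where(
            np.cos(theta_fixed) > 0,
            np.cos(theta) / (w * np.cos(theta_fixed)),  # C-side equation
            np.sin(theta) / np.sin(theta_fixed),        # S-side equation
        )
    return theta_fixed, scale[:, None] * x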

Here is a NumPy demo with a random test matrix causing a large backward error with the current xGGQRCS version:

#!/usr/bin/python3

# This program demonstrates how to adjust singular values and right-hand side
# matrix of the generalized singular value decomposition computed with SGGQRCS
# when one of the input matrices had to be scaled.
#
# Author: Christoph Conrads (https://christoph-conrads.name) 

import numpy as np
import numpy.linalg as L


def make_matrix(x):
    return np.matrix(x, dtype=np.float32)


def main():
    eps = np.finfo(np.float32).eps

    # unmodified input matrices with norm(B) >> norm(A)
    a = make_matrix([[-8.519847412e+02, +6.469862671e+02]])
    b = make_matrix([
        [+5.485938125e+05, -4.166526250e+05],
        [+1.846850781e+05, -1.402660781e+05],
        [+5.322575625e+05, -4.042448438e+05],
        [-1.630551465e+04, +1.238360352e+04],
        [-1.286453438e+05, +9.770555469e+04],
        [-1.323287812e+05, +1.005026797e+05],
        [+5.681228750e+05, -4.314841250e+05],
        [-3.107875312e+05, +2.360408594e+05],
        [+1.456551719e+05, -1.106233281e+05],
        [+1.365355156e+05, -1.036972344e+05]
    ])
    # GSVD computed by SGGQRCS with input A, B/(2**10)
    u1 = make_matrix([[1]])
    # SGGQRCS returns square matrices but we can ignore the other columns
    # because the matrix pencil has only rank 2
    u2 = make_matrix([
        [-5.237864256e-01, +3.081814051e-01],
        [-1.694306731e-01, -5.048786998e-01],
        [-5.036462545e-01, -1.015115157e-01],
        [+1.279635727e-02, +2.352355272e-01],
        [+1.268841326e-01, -4.299084842e-01],
        [+1.265842170e-01, -9.544299543e-02],
        [-5.364804864e-01, -2.056131810e-01],
        [+2.987869084e-01, -3.556257784e-01],
        [-1.335151047e-01, -4.078340828e-01],
        [-1.268201172e-01, -2.355325669e-01]
    ])
    x = make_matrix([
        [-1.029685547e+03, +7.820372925e+02],
        [-8.520648804e+02, +6.470472412e+02]
    ])
    # SGGQRCS returns radians values instead of singular values
    theta = np.float32(1.55708992481e+00)
    # scaling factor
    w = np.power(np.float32(2), np.float32(-10))

    # check for typos
    assert L.norm(u2.H*u2 - np.eye(2)) <= np.sqrt(2) * eps
    assert theta >= 0
    assert theta <= np.pi / 2

    # add helper functions
    copy = lambda x: np.matrix(np.copy(x))
    compute_relative_error = \
        lambda a, u, d, x: L.norm(a - u*d*x) / (L.norm(a) * eps)


    # assemble diagonal matrices
    d1 = make_matrix([[0, np.sin(theta)]])
    d2 = make_matrix([[1, 0], [0, np.cos(theta)]])


    # print information
    print('norm(B) / norm(A) = {:8.2e}'.format(L.norm(b)/L.norm(a)))
    print('Relative error A: {:8.2e}'.format(compute_relative_error(a,u1,d1,x)))
    print('Relative error B: {:8.2e}'.format(compute_relative_error(b,u2,d2,x)))


    # fix values
    theta_fixed = np.arctan(w * np.tan(theta))
    d1_fixed = make_matrix([[0, np.sin(theta_fixed)]])
    d2_fixed = make_matrix([[1, 0], [0, np.cos(theta_fixed)]])

    x_fixed = copy(x) / w
    # alternative: x_fixed[1,:] *= d1[0,1]/d1_fixed[0,1]
    x_fixed[1,:] *= d2[1,1]/d2_fixed[1,1]


    # recompute backward error
    fmt = 'Relative error {:s} after fixes: {:8.2e}'
    print(fmt.format("A", compute_relative_error(a,u1,d1_fixed,x_fixed)))
    print(fmt.format("B", compute_relative_error(b,u2,d2_fixed,x_fixed)))


if __name__ == '__main__':
    main()

Script output on my computer:

norm(B) / norm(A) = 1.24e+03
Relative error A: 1.73e+00
Relative error B: 8.38e+06
Relative error A after fixes: 2.71e+00
Relative error B after fixes: 1.49e+00

christoph-conrads avatar May 02 '20 21:05 christoph-conrads

@julielangou It's done.

Given a pair of matrices (A, B), xGGQRCS computes the GSVD such that A = U1 D1 X, B = U2 D2 X.

The tests can be found in the branch qr+cs-gsvd in TESTING/cpp/ggqrcs.hpp due to the C++ and Boost.Test dependency.

Bonus functionality (not used by xGGQRCS because row sorting has no effect):

  • {s,d}LASRTR: row sorting
  • {s,d}LASRTI: sort an array of indices based on the values they reference (used by xLASRTR)
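For illustration, here are NumPy analogues of the two helpers. This is a sketch of the semantics only; the Fortran argument conventions differ and the choice of row norm is my assumption.

import numpy as np

def lasrti_like(values, indices, decreasing=True):
    # xLASRTI-style: sort an index array by the values it references
    order = np.argsort(values[np.asarray(indices)])
    return np.asarray(indices)[order[::-1] if decreasing else order]

def lasrtr_like(a, decreasing=True):
    # xLASRTR-style: reorder the rows of a matrix by row norm and
    # return the permuted matrix together with the permutation
    perm = lasrti_like(np.linalg.norm(a, axis=1), np.arange(a.shape[0]),
                       decreasing)
    return a[perm], perm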

christoph-conrads avatar May 07 '20 20:05 christoph-conrads

The branch qr+cs-gsvd with the C++ tests and approximately 90% line coverage for xGGQRCS can be found here: https://travis-ci.org/github/christoph-conrads/lapack/branches

christoph-conrads avatar May 13 '20 10:05 christoph-conrads

Benchmarks

This post compares the performance of the GSVD solvers xGGSVD3 and xGGQRCS.

Introduction

Of course, the comparison will be very limited because of the huge number of parameters relevant to the GSVD, e.g.,

  • the matrix dimensions,
  • generalized singular value distribution,
  • sparsity (all-zero columns, rows),
  • matrix condition numbers,
  • floating-point format,
  • field (real or complex values),
  • computing all or none of the GSVD matrices,
  • BLAS implementation (Netlib, OpenBLAS, MKL...),
  • number of threads,
  • compilers,
  • CPU
    • instruction set architecture
    • CPU architecture
  • ...

In this post we benchmark with matrices A = U1 D1 X, B = U2 D2 X (a generator sketch follows the list), where

  • A, B have the same number of rows,
  • the number of rows and columns is between 8 and 128 inclusive,
  • the generalized singular values are uniformly distributed either in
    • [0, π/4),
    • [0, π/2000), or
    • [1999π/8000, π/4),
  • X has condition number 2**(d/4), d being the number of significand digits.
  • U1, U2 are random,
  • all matrices must be computed or none,
  • Netlib BLAS,
  • CPU time is measured,
  • at least 100 random matrices must be decomposed and the CPU time must sum up to more than one second.
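The generator below is a hypothetical NumPy re-creation of this setup; the actual benchmark code is the C++ file referenced next, make_pencil and random_orthogonal are illustrative names, and the exact distributions may differ.

import numpy as np

def random_orthogonal(rng, n):
    # QR of a Gaussian matrix yields a Haar-distributed orthogonal factor
    q, r = np.linalg.qr(rng.standard_normal((n, n)))
    return q * np.sign(np.diag(r))

def make_pencil(rng, m, p, n, theta_lo, theta_hi, dtype=np.float32):
    d = np.finfo(dtype).nmant + 1                 # significand digits
    r = min(m, p, n)
    theta = rng.uniform(theta_lo, theta_hi, r)    # angles in the given range
    sv = np.logspace(0.0, -d / 4.0, r, base=2.0)  # cond(X) = 2**(d/4)
    x = sv[:, None] * random_orthogonal(rng, n)[:r]
    a = (random_orthogonal(rng, m)[:, :r] * np.sin(theta)) @ x
    b = (random_orthogonal(rng, p)[:, :r] * np.cos(theta)) @ x
    return a.astype(dtype), b.astype(dtype)

# e.g., generalized singular values distributed uniformly in [0, pi/4):
A, B = make_pencil(np.random.default_rng(0), 128, 128, 64, 0.0, np.pi / 4)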

Code: TESTING/cpp/gsvd_benchmarks.cpp

Attention: xGGSVD3 computes a decomposition A = U1 D1 R Q^*, B = U2 D2 R Q^*. Depending on the problem at hand, it may not suffice to measure only execution time of the GSVD solver.

Results

  • xGGQRCS is faster than xGGSVD3 if A, B are (almost) square.
  • xGGQRCS is the slowest relative to xGGSVD3 if A, B have only a small number of columns compared to the number of rows. (Remember A, B have the same number of rows in the benchmarks.)
  • Switching from real to complex arithmetic has a smaller impact on CPU time for xGGQRCS than for xGGSVD3.
  • When computing only generalized singular values, xGGQRCS has a smaller CPU time reduction than xGGSVD3.
  • When computing only generalized singular values and for matrices with more columns than rows, the larger the number of columns, the more xGGQRCS slows down relative to xGGSVD3.
  • The larger the matrices, the larger the maximum differences in CPU time.
  • xGGQRCS needs a larger workspace than xGGSVD3.
  • The larger the number of rows, the larger the workspace of xGGQRCS relative to xGGSVD3.

Plots

The plots below compare the CPU time consumed by xGGQRCS with that of xGGSVD3 in single precision when

  • computing generalized singular values and matrices (upper row),
  • computing only generalized singular values (lower row),

in

  • real arithmetic (SGGQRCS, SGGSVD3) in the left half,
  • complex arithmetic (CGGQRCS, CGGSVD3) in the right half.

The colors indicate the number of matrix rows; the brighter the colors, the larger the number of rows.

The x axis shows the number of columns in the matrices, the y axis shows the CPU time needed by xGGQRCS divided by the CPU time needed by xGGSVD3.

xGGQRCS-vs-xGGSVD3-cputime-float32

christoph-conrads avatar May 17 '20 11:05 christoph-conrads

I would worry about one thing: I'm pretty sure the SVD is a little better at identifying the numerical rank than RRQR. I haven't looked into this, but is it possible that the direct GSVD is then also better at this than this method?

The backward error introduced by a QR transformation of a matrix K is on the order of eps ||K|| (see, e.g., Matrix Computations, 4th ed., Section 5.2) meaning all small singular values of the R factor will have a large absolute error. Consequently, this makes rank determination heuristic.

Both GSVD and QR+CS initially apply one or more permuted QR factorizations, see xGGSVP3 or Computing the Generalized Singular Value Decomposition (Equation (1.5)) and xGGQRCS. Based on the fact that the QR factorization is a large backward error source, I do not see one method being generally better than the other with respect to rank determination.

christoph-conrads avatar Jan 16 '21 11:01 christoph-conrads

@thijssteel Please let me know if you are satisfied by the changes.

christoph-conrads avatar Feb 14 '21 19:02 christoph-conrads

Benchmarks

This post compares the performance of the GSVD solvers xGGSVD3 and xGGQRCS.

[snip]

Plots

The plots below compare the CPU time consumed by xGGQRCS with that of xGGSVD3 in single precision when

  • computing generalized singular values and matrices (upper row),
  • computing only generalized singular values (lower row),

in

  • real arithmetic (SGGQRCS, SGGSVD3) in the left half,
  • complex arithmetic (CGGQRCS, CGGSVD3) in the right half.

The colors indicate the number of matrix rows; the brighter the colors, the larger the number of rows.

The x axis shows the number of columns in the matrices, the y axis shows the CPU time needed by xGGQRCS divided by the CPU time needed by xGGSVD3.

Intel Cascade Lake results (commit 6a61522bd8fd74dadf406bef1c628f94b3947216) on a virtual private server. Note that CGGQRCS now takes more than twice the CPU time of CGGSVD3 for matrix pairs with fewer than 20 columns when only singular values are requested. This difference was less pronounced in the last set of benchmarks.

xGGQRCS-vs-xGGSVD3-CascadeLake-20210214 gsvd-benchmarks-6a61522bd8fd74dadf406bef1c628f94b3947216.log

plot-gsvd-benchmark.py.gz

christoph-conrads avatar Feb 14 '21 21:02 christoph-conrads

How consistent is the scaling behaviour observed in the benchmarks? Would it be reliable enough to specify a convenience function (e.g. for SV-only) that switches between xGGSVD3 and xGGQRCS in a way that (mostly) avoids the doubled time for small n,m, but benefits in the larger regime? Because I'd be very tempted to do that as a user (if I had benchmarked things myself), but it would be much nicer if the library already provided that, than having to come with it oneself.

Eyeballing the graphs, I'm thinking of something along the lines of n <= 20 || m/n < 3 or perhaps log(m)/n < ?...

h-vetinari avatar Feb 24 '21 08:02 h-vetinari

Hi @h-vetinari,

Yes this is a good idea.

(1) We could write higher-level wrappers that would call the "best" function. The current structure (clean interface for a stand-alone xGGSVD3 and clean interface for a stand-alone xGGQRCS) is good with me. What we might be missing is higher-level wrappers.

(2) However my preference is, for now, to leave this kind of tuning to higher levels (than LAPACK). Higher levels like Matlab, Octave, Numpy, Scipy, etc. There are many places in LAPACK where this kind of tuning could appear. For example for QR, there is TSQR (for tall and skinny matrices) and there is the standard QR algorithm. When n < a_function(m), TSQR is better than QR. For SYEV, there are four different algorithms (SYEVR, SYEV, SYEVX, SYEVD). Sometimes there are switches from one algorithm to the next when the choice is "obvious", but as soon as the choice gets complicated / machine dependent, these switches are not there.

(3) For now, we could have some comments in the header of these subroutines to give brief explanations to the users and the various choices. Also timings as done by @christoph-conrads are useful to explain to users the trade-off.

(4) For the future, I am not against starting some higher-level wrappers. Like QR(m,n,A,lda). This would leave opportunities for software like MKL, OpenBLAS, to optimize the algorithm used by these functions. We could provide reference implementations for these high-level wrappers. We should look at what is in LAPACKE.

So bottom line, my personal opinion is to leave things as is for now. But it is a possible to-come-soon agenda item. Opinions welcome.

Cheers, Julien.

langou avatar Feb 24 '21 14:02 langou

Nice job @christoph-conrads ! I think we should add tests for the new routines. Maybe we can just replicate the tests for xGGSVD3. What do you think?

weslleyspereira avatar Feb 24 '21 17:02 weslleyspereira

Nice job @christoph-conrads ! I think we should add tests for the new routines. Maybe we can just replicate the tests for xGGSVD3. What do you think?

There are tests in the christoph-conrads/qr+cs-gsvd branch in TESTING/cpp/xGGQRCS_tests.cpp and the tests are passing in Travis CI. The tests are not part of the pull request because they are in C++ and require the Boost C++ libraries.

christoph-conrads avatar Feb 24 '21 18:02 christoph-conrads

Would it be reliable enough to specify a convenience function (e.g. for SV-only) that switches between xGGSVD3 and xGGQRCS in a way that (mostly) avoids the doubled time for small n,m, but benefits in the larger regime?

No, because there is no such regime. Consider the matrix pencil (A, B) below:

    |1 0|      |0 0|
A = |   |  B = |   |
    |0 0|      |0 1|

Judging only by the dimensions, this is the optimal case for xGGQRCS, yet in practice xGGSVD3 will be much faster here.

The xGGSVD3 preprocessing(!) stage will determine after executing two QR factorizations that this matrix pencil has the singular value pairs (1, 0) and (0, 1) (or that it has the singular values zero and infinity). xGGQRCS must actually execute in its entirety to arrive at the same conclusion. The critical property in this example is the nullspaces of the matrices, not their dimensions.

This is of course a mathematical counter-example, but I really doubt that such a switch is reliably possible in practice because there are too many relevant variables, including: number of rows (matrix A), number of rows (matrix B), number of columns, rank of A, rank of B, rank of (A, B), and whether or not to assemble the matrices U1, U2, and X. This list contains only quantities from mathematics. In addition, there are factors like the BLAS implementation, CPU architecture, or the number of cores.

Finally, the major problem with xGGQRCS is its large memory consumption in comparison to xGGSVD3. A method deciding on the fly between xGGQRCS and xGGSVD3 would burden the user with the large workspace requirements of xGGQRCS.

christoph-conrads avatar Feb 24 '21 18:02 christoph-conrads

There are tests in the christoph-conrads/qr+cs-gsvd branch in TESTING/cpp/xGGQRCS_tests.cpp and the tests are passing in Travis CI. The tests are not part of the pull request because they are in C++ and require the Boost C++ libraries.

Yes, I saw these tests. I just thought a quick way to get the new methods covered by the current Lapack/TESTING/EIG would be to adapt CERRGG and CGSVTS3 subroutines. And maybe add your new tests in a different PR.

weslleyspereira avatar Feb 24 '21 18:02 weslleyspereira

There are tests in the christoph-conrads/qr+cs-gsvd branch in TESTING/cpp/xGGQRCS_tests.cpp and the tests are passing in Travis CI. The tests are not part of the pull request because they are in C++ and require the Boost C++ libraries.

Yes, I saw these tests. I just thought a quick way to get the new methods covered by the current Lapack/TESTING/EIG would be to adapt CERRGG and CGSVTS3 subroutines. And maybe add your new tests in a different PR.

(Emphasis mine)

Adapting xGSVTS3 is not trivial because xGGSVD3 computes factorizations of the form A = U C R V^* while xGGQRCS computes A = U C X. I will try to get it done this weekend.

christoph-conrads avatar Feb 24 '21 18:02 christoph-conrads

There are tests in the christoph-conrads/qr+cs-gsvd branch in TESTING/cpp/xGGQRCS_tests.cpp and the tests are passing in Travis CI. The tests are not part of the pull request because they are in C++ and require the Boost C++ libraries.

Yes, I saw these tests. I just thought a quick way to get the new methods covered by the current Lapack/TESTING/EIG would be to adapt CERRGG and CGSVTS3 subroutines. And maybe add your new tests in a different PR.

(Emphasis mine)

Adapting xGSVTS3 is not trivial because xGGSVD3 computes factorizations of the form A = U C R V^* while xGGQRCS computes A = U C X. I will try to get it done this weekend.

*       SUBROUTINE SERRGG( PATH, NUNIT )
*
*       .. Scalar Arguments ..
*       CHARACTER*3        PATH
*       INTEGER            NUNIT
* [snip]
      ELSE IF( LSAMEN( 3, PATH, 'GQR' ) ) THEN
*
*        SGGQRF
*
         SRNAMT = 'SGGQRF'
         INFOT = 1
         CALL SGGQRF( -1, 0, 0, A, 1, R1, B, 1, R2, W, LW, INFO )
* [snip]
*
*        SGGRQF
*
         SRNAMT = 'SGGRQF'

@weslleyspereira Which three-letter identifier should I use? If GQR is reused (xGGQRCS), the code would be tested together with the QR and RQ factorization.

christoph-conrads avatar Feb 24 '21 19:02 christoph-conrads

@weslleyspereira Which three-letter identifier should I use? If GQR is reused (xG_GQR_CS), the code would be tested together with the QR and RQ factorization.

I see... Maybe just xQRCS would be clear enough, no?

weslleyspereira avatar Feb 24 '21 19:02 weslleyspereira

@weslleyspereira Which three-letter identifier should I use? If GQR is reused (xG_GQR_CS), the code would be tested together with the QR and RQ factorization.

I see... Maybe just xQRCS would be clear enough, no?

Think like it's 1977, not 2021:

*       CHARACTER*3        PATH                                                                                                         

christoph-conrads avatar Feb 24 '21 20:02 christoph-conrads

Ah... now I saw your mention of the three-letter PATH. I thought it could be the SRNAMT id, sorry. I would suggest gQR, but the cases don't matter. This takes me to QRC, which doesn't sound good, or UCX, which represents the final decomposition format well but not the routine's name. I don't know...

weslleyspereira avatar Feb 24 '21 20:02 weslleyspereira

How consistent is the scaling behaviour observed in the benchmarks?

(1) We could write higher-level wrappers that would call the "best" function. The current structure (clean interface for a stand-alone xGGSVD3 and clean interface for a stand-alone xGGQRCS) is good with me. What we might be missing is higher-level wrappers.

To have xGGQRCS outperform xGGSVD3 more consistently, xGGQRCS could preprocess the matrices with the function xGGSVP3.

xGGSVP3 computes at most four QR factorizations to detect

  • the common nullspace of the matrices,
  • the singular value pairs (1,0) (singular value infinity), and
  • the singular value pairs (0,1) (singular value zero).

xGGSVP3 reduces the matrices to upper triangular form, which xGGQRCS does not need, but the preprocessing step would stop xGGQRCS from underperforming when the matrices are heavily rank deficient.

christoph-conrads avatar Feb 27 '21 19:02 christoph-conrads

xGGQRCS vs xGGSVD3 with OpenBLAS instead of Netlib BLAS

The plots have a different colormap to distinguish OpenBLAS from Netlib BLAS plots.

Baseline, one thread; looks very similar to Netlib BLAS. (Plot: xGGQRCS-vs-xGGSVD3-OpenBLAS-baseline; data: xGGQRCS-vs-xGGSVD3-OpenBLAS-baseline.txt)

With 8 cores and OpenMP parallelism: xGGSVD3 seems to benefit from parallelism when computing complex-valued matrices; noisy data otherwise. (Plot: xGGQRCS-vs-xGGSVD3-OpenBLAS-core8; data: xGGQRCS-vs-xGGSVD3-OpenBLAS-core8.txt)

The xGGSVD3 "best" case (all singular value pairs are randomly chosen to be (1,0) or (0,1), single-threaded): xGGQRCS can be 100% slower than xGGSVD3; xGGQRCS is still slower when computing matrices in complex arithmetic; I do not understand why these patterns arise. (Plot: xGGQRCS-vs-xGGSVD3-OpenBLAS-01; data: xGGQRCS-vs-xGGSVD3-OpenBLAS-01.txt; patch: lapack-01-gsvd.patch.txt)

Test matrices where 25% of the singular value pairs are (0,1) and 25% are (1,0), single-threaded: xGGQRCS performance relative to xGGSVD3 is much less affected in complex arithmetic; in real arithmetic the xGGQRCS performance gains are delayed and xGGSVD3 may be consistently faster when computing only singular values. (Plot: xGGQRCS-vs-xGGSVD3-OpenBLAS-random01; data: xGGQRCS-vs-xGGSVD3-OpenBLAS-random01.txt; patch: lapack-random01-gsvd.patch.txt)

christoph-conrads avatar Feb 28 '21 19:02 christoph-conrads

@langou FYI it is not possible for me to continue working on this issue before the second half of April.

christoph-conrads avatar Apr 03 '21 14:04 christoph-conrads

One might consider rebasing out the commits that were reverted again directly. Aside from being superfluous for all but the most granular nuances of the development history, the message of (e.g.) https://github.com/Reference-LAPACK/lapack/pull/406/commits/29c19c1ef20a1a0da43d7b88e503b588d96be6ee says it reverts a commit that's not on this branch anymore.

h-vetinari avatar Apr 21 '21 11:04 h-vetinari

One might consider rebasing out the commits that were reverted again directly. Aside from being superfluous for all but the most granular nuances of the development history, the message of (e.g.) 29c19c1 says it reverts a commit that's not on this branch anymore.

There are three reasons why I would like to keep all commits:

  1. If I began squashing, removing, or editing commits, the development history would turn into a novel.
  2. In finite precision arithmetic, the textbook computation of the GSVD with QR factorization and CS decomposition does not work when the input matrices differ significantly in norm. To the best of my knowledge, this problem has not been discussed in the literature. Before commit bf662501323588864b20dac3327e4c2570ed4a67, the scaling problem was resolved by rescaling the matrices, but this modifies the singular values. This commit demonstrates that simply fixing the singular values does not work in finite precision arithmetic (it does work in real arithmetic, though); this was not obvious (to me) because there is no literature.
  3. In finite precision arithmetic, the textbook computation of the GSVD with QR factorization and CS decomposition does not work when the input matrices differ significantly in norm, with the key problem being the QR factorization. Although it uses column pivoting, its computation is not backward stable with respect to the matrices A and B individually, but QR factorization with column pivoting and row sorting is (rows sorted by norm in descending order). Yet the combination of row sorting with column pivoting does not work in practice even though the decomposition has a small element-wise backward error! This is an unexpected result from an analytical point of view, so the commit b53ddbf32d4f930cd28c3c4f7a680b2992f86df2 should stay in case I made an error.

christoph-conrads avatar Apr 21 '21 17:04 christoph-conrads

One might consider rebasing out the commits that were reverted again directly. Aside from being superfluous for all but the most granular nuances of the development history, the message of (e.g.) 29c19c1 says it reverts a commit that's not on this branch anymore.

  4. There is a fourth commit being reverted. At some point in the code, there is a matrix-matrix multiplication and one of the matrices is known to be upper triangular, yet the code employs xGEMM instead of xTRMM. The use of xGEMM should be more expensive than xTRMM because, in addition to doing more work, the lower triangular part must be zeroed. In my experiments, xGEMM was faster, and this is the information gathered from commit 8903e97d8e12473151017fc984a36bd0c6d7f7ed and its revert.

christoph-conrads avatar Apr 21 '21 17:04 christoph-conrads

If I began squashing, removing, or editing commits, the development history turns into a novel.

If you use the fixup option instead of squash, it is as if that commit never happened (however, its contents are kept), so in fact the history gets shorter. Amending is also very good for making adjustments after a commit. This summary is a nice starting point, I believe: https://belev.dev/some-of-the-most-used-git-interactive-rebase-options

ilayn avatar Apr 21 '21 18:04 ilayn

Hi @christoph-conrads !

I received a few comments by email about GGQRCS. Mostly very positive reviews! However, there are some outstanding questions and (1) I do not know the answers to these, and (2) I think it would be good for the community to know your point of view and what you know about all this.

(1) The scaling of A and B in GGQRCS is a great addition. This could also improve other gsvd algorithms where we do the same thing but don't do any scaling. Do you know if there is a paper describing how the weight is inserted back into the decomposition in the end? We can read the code, but the derivation would be great to have. We could then think about using it in other gsvd algorithms and have a paper to reference.

(2) Tolerance for the rank: maybe it would be interesting to have an input to GGQRCS that allows users to set a tolerance for how the rank is determined. (If this input parameter is negative, it could mean a default value is computed internally as is currently done.) On the one hand, since a small change in the tolerance could change the size of the output matrices, it could be useful for users to have this option. On the other hand, since there is some scaling applied (qr([A; wB])), the tolerance could be harder to describe and for a user to choose. For one thing, it might need to be just a scaling (between 0 and 1) applied to the norm of the whole matrix [A; wB].

(3) Representing C and S: Here C and S are two separate vectors, but in CSD2by1, they are defined as a vector of angles. Some could expect that the vector of angles would be less accurate when the angle is close to pi/2. Perhaps the scaling improves this by making it less likely for the scaled problem to have these extreme angles close to 0 and pi/2? Just curious.

(4) Can you clarify what's meant by this statement: "DGGQRCS also offers no guarantees which of the two possible diagonal matrices is used for the matrix factorization."? Which two diagonal matrices are meant here?

(5) To compute CSD2by1, one can call SVD and QR directly. This is stable unless both A and B have more columns than rows (Van Loan 1985). Otherwise, the code "to compute CSD2by1, one can call SVD and QR directly" does something that seems to work but has no proof of any bounds. LAPACK's CSD2by1 can be used instead, but in terms of performance it's hard for CSD2by1 to compete with *GESDD, so it was noticeably slower.

langou avatar Apr 21 '21 18:04 langou