Implement xGEMMT and cblas_xGEMMT
Description
The xGEMMT function implements the GEMM operation, but only the lower or the upper triangle of C is accessed. The routine is provided by many optimized BLAS libraries but not by the reference implementation. Since projects like MUMPS use it, it is necessary to provide this function in the reference implementation as well. This enables projects that use this function to run in environments that allow exchanging the BLAS library, like update-alternatives on Debian/Ubuntu or FlexiBLAS.
The xGEMMT function is provided, for example, by:
- https://www.ibm.com/docs/en/essl/6.2?topic=mos-dgemmt-combined-matrix-multiplication-addition-general-matrices-their-transposes-conjugate-transposes-but-update-only-upper-lower-triangular-portion-array
- https://developer.arm.com/documentation/101004/2100/BLAS-Basic-Linear-Algebra-Subprograms/BLAS-Extensions/dgemmt
- https://github.com/HPAC/ReLAPACK/blob/master/src/dgemmt.c
- https://www.intel.com/content/www/us/en/docs/onemkl/developer-reference-fortran/2023-1/gemmt.html
- https://github.com/xianyi/OpenBLAS/pull/3796
- https://github.com/flame/blis/blob/master/docs/BLISTypedAPI.md
The PR includes the subroutines, the tests, and the CBLAS interface.
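For readers unfamiliar with the routine, here is a minimal sketch of how a call could look. The argument order (UPLO, TRANSA, TRANSB, N, K, ALPHA, A, LDA, B, LDB, BETA, C, LDC) is an assumption based on the existing vendor interfaces (MKL, OpenBLAS, ArmPL); the program name and dimensions are made up for illustration.

```fortran
! Illustrative only: compute the lower triangle of C := A*B with B = A^T,
! so that the product is symmetric by construction.
program gemmt_sketch
  implicit none
  integer, parameter :: n = 4, k = 3
  double precision :: a(n,k), b(k,n), c(n,n)
  external dgemmt

  call random_number(a)
  b = transpose(a)      ! makes A*B symmetric
  c = 0.0d0

  ! Only the lower ('L') triangle of C is referenced and updated.
  call dgemmt('L', 'N', 'N', n, k, 1.0d0, a, n, b, k, 0.0d0, c, n)
end program gemmt_sketch
```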
Checklist
- [x] The documentation has been updated.
Codecov Report
All modified and coverable lines are covered by tests.
Project coverage is 0.00%. Comparing base (69992ad) to head (f90c7ae). Report is 8 commits behind head on master.
Current head f90c7ae differs from pull request most recent head c57c156. Please upload reports for the commit c57c156 to get more accurate results.
Additional details and impacted files
@@ Coverage Diff @@
## master #887 +/- ##
=========================================
Coverage 0.00% 0.00%
=========================================
Files 1937 1918 -19
Lines 190566 188614 -1952
=========================================
+ Misses 190566 188614 -1952
Thanks Martin.
Two parts. (1) overall in favor. (2) rant against this specific GEMMT routine. To help me: who knows usage of GEMMT?
(1) Oh my. Adding a routine in the reference Level 3 BLAS? I am overall in favor since GEMMT already exists in many BLAS implementations. And I am happy to see the Level 3 BLAS evolve and adapt.
(2-1) I must say that overall I am not a big fan of this GEMMT routine. I wonder why people need it. It seems popular enough. In many libraries. (Thanks for the survey.) But why do users need it? What are some examples of usage?
(2-2) I would rather see a routine like C <- alpha * A * D * A^T + beta * C, where C is n-by-n symmetric on input and D is m-by-m symmetric on input too. (And so A is n-by-m.) That would be useful. In the BLAS Technical Forum Standard, they had BLAS_xSY_TRIDIAG_RK where, in addition to D being symmetric, the matrix D was symmetric tridiagonal. This BLAS_xSY_TRIDIAG_RK routine would be useful too.
(2-3) I wonder whether users need C <- alpha * A * D * A^T + beta * C and then they first perform GEMM with B <= 1. * A * D + 0. * B and then GEMMT with C <- alpha * B * A^T + beta * C. (A sketch of this two-step pattern follows below.) Well, if that's the case, it would be really, in my opinion, much better to use C <- alpha * A * D * A^T + beta * C directly.
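A minimal sketch of this two-step GEMM + GEMMT pattern, assuming the vendor-style DGEMMT argument order (UPLO, TRANSA, TRANSB, N, K, ALPHA, A, LDA, B, LDB, BETA, C, LDC); the program name, dimensions, and the workspace name W are illustrative only:

```fortran
! Illustrative only: C := alpha*A*D*A^T + beta*C (lower triangle of C),
! done as W := A*D with DGEMM, then C := alpha*W*A^T + beta*C with DGEMMT.
program ada_t_sketch
  implicit none
  integer, parameter :: n = 5, m = 3
  double precision :: a(n,m), d(m,m), w(n,m), c(n,n)
  double precision, parameter :: alpha = 1.0d0, beta = 0.0d0
  external dgemm, dgemmt

  call random_number(a)
  call random_number(d)
  d = 0.5d0 * (d + transpose(d))   ! make D symmetric
  c = 0.0d0

  ! Step 1: workspace W = A*D (full GEMM, needs the extra n-by-m buffer).
  call dgemm('N', 'N', n, m, m, 1.0d0, a, n, d, m, 0.0d0, w, n)
  ! Step 2: lower triangle of C = alpha*W*A^T + beta*C (half the GEMM flops).
  call dgemmt('L', 'N', 'T', n, m, alpha, w, n, a, n, beta, c, n)
end program ada_t_sketch
```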
(2-4) If we use C <- alpha * A * D * A^T + beta * C, we can reduce data movement from A and A^T by "combining" the data movement of A and A^T, as shown in
https://doi.org/10.1145/3490148.3538587
https://doi.org/10.1109/SC41404.2022.00034
https://doi.org/10.1145/3558481.3591072
as opposed to using a GEMM followed by a GEMMT.
(2-5) Also I do not like the fact that the symmetry of A*B^T relies on what A and B are. Argh. I do not like this. When we look at the interface C <- alpha * A * D * A^T + beta * C, C on input is 100% symmetric because we assume the upper part from the lower part, and D on input is 100% symmetric because we assume the upper part from the lower part. So the output C is 100% symmetric. There is no arguing. C on output is symmetric. When we look at C <- alpha * A * B + beta * C, argh, the symmetry of the output C relies on the "quality" of A, the "quality" of B, and the "quality" of A*B. I think there is more uncertainty on whether A*B^T and B^T * A are the same. It should be the same. Is it? With the GEMMT interface, it is far from clear which one you should use. Argh. I think this is error prone, and not good for stability. I do not have good examples. But I do not like it.
(2-6) Speaking of C <- alpha * A * D * A^T + beta * C, I am not sure what a reference algorithm would be. Mmm. Is there a "more stable way" to perform A * D * A^T?
(2-7) One argument that I can see for GEMMT is that the users have a fast way to right-multiply the symmetric matrix D by A. So the user wants to do C <- alpha * A * D * A^T + beta * C, but they have a fast way to do B <= A * D. So then yes, GEMMT can be useful.
And let me add: the name. GEMMT. I understand that this is institutionalized in several BLAS libraries. Fine with me. But I find GEMMT a very cryptic name. I would not have guessed the functionality from the GEMMT name.
I share the doubts about adding industry standard extensions to what is (or used to be?) seen as the reference implementation of something that was largely agreed on by working groups decades ago. Though to some extent that would hold for Reference-LAPACK as well - is it the classical reference if old algorithms get upgraded and new ones added, or is it the new and improved UCDenver LAPACK ?
(2-1) I must say that overall I am not a big fan of this GEMMT routine. I wonder why people need it. It seems popular enough. In many libraries. (Thanks for the survey.) But why do users need it? What are some examples of usage?
At least one library using it is MUMPS 5.6.1, in src/{s,c,d,z}fac_front_aux.F. After a short look, it shows that they are using this in an LDL^T factorization, updating the L matrices. Thus, this seems to be the C <- alpha * A * D * A^T + beta * C approach.
(2-2) I would rather see a routine like C <- alpha * A * D * A^T + beta * C, where C is n-by-n symmetric on input and D is m-by-m symmetric on input too. (And so A is n-by-m.) That would be useful. In the BLAS Technical Forum Standard, they had BLAS_xSY_TRIDIAG_RK where, in addition to D being symmetric, the matrix D was symmetric tridiagonal. This BLAS_xSY_TRIDIAG_RK routine would be useful too.
IMHO, only allowing D to be symmetric tridiagonal is too large a restriction.
(2-3) I wonder whether users need C <- alpha * A * D * A^T + beta * C and then they first perform GEMM with B <= 1. * A * D + 0. * B and then GEMMT with C <- alpha * B * A^T + beta * C. Well, if that's the case, it would be really, in my opinion, much better to use C <- alpha * A * D * A^T + beta * C directly.
For most application cases I can think of, the combined variant would be better. For example, when solving a Lyapunov equation with the Bartels-Stewart algorithm and the Schur decomposition, one ends up with this type of update.
(2-4) If we use C <- alpha * A * D * A^T + beta * C, we can reduce data movement from A and A^T by "combining" the data movement of A and A^T, as shown in https://doi.org/10.1145/3490148.3538587 https://doi.org/10.1109/SC41404.2022.00034 https://doi.org/10.1145/3558481.3591072, as opposed to using a GEMM followed by a GEMMT.
There is also an implementation of this functionality in https://github.com/SLICOT/SLICOT-Reference/blob/main/src/MB01RU.f, which is used in the case I mentioned in (2-3). But that approach is horribly slow.
(2-5) Also I do not like the fact that the symmetry of A*B^T relies on what A and B are. Argh. I do not like this. When we look at the interface C <- alpha * A * D * A^T + beta * C, C on input is 100% symmetric because we assume the upper part from the lower part, and D on input is 100% symmetric because we assume the upper part from the lower part. So the output C is 100% symmetric. There is no arguing. C on output is symmetric. When we look at C <- alpha * A * B + beta * C, argh, the symmetry of the output C relies on the "quality" of A, the "quality" of B, and the "quality" of A*B. I think there is more uncertainty on whether A*B^T and B^T * A are the same. It should be the same. Is it? With the GEMMT interface, it is far from clear which one you should use. Argh. I think this is error prone, and not good for stability. I do not have good examples. But I do not like it.
(2-6) Speaking of C <- alpha * A * D * A^T + beta * C, I am not sure what a reference algorithm would be. Mmm. Is there a "more stable way" to perform A * D * A^T?
Again, I could mention https://github.com/SLICOT/SLICOT-Reference/blob/main/src/MB01RU.f at this point.
(2-7) One argument that I can see for GEMMT is that the users have a fast way to right-multiply the symmetric matrix D by A. So the user wants to do C <- alpha * A * D * A^T + beta * C, but they have a fast way to do B <= A * D. So then yes, GEMMT can be useful.
I implemented the routine for the sake of completeness, since many of the optimized libraries provide it, not because of a particularly meaningful usage, which, as you pointed out, is truly a medium-to-big problem. On the other hand, ignoring the BLAS-like extensions that are common to almost all the optimized libraries is not a good way either. First, if the reference implementation provides it, one has a better chance that the optimized versions behave the same way. Second, while developing on top of BLAS and LAPACK, many developers (at least the ones I know) switch back to the reference implementation for debugging purposes. Once a BLAS-like extension is used, this either requires code changes or a build system that checks for the availability of the extension. Third, if applications/libraries use BLAS-like extensions, they can easily break library exchange mechanisms like update-alternatives.
I share the doubts about adding industry standard extensions to what is (or used to be?) seen as the reference implementation of something that was largely agreed on by working groups decades ago. Though to some extent that would hold for Reference-LAPACK as well - is it the classical reference if old algorithms get upgraded and new ones added, or is it the new and improved UCDenver LAPACK ?
Even though the BLAS standard is now a very old one, one can slowly adjust it a bit to match the changed requirements of the users (as long as no old functionality gets disturbed). Furthermore, there are some bad designs in the reference interface as well that can only be fixed by adding new routines, like the missing second scalar in daxpy, which is covered by daxpby in many BLAS implementations and which is also not yet in the reference.
Would it be thinkable that the reference implements C <- alpha * A * D * A^T + beta * C under a non-cryptic name? It would apparently cover a majority of the use cases of GEMMT, and by being the reference it is in the privileged position to show how things should be done properly.
One major issue with C <- alpha * A * D * A^T + beta * C is that we would want A to be input only. And if D is tridiagonal, that's easy enough to do. If D is a regular m-by-m matrix, that's not going to work out without allocation. So that's probably the reason for the two alternatives: (1) GEMMT, and (2) BLAS_xSY_TRIDIAG_RK.
I had forgotten this "detail".
From the top of my head, I can say that GEMMT is used by ABINIT and possibly other Computational Chemistry, Molecular Modelling, and Materials Science packages. I can have a look into other packages that use GEMMT if necessary.
Two additional packages: PETSc and VASP.
After some search, I came up with the following packages that depend on MUMPS, PETSc, and ABINIT which could possibly benefit from GEMMT.
- MUMPS-dependent packages: Trilinos, Goma, Palace, xSDK, MFEM, CEED, FrontISTR, Ipopt, HPDDM, FEniCS, z-PARES, Akantu, Tandem, Elmer FEM, Camelia, Clp (Coin-or linear programming), FreeFEM++, Cbc (Coin-or branch and cut), Kwant, OpenSees, TELEMAC-MASCARET, TOY2DAC
- PETSc-dependent packages: PISM, DAMASK-grid, Sundials, Ratel, SLEPc, AMReX, ExaGO, Sowing, Akantu, libMesh, FreeFEM, openCARP, Alquimia, DAMASK-mesh, PHIST, DFT-FE, ELSI, AMP, libpressio, STEPS (STochastic Engine for Pathway Simulation), FEniCS-DOLFINx, SfePy, Gmsh
- ABINIT-dependent packages: Yambo, BigDFT
Would it be thinkable that the reference implements C <- alpha * A * D * A^T + beta * C under a non-cryptic name?
I was about to type this a few days ago but got distracted by work. xGEMMT indeed reads like the matmul(m, transpose(m)). I am also from the control theory side, and I know why SLICOT uses these extensively, since control algorithms typically love working with quadratic forms.
However, if this function is going to be implemented in the reference, it should, in my opinion, actually bring in some significant benefit beyond implicit 2x DGEMM calls, if possible.
I was about to type this a few days ago but got distracted by work. xGEMMT indeed reads like the matmul(m, transpose(m)).
(1) I find the fact that the input/output matrix C is symmetric is a significant omission in the GEMMT interface name; I was not expecting any symmetry with a name such as GEMMT.
(2) "GEMMT indeed reads like the matmul(m, transpose(m))." I think I understand what you mean, but (2a) if it were a matmul(m, transpose(m)), that would be a SYRK, (2b) if it were matmul(m, transpose(m)), we would have only one argument M in the interface and not an A and a B, (2c) GEMMT enables four variants: matmul(a,b), matmul(a, transpose(b)), matmul(transpose(a),b) and matmul(transpose(a),transpose(b)). All in all, I understand that the intent is that it is something like a matmul(m, transpose(m)).
In any case, I think, at this point the name of the subroutine is set. The reason to include the routine is that it is widely used. Changing the name now would defeat the reason of including it.
However if this function is going to be implemented in the reference, it should actually bring in some significant benefits beyond an implicit 2xDGEMM calls if possible in my opinion.
I think by "this function" you mean C <- alpha * A * D * A^T + beta * C for a general D, in which case I strongly agree with you.
Advantage of BLAS_xSY_TRIDIAG_RK over GEMM + GEMMT: if D is tridiagonal, I think this C <- alpha * A * D * A^T + beta * C can bring some benefit over GEMM + GEMMT by (1) reducing data movement of A, (2) avoiding a B buffer to store A * D.
Advantage of GEMMT over GEMM: GEMMT is half the FLOPS of GEMM.
I think by "this function" you mean C <- alpha * A * D * A^T + beta * C for a general D, in which I case I strongly agree with you.
Sorry for my terse style, I was typing way quicker than I can make sense. Indeed that's what I meant.
I find the fact that the input/output matrix C is symmetric is a significant omission in the GEMMT interface name,
Indeed. And fully agree with your following points.
Changing the name now would defeat the reason of including it.
I think we can live with a name change because there is no precedent for it, and, like you mentioned, the naming is already not optimal. LAPACK names are already cryptic enough, so in my seriously humble opinion there is no reason to carry on the legacy of unofficial sources or suboptimal decisions made elsewhere.
There is enough legacy within LAPACK to live with already. Anyone who needs to switch to the reference implementation of this "GEMMT" will need to modify code anyway. Thus I think it gives us the freedom to correct this naming. But then again, that's just my opinion.
Functionality-wise similar routines might be requested such as triangular-triangular multiplication that comes in Sylvester/Schur related algorithms. Hence I'm not sure halving the matmul flops is a killer feature. But again, these are all personal takes, if folks are using it apparently they need it.
Functionality-wise similar routines might be requested such as triangular-triangular multiplication that comes in Sylvester/Schur related algorithms. Hence I'm not sure halving the matmul flops is a killer feature. But again, these are all personal takes, if folks are using it apparently they need it.
I agree that "similar routines might be requested such as triangular-triangular multiplication". We also had needs for similar "weird" kernels in <T>LAPACK. We were thinking of giving names like matmult_AtrBtrCtr, where each matrix has its shape in the interface. (Here tr for triangular, for example.) "I'm not sure halving the matmul flops is a killer feature." FLOPS is one argument, but there are also arguments for (1) less storage, (2) in-place vs out-of-place, and (3) access of data.
Joining the thread because I'm interested in this functionality.
I'm working on an iterative linear system solver with @PTNobel based on a block Krylov subspace method. This method (and many like it) needs to compute small square matrices of the form P' G P where G is PSD. The context is that we'll have already computed GP for other reasons in the algorithm. So we'd benefit from a function to compute P' Q where Q is a general matrix.
I have a feature request for the hypothetical function that exploits the known symmetry of P' Q. In the uplo argument, it would be great to allow a "general" flag so that only one triangle is computed in terms of dot products, but then the result is copied to the remaining triangle. This will ensure explicit symmetry of the output matrix. We find in our applications that computing M = P' Q with GEMM results in a matrix that is symmetric up to a normwise relative error of 5e-16 in double precision, but this difference is apparently enough for one of chol(M, "lower") and chol(M, "upper") to fail when the other succeeds.
(Please excuse the lazy notation above w.r.t. chol.)
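A sketch of what the requested "general" behaviour could look like, built from the proposed DGEMMT plus an explicit mirroring loop. The subroutine name symm_ptq is hypothetical, and the DGEMMT argument order is assumed to follow the existing vendor interfaces:

```fortran
! Illustrative only: M := P^T * Q (n-by-n), with P and Q both k-by-n.
! The lower triangle is computed once with DGEMMT, then mirrored so the
! result is symmetric to the last bit.
subroutine symm_ptq(n, k, p, ldp, q, ldq, m, ldm)
  implicit none
  integer, intent(in) :: n, k, ldp, ldq, ldm
  double precision, intent(in)  :: p(ldp,*), q(ldq,*)
  double precision, intent(out) :: m(ldm,*)
  integer :: i, j
  external dgemmt

  ! Only the lower triangle of M is referenced and written here.
  call dgemmt('L', 'T', 'N', n, k, 1.0d0, p, ldp, q, ldq, 0.0d0, m, ldm)

  ! Copy the lower triangle into the upper triangle.
  do j = 2, n
     do i = 1, j - 1
        m(i,j) = m(j,i)
     end do
  end do
end subroutine symm_ptq
```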
gemmt was introduced in Intel MKL, circa 2015.
In cuBLAS, similar functionality is called syrkx and herkx, which predated the MKL gemmt function (I'm not sure exactly what version of CUDA it was introduced in).
Likewise in rocBLAS / hipBLAS, they are syrkx and herkx.
herk does C = A A^H + C for a Hermitian matrix C, computing only the lower or upper triangle. herkx is an extended version of herk that does C = A B^H + C but again computes only the lower or upper triangle. In general, A B^H would be non-Hermitian, but herkx is useful when the user can guarantee that the result is Hermitian. As discussed above, a typical use case is C = B D B^H, thus A = B D, where D is Hermitian. A specific instance of that is when D is diagonal, e.g., generating a Hermitian matrix from its EVD or computing the backward error of an eigenvalue problem, A - Q Lambda Q^H.
As a completely different application, it is also useful to compute a triangular matrix, instead of a Hermitian matrix, when the user knows that one of the triangles will be zero. This could be used in larft to generate the T matrix for a block Householder reflector. (The input matrices are still rectangular, not triangular; we just have extra knowledge that the output is triangular.)
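For the diagonal-D case mentioned above (reconstructing a symmetric matrix from its EVD), here is a hedged sketch in real double precision. The subroutine name is made up, and the DGEMMT argument order is assumed to follow the vendor interfaces; scaling the columns of Q by the eigenvalues replaces the first GEMM, so only the triangular update remains:

```fortran
! Illustrative only: A := Q * diag(lambda) * Q^T with only the lower
! triangle of A formed; W = Q*diag(lambda) is a cheap column scaling,
! so no full GEMM is needed before the triangular update.
subroutine evd_reconstruct(n, q, ldq, lambda, a, lda)
  implicit none
  integer, intent(in) :: n, ldq, lda
  double precision, intent(in)  :: q(ldq,*), lambda(*)
  double precision, intent(out) :: a(lda,*)
  double precision, allocatable :: w(:,:)
  integer :: j
  external dgemmt

  allocate(w(n,n))
  do j = 1, n
     w(:,j) = lambda(j) * q(1:n,j)    ! W = Q * diag(lambda)
  end do

  ! Lower triangle of A := W * Q^T = Q * diag(lambda) * Q^T.
  call dgemmt('L', 'N', 'T', n, n, 1.0d0, w, n, q, ldq, 0.0d0, a, lda)
  deallocate(w)
end subroutine evd_reconstruct
```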
I agree the t in gemmt doesn't fit into the BLAS naming scheme at all. Does it mean transpose? Does it mean triangular? I personally like the herkx name, as it signifies that the output matrix is Hermitian, although that doesn't fit the rather odd use case above where the output matrix is triangular, not Hermitian. gemmt is also a bit more general than herkx in that it can specify both transA and transB, whereas herkx has only a single trans.
Also consider if we had this routine in other formats like packed or RFP: obviously the gemmt name doesn't generalize well, whereas hprkx (packed) or hfrkx (RFP) would make sense.
Another use case is in the Polar Decomposition. The QDWH iteration converges to a unitary matrix U, and then computes a Hermitian matrix H = U^H A. Again, we have special knowledge that U^H A is Hermitian. This is distinct from the use cases of A B A^H, with B Hermitian.
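A possible sketch of that polar-decomposition step, assuming a ZGEMMT with the same argument order as the vendor DGEMMT; the subroutine name is hypothetical:

```fortran
! Illustrative only: given the unitary polar factor U of A (e.g. from QDWH),
! form the lower triangle of the Hermitian polar factor H := U^H * A.
subroutine polar_hermitian_factor(n, u, ldu, a, lda, h, ldh)
  implicit none
  integer, intent(in) :: n, ldu, lda, ldh
  complex*16, intent(in)  :: u(ldu,*), a(lda,*)
  complex*16, intent(out) :: h(ldh,*)
  external zgemmt

  ! 'C' requests the conjugate transpose of U; only the lower triangle
  ! of H is referenced and updated.
  call zgemmt('L', 'C', 'N', n, n, (1.0d0, 0.0d0), u, ldu, a, lda, &
              (0.0d0, 0.0d0), h, ldh)
end subroutine polar_hermitian_factor
```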
@luszczek suggests herkmm instead of herkx.
@mgates3 very nice application! Yes, indeed, if for a given matrix A, you know U, the unitary polar factor of A, then you can recover the Hermitian polar factor by doing U^H * A. That's a really neat application. Thanks for bringing this up. And indeed the QDWH iterations algorithm does need GEMMT. It converges to U, the unitary polar factor of A, and so, "after convergence," it computes the Hermitian factor with U^H * A, which is exactly GEMMT. This is a very nice application. Thanks for mentioning.
We need to speak about names.
The PR named the routine GEMMT.
The names HERKX and GEMMT are already used for this functionality in some BLAS libraries.
Then HERKMM is kind of nice and was suggested by @luszczek.
Then I suggested something like HEGEGEMM.
But @mgates3 mentioned that this can be used for triangular matrix too, so maybe we could call this TRGEGEMM.
If you have a strong opinion, write it below.
The triangular application is rather weird and probably rare. How often is A*B a triangular matrix? So I wouldn't name it based on triangular. Interpreting the output as Hermitian (or symmetric) is much more common.
I'm not sure why you're proposing repeating the ge in hegegemm. Is it to specify the type of each matrix A, B, C? We've never done that before, and if we did, it's unclear what the order should be: hegege or gegehe (since C is Hermitian).
hegemm would be okay.
I prefer some variant of herk because that's what it most looks like: the output matrix C is Hermitian, whereas in hemm the input matrix A or B is Hermitian.
Hi @grisuthedragon. Would you be able to merge to/rebase with the current master? The script for the Github Actions needed to be updated. Thanks!
Hi @grisuthedragon. Would you be able to merge to/rebase with the current master? The script for the Github Actions needed to be updated. Thanks!
Done.
So finally, it seems that only the naming is the problem. I see that, from the typical BLAS naming scheme point of view, the GEMMT name is confusing, so I do not have a problem with renaming it.
As far as I can gather from the discussion, there are the following options:
Already in use:
- xGEMMT, bad from the naming-scheme point of view, but in MKL, OpenBLAS, ArmPL, ESSL, BLIS
- xSYRKX / xHERKX, good name from the BLAS point of view, already in CUDA and rocBLAS
Potential names by functionality:
- xSYGEMM / xHEGEMM
- xSYRKMM / xHERKMM
- xSYGEGEMM / xHEGEGEMM
Since the functionality already exists in some implementations, I would go for one of the existing names.
Please keep in mind that LAPACK is a long-running software library with an established naming scheme. LAPACK also introduces new naming conventions as they are needed for clarity. This new routine is an opportunity to do exactly that: establish the long-needed naming scheme for more complex functionality going forward. Copy-pasting the name from somewhere else is not exactly my idea of good user-facing practice.
The existing names, already present in some of the libraries, can follow an easy path of deprecation that is now established in most vendor software. Also, linker aliasing could be used so the API entry points don't have to be duplicated.
I definitely like one of the new names that shows how to incorporate an existing name with a new application context with wide appeal and a good indication of what the routine does internally. I would encourage using one of them, which will give us new naming space for new routines in the future.
@sknepper Discussion of interest.
If a library already provides this functionality, then this library could keep on providing this functionality under the current name that they use. And then it would be nice if they can add the name that comes out of this discussion in addition. Two names for the same functionality in one library is not ideal, but that would solve the backward compatibility issue.
I propose the names GESYRK and GEHERK. GEMMT is like a generalized SYRK / HERK.
I would vote for either
- geherk — generalized version of herk
- hegemm — Hermitian version of gemm
(and the sy symmetric versions of these.)
One item for consideration which may influence that choice is whether it takes one trans operation as in herk, or two transA and transB operations as in gemm.
The syrk / herk-like and gemm-like interfaces support these operations, which applications would definitely use:
- C = A B^T + C // syrk( trans=N ), gemm( transA=N, transB=T )
- C = A^T B + C // syrk( trans=T ), gemm( transA=T, transB=N )
- C = A B^H + C // herk( trans=N ), gemm( transA=N, transB=C )
- C = A^H B + C // herk( trans=C ), gemm( transA=C, transB=N )
The gemm-like interface additionally supports these operations:
- C = A B + C // gemm( transA=N, transB=N )
- C = A^T B^T + C // gemm( transA=T, transB=T )
- C = A^H B^H + C // gemm( transA=C, transB=C )
- C = A^T B^H + C // gemm( transA=T, transB=C )
- C = A^H B^T + C // gemm( transA=C, transB=T )
It's unclear to me if applications would use those latter 5 operations, and especially the last 2 operations (mixed T, C). That's a lot of extra cases to implement, optimize, and test.
With [sdcz]gemmt, there's just one routine that can handle both Hermitian and symmetric cases with appropriate choice of transA and transB. If we have [cz]hegemm and [sdcz]sygemm, what is the distinction between them? (Remembering there are complex-symmetric matrices in addition to complex-Hermitian.) Is the only difference which trans[AB] are supported by each?
Does [cz]hegemm / [cz]geherk guarantee that diagonal elements are real? herk guarantees this.
For [cz]hegemm / [cz]geherk, are alpha and beta required to be real? [cz]herk requires real alpha and beta; [cz]syrk has complex alpha and beta. Currently, [cz]gemmt has complex alpha and beta. Interestingly, cublasZherkx has complex alpha but real beta.
I'm not in favor of herkmm as it mixes 2 operations (rk and mm), so it doesn't fit well into the BLAS naming scheme.
Nor hegegemm; there's already a routine that takes a Hermitian matrix and 2 general matrices, in that order; it's called hemm. More generally, BLAS hasn't specified the types of all 3 matrices in the name, so that seems like a larger departure from past practice.