Implement xGEMMT and cblas_xGEMMT

Open grisuthedragon opened this issue 1 year ago • 30 comments

Description

The xGEMMT function implements the GEMM operation, but only the lower or the upper triangle of C is accessed. The routine is provided by many optimized BLAS libraries but not by the reference implementation. Since projects like MUMPS use it, it is necessary to provide this function in the reference implementation as well. This enables projects that use this function to run in environments that allow exchanging the BLAS library, such as update-alternatives on Debian/Ubuntu or FlexiBLAS.
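For concreteness, the lower-triangle, no-transpose case behaves like the following reference-style sketch (illustrative only, not the code of this PR; column-major storage assumed; a full implementation would also treat beta = 0 specially, as BLAS routines do):

void dgemmt_lower_nn(int n, int k, double alpha,
                     const double *A, int lda,  /* A is n-by-k */
                     const double *B, int ldb,  /* B is k-by-n */
                     double beta, double *C, int ldc)
{
    /* Only the lower triangle (i >= j) of C is read and written. */
    for (int j = 0; j < n; ++j)
        for (int i = j; i < n; ++i) {
            double s = 0.0;
            for (int p = 0; p < k; ++p)
                s += A[i + p*lda] * B[p + j*ldb];
            C[i + j*ldc] = alpha * s + beta * C[i + j*ldc];
        }
}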

The xGEMMT function is provided, for example, by:

  • https://www.ibm.com/docs/en/essl/6.2?topic=mos-dgemmt-combined-matrix-multiplication-addition-general-matrices-their-transposes-conjugate-transposes-but-update-only-upper-lower-triangular-portion-array
  • https://developer.arm.com/documentation/101004/2100/BLAS-Basic-Linear-Algebra-Subprograms/BLAS-Extensions/dgemmt
  • https://github.com/HPAC/ReLAPACK/blob/master/src/dgemmt.c
  • https://www.intel.com/content/www/us/en/docs/onemkl/developer-reference-fortran/2023-1/gemmt.html
  • https://github.com/xianyi/OpenBLAS/pull/3796
  • https://github.com/flame/blis/blob/master/docs/BLISTypedAPI.md

The PR includes the subroutines, the tests, and the CBLAS interface.

Checklist

  • [x] The documentation has been updated.

grisuthedragon avatar Jul 24 '23 15:07 grisuthedragon

Codecov Report

All modified and coverable lines are covered by tests.

Project coverage is 0.00%. Comparing base (69992ad) to head (f90c7ae). Report is 8 commits behind head on master.

Note: current head f90c7ae differs from the pull request's most recent head c57c156.

Please upload reports for the commit c57c156 to get more accurate results.

Additional details and impacted files
@@            Coverage Diff            @@
##           master     #887     +/-   ##
=========================================
  Coverage    0.00%    0.00%             
=========================================
  Files        1937     1918     -19     
  Lines      190566   188614   -1952     
=========================================
+ Misses     190566   188614   -1952     


codecov[bot] avatar Jul 24 '23 15:07 codecov[bot]

Thanks Martin.

Two parts. (1) overall in favor. (2) rant against this specific GEMMT routine. To help me: who knows usage of GEMMT?

(1) Oh my. Adding a routine to the reference Level 3 BLAS? I am overall in favor since GEMMT already exists in many BLAS implementations. And I am happy to see the Level 3 BLAS evolve and adapt.

(2-1) I must say that overall I am not a big fan of this GEMMT routine. I wonder why people need it. It seems popular enough. In many libraries. (Thanks for the survey.) But why do users need it? What are some examples of usage?

(2-2) I would rather see a routine like C <- alpha * A * D * A^T + beta * C where C is n-by-n symmetric on input and D is m-by-m symmetric on input too. (And so A is n-by-m.) That would be useful. In the BLAS Technical Forum Standard, they had BLAS_xSY_TRIDIAG_RK where, in addition to being symmetric, the matrix D was tridiagonal. This BLAS_xSY_TRIDIAG_RK routine would be useful too.
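For intuition, a naive sketch of what such a kernel could look like (my sketch, not the Forum's reference code): column-major storage, lower triangle of C, and D given by its diagonal d[0..m-1] and off-diagonal e[0..m-2]. Note that no buffer for A * D is needed:

void dsy_tridiag_rk_lower(int n, int m, double alpha,
                          const double *A, int lda,       /* A is n-by-m */
                          const double *d, const double *e,
                          double beta, double *C, int ldc)
{
    for (int j = 0; j < n; ++j)
        for (int i = j; i < n; ++i) {                     /* lower triangle */
            double s = 0.0;
            for (int p = 0; p < m; ++p) {
                double adp = A[i + p*lda] * d[p];         /* (A*D)(i,p) */
                if (p > 0)     adp += A[i + (p-1)*lda] * e[p-1];
                if (p < m - 1) adp += A[i + (p+1)*lda] * e[p];
                s += adp * A[j + p*lda];                  /* times A^T(p,j) */
            }
            C[i + j*ldc] = alpha * s + beta * C[i + j*ldc];
        }
}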

(2-3) I wonder whether users who need C <- alpha * A * D * A^T + beta * C first perform a GEMM with B <- 1. * A * D + 0. * B and then a GEMMT with C <- alpha * A * B + beta * C. Well, if that is the case, it would be, in my opinion, really much better to use C <- alpha * A * D * A^T + beta * C directly.
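In CBLAS terms, that two-step pattern would look something like this (the cblas_dgemmt prototype here follows the MKL/OpenBLAS convention and may differ elsewhere; A is n-by-m, D is m-by-m symmetric, B is an n-by-m workspace):

#include <cblas.h>

void ada_t_update(int n, int m, double alpha,
                  const double *A, int lda, const double *D, int ldd,
                  double *B, int ldb, double beta, double *C, int ldc)
{
    /* Step 1: B = 1. * A * D + 0. * B (dsymm could exploit the symmetry of D). */
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                n, m, m, 1.0, A, lda, D, ldd, 0.0, B, ldb);
    /* Step 2: only the lower triangle of C = alpha * B * A^T + beta * C. */
    cblas_dgemmt(CblasColMajor, CblasLower, CblasNoTrans, CblasTrans,
                 n, m, alpha, B, ldb, A, lda, beta, C, ldc);
}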

(2-4) If we use C <- alpha * A * D * A^T + beta * C, we can reduce data movement from A and A^T by "combining" the data movement of A and A^T, as shown in https://doi.org/10.1145/3490148.3538587 https://doi.org/10.1109/SC41404.2022.00034 https://doi.org/10.1145/3558481.3591072, as opposed to using a GEMM followed by a GEMMT.

(2-5) Also, I do not like the fact that the symmetry of A*B^T relies on what A and B are. Argh. I do not like this. When we look at the interface C <- alpha * A * D * A^T + beta * C, C on input is 100% symmetric because we derive the upper part from the lower part, and D on input is 100% symmetric because we derive the upper part from the lower part. So the output C is 100% symmetric. There is no arguing. C on output is symmetric. When we look at C <- alpha * A * B + beta * C, argh, the symmetry of the output C relies on the "quality" of A, the "quality" of B, and the "quality" of A*B. I think there is more uncertainty on whether the computed A*B^T and B*A^T are the same. It should be the same. Is it? With the GEMMT interface, it is far from clear which one you should use. Argh. I think this is error prone, and not good for stability. I do not have good examples. But I do not like it.

(2-6) Speaking of C <- alpha * A * D * A^T + beta * C, I am not sure what a reference algorithm would be. Mmm. Is there a "more stable way" to perform A * D * A^T?

(2-7) One argument that I can see for GEMMT is that the user has a fast way to right-multiply by the symmetric matrix D. So the user wants to do C <- alpha * A * D * A^T + beta * C, but they have a fast way to do B <- A * D. So then yes, GEMMT can be useful.

langou avatar Jul 24 '23 19:07 langou

And let me add: the name. GEMMT. I understand that this is institutionalized in several BLAS libraries. Fine with me. But I find GEMMT a very cryptic name. I would not have guessed the functionality from the GEMMT name.

langou avatar Jul 24 '23 19:07 langou

I share the doubts about adding industry standard extensions to what is (or used to be?) seen as the reference implementation of something that was largely agreed on by working groups decades ago. Though to some extent that would hold for Reference-LAPACK as well - is it the classical reference if old algorithms get upgraded and new ones added, or is it the new and improved UCDenver LAPACK?

martin-frbg avatar Jul 25 '23 07:07 martin-frbg

(2-1) I must say that overall I am not a big fan of this GEMMT routine. I wonder why people need it. It seems popular enough. In many libraries. (Thanks for the survey.) But why do users need it? What are some examples of usage?

At least one library using it is MUMPS 5.6.1, in src/{s,c,d,z}fac_front_aux.F. A quick look shows that they use it in an LDL^T factorization, updating the L matrices. Thus, this seems to be the C <- alpha * A * D * A^T + beta * C approach.

(2-2) I would rather see a routine like C <- alpha * A * D * A^T + beta * C where C is n-by-n symmetric on input and D is m-by-m symmetric on input too. (And so A is n-by-m.) That would be useful. In the BLAS Technical Forum Standard, they had BLAS_xSY_TRIDIAG_RK where, in addition to being symmetric, the matrix D was tridiagonal. This BLAS_xSY_TRIDIAG_RK routine would be useful too.

IMHO, only allowing D to be symmetric tridiagonal is too severe a restriction.

(2-3) I wonder whether users who need C <- alpha * A * D * A^T + beta * C first perform a GEMM with B <- 1. * A * D + 0. * B and then a GEMMT with C <- alpha * A * B + beta * C. Well, if that is the case, it would be, in my opinion, really much better to use C <- alpha * A * D * A^T + beta * C directly.

For most application cases I can think of, the combined variant would be better. For example, when solving a Lyapunov equation with the Bartels-Stewart algorithm and the Schur decomposition, one ends up with these types of updates.

(2-4) If we use C <- alpha * A * D * A^T + beta * C, we can reduce data movement from A and A^T by "combining" the data movement of A and A^T, as shown in https://doi.org/10.1145/3490148.3538587 https://doi.org/10.1109/SC41404.2022.00034 https://doi.org/10.1145/3558481.3591072, as opposed to using a GEMM followed by a GEMMT.

There is also an implementation of this functionality in https://github.com/SLICOT/SLICOT-Reference/blob/main/src/MB01RU.f, which is used in the case I mentioned in (2-3). But that approach is horribly slow.

(2-5) Also, I do not like the fact that the symmetry of A*B^T relies on what A and B are. Argh. I do not like this. When we look at the interface C <- alpha * A * D * A^T + beta * C, C on input is 100% symmetric because we derive the upper part from the lower part, and D on input is 100% symmetric because we derive the upper part from the lower part. So the output C is 100% symmetric. There is no arguing. C on output is symmetric. When we look at C <- alpha * A * B + beta * C, argh, the symmetry of the output C relies on the "quality" of A, the "quality" of B, and the "quality" of A*B. I think there is more uncertainty on whether the computed A*B^T and B*A^T are the same. It should be the same. Is it? With the GEMMT interface, it is far from clear which one you should use. Argh. I think this is error prone, and not good for stability. I do not have good examples. But I do not like it.

(2-6) Speaking of C <- alpha * A * D * A^T + beta * C, I am not sure what a reference algorithm would be. Mmm. Is there a "more stable way" to perform A * D * A^T?

Again, I could mention https://github.com/SLICOT/SLICOT-Reference/blob/main/src/MB01RU.f at this point.

(2-7) One argument that I can see for GEMMT is that the user has a fast way to right-multiply by the symmetric matrix D. So the user wants to do C <- alpha * A * D * A^T + beta * C, but they have a fast way to do B <- A * D. So then yes, GEMMT can be useful.

I implemented the routine for the sake of completeness, since many of the optimized libraries provide it, but not with regard to the meaningful usage, which is truly a medium-to-big problem as you pointed out. On the other hand, ignoring the BLAS-like extensions that are common to almost all of the optimized libraries is not a good way forward either. First, if the reference implementation provides it, there is a better chance that the optimized versions behave the same way. Second, while developing on top of BLAS and LAPACK, many developers (at least the ones I know) switch back to the reference implementation for debugging purposes. Once a BLAS-like extension is used, this requires either code changes or a build system that checks for the availability of the extension. Third, if applications/libraries use BLAS-like extensions, they can easily break library exchange systems like update-alternatives.

grisuthedragon avatar Jul 25 '23 07:07 grisuthedragon

I share the doubts about adding industry standard extensions to what is (or used to be?) seen as the reference implementation of something that was largely agreed on by working groups decades ago. Though to some extent that would hold for Reference-LAPACK as well - is it the classical reference if old algorithms get upgraded and new ones added, or is it the new and improved UCDenver LAPACK?

Even though the BLAS standard is now a very old one, one can slowly adjust it a bit to match the users' changed requirements (as long as no old functionality gets disturbed). Furthermore, there are some bad designs in the reference interface as well that can only be fixed by adding new routines, like the missing second scalar in daxpy (y <- alpha*x + y), which is covered by daxpby (y <- alpha*x + beta*y) in many BLAS implementations and is also not yet in the reference.

grisuthedragon avatar Jul 25 '23 07:07 grisuthedragon

Would it be thinkable that the reference implements C <- alpha * A * D * A^T + beta * C under a non-cryptic name? It would apparently cover a majority of the use cases of GEMMT, and by being the reference it is in the privileged position to show how things should be done properly.

vladimir-ch avatar Jul 25 '23 18:07 vladimir-ch

One major issue with C <- alpha * A * D * A^T + beta * C is that we would want A to be input only.

And if D is tridiagonal, that's easy enough to do. If D is a regular m-by-m matrix, that's not going to work out without allocation. So that's probably the reason for the two alternatives: (1) GEMMT, and (2) BLAS_xSY_TRIDIAG_RK.

I had forgotten this "detail".

langou avatar Jul 25 '23 19:07 langou

Off the top of my head, I can say that GEMMT is used by ABINIT and possibly other Computational Chemistry, Molecular Modelling, and Materials Science packages. I can have a look into other packages that use GEMMT if necessary.

robert-mijakovic avatar Jul 26 '23 10:07 robert-mijakovic

Two additional packages: PETSc and VASP.

robert-mijakovic avatar Jul 26 '23 14:07 robert-mijakovic

After some searching, I came up with the following packages that depend on MUMPS, PETSc, or ABINIT and could possibly benefit from GEMMT.

  1. MUMPS dependent packages: Trilinos, Goma, Palace, xSDK, MFEM, CEED, FrontISTR, Ipopt, HPDDM, FEniCS, z-PARES, Akantu, Tandem, Elmer FEM, Camelia, Clp (Coin-or linear programming), FreeFEM++, Cbc (Coin-or branch and cut), Kwant, OpenSees, TELEMAC-MASCARET, TOY2DAC

  2. PETSc dependent packages: PISM, DAMASK-grid, Sundials, Ratel, SLEPc, AMReX, ExaGO, Sowing, Akantu, libMesh, FreeFEM, openCARP, Alquimia, DAMASK-mesh, PHIST, DFT-FE, ELSI, AMP, libpressio, STEPS (STochastic Engine for Pathway Simulation), FEniCS-DOLFINx, SfePy, Gmsh

  3. ABINIT dependent packages: Yambo, BigDFT

robert-mijakovic avatar Jul 26 '23 15:07 robert-mijakovic

Would it be thinkable that the reference implements C <- alpha * A * D * A^T + beta * C under a non-cryptic name?

I was about to type this a few days ago but got distracted by work. xGEMMT indeed reads like the matmul(m, transpose(m)). I am also from the control theory side and I know why SLICOT uses them extensively since control algos typically love working with quadratic forms.

However, if this function is going to be implemented in the reference, it should actually bring some significant benefit beyond what two implicit DGEMM calls give, if possible, in my opinion.

ilayn avatar Jul 28 '23 10:07 ilayn

I was about to type this a few days ago but got distracted by work. xGEMMT indeed reads like the matmul(m, transpose(m)).

(1) I find the fact that the input/output matrix C is symmetric a significant omission in the GEMMT interface name; I was not expecting any symmetry with a name such as GEMMT. (2) "GEMMT indeed reads like the matmul(m, transpose(m))." I think I understand what you mean, but (2a) if it were matmul(m, transpose(m)), that would be a SYRK, (2b) if it were matmul(m, transpose(m)), we would have only one argument M in the interface and not an A and a B, (2c) GEMMT enables four variants: matmul(a,b), matmul(a, transpose(b)), matmul(transpose(a), b), and matmul(transpose(a), transpose(b)). All in all, I understand that the intent is that it is something like matmul(m, transpose(m)).

In any case, I think, at this point the name of the subroutine is set. The reason to include the routine is that it is widely used. Changing the name now would defeat the reason of including it.

langou avatar Jul 28 '23 14:07 langou

However, if this function is going to be implemented in the reference, it should actually bring some significant benefit beyond what two implicit DGEMM calls give, if possible, in my opinion.

I think by "this function" you mean C <- alpha * A * D * A^T + beta * C for a general D, in which case I strongly agree with you.

Advantage of BLAS_xSY_TRIDIAG_RK over GEMM + GEMMT: If D is tridiagonal, I think this C <- alpha * A * D * A^T + beta * C can bring some benefit over GEMM + GEMMT by (1) reducing data movement of A, (2) avoiding a B buffer to store A * D.

Advantage of GEMMT over GEMM: GEMMT is half the FLOPS of GEMM.
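A rough count, with one multiply plus one add counted as 2 flops, C n-by-n, and inner dimension k:

GEMM  updates all n^2 entries of C:              ~ 2*k*n^2 flops
GEMMT updates one triangle, n*(n+1)/2 entries:   ~ k*n*(n+1), about k*n^2 flops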

langou avatar Jul 28 '23 14:07 langou

I think by "this function" you mean C <- alpha * A * D * A^T + beta * C for a general D, in which case I strongly agree with you.

Sorry for my terse style, I was typing way quicker than I can make sense. Indeed that's what I meant.

I find the fact that the input/output matrix C is symmetric a significant omission in the GEMMT interface name.

Indeed. And fully agree with your following points.

Changing the name now would defeat the reason of including it.

I think we can live with a name change because there is no precedent for it, and like you mentioned, the naming is already not optimal. LAPACK names are already cryptic enough, so in my seriously humble opinion, there is no reason to carry on the legacy of unofficial sources or suboptimal decisions made elsewhere.

There is enough legacy within LAPACK to live with already. Anyone who needs to switch to the reference implementation of this "GEMMT" will need to modify code anyway. Thus I think it gives us the freedom to correct this naming. But then again, that's just my opinion.

Functionality-wise, similar routines might be requested, such as triangular-triangular multiplication, which comes up in Sylvester/Schur-related algorithms. Hence I'm not sure halving the matmul flops is a killer feature. But again, these are all personal takes; if folks are using it, apparently they need it.

ilayn avatar Jul 29 '23 09:07 ilayn

Functionality-wise, similar routines might be requested, such as triangular-triangular multiplication, which comes up in Sylvester/Schur-related algorithms. Hence I'm not sure halving the matmul flops is a killer feature. But again, these are all personal takes; if folks are using it, apparently they need it.

I agree that "similar routines might be requested such as triangular-triangular multiplication". We also had needs for similar "weird" kernels in <T>LAPACK. We were thinking of giving names like matmult_AtrBtrCtr, where the shape of each matrix appears in the interface name. (Here tr stands for triangular, for example.) "I'm not sure halving the matmul flops is a killer feature." FLOPS is one argument, but there are also arguments for (1) less storage, (2) in-place vs. out-of-place, and (3) data access patterns.

langou avatar Jul 31 '23 15:07 langou

Joining the thread because I'm interested in this functionality.

I'm working on an iterative linear system solver with @PTNobel based on a block Krylov subspace method. This method (and many like it) needs to compute small square matrices of the form P' G P where G is PSD. The context is that we'll have already computed GP for other reasons in the algorithm. So we'd benefit from a function to compute P' Q where Q is a general matrix.

I have a feature request for the hypothetical function that exploits the known symmetry of P' Q. In the uplo argument, it would be great to allow a "general" flag so that only one triangle is computed in terms of dot products, but then the result is copied to the remaining triangle. This will ensure explicit symmetry of the output matrix. We find in our applications that computing M = P' Q with GEMM results in a matrix that is symmetric up to a normwise relative error of 5e-16 in double precision, but this difference is apparently enough for one of chol(M, "lower") and chol(M, "upper") to fail when the other succeeds.

(Please excuse the lazy notation above w.r.t. chol.)
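Such a "general" flag could amount to the following post-processing step (a sketch, assuming column-major storage and that the lower triangle of M = P' Q was computed gemmt-style):

/* Mirror the computed lower triangle into the upper triangle so that
   M is exactly (bitwise) symmetric. */
void symmetrize_from_lower(int n, double *M, int ldm)
{
    for (int j = 0; j < n; ++j)
        for (int i = j + 1; i < n; ++i)
            M[j + i*ldm] = M[i + j*ldm];   /* M(j,i) = M(i,j) */
}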

rileyjmurray avatar Aug 10 '23 15:08 rileyjmurray

gemmt was introduced in Intel MKL, circa 2015.

In cuBLAS, similar functionality is called syrkx and herkx, which predated the MKL gemmt function (I'm not sure exactly in which version of CUDA it was introduced).

Likewise in rocBLAS / hipBLAS, they are syrkx and herkx.

herk does

C = A A^H + C

for a Hermitian matrix C, computing only the lower or upper triangle. herkx is an extended version of herk that does

C = A B^H + C

but again computes only the lower or upper triangle. In general, A B^H would be non-Hermitian, but herkx is useful when the user can guarantee that the result is Hermitian. As discussed above, a typical use case is

C = B D B^H, thus A = B D

where D is Hermitian. A specific instance of that is when D is diagonal, e.g., generating a Hermitian matrix from its EVD or computing the backward error of an eigenvalue problem,

A - Q Lambda Q^H.
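For the diagonal case, forming A = B D ahead of the herkx/gemmt call is just a column scaling, along these lines (a sketch, assuming column-major storage, B n-by-k, and d holding the diagonal of D):

/* A = B * diag(d): scale column j of B by d[j], then one triangle of
   C = A B^H + C follows via a herkx/gemmt-style call. */
void scale_columns(int n, int k, const double *B, int ldb,
                   const double *d, double *A, int lda)
{
    for (int j = 0; j < k; ++j)
        for (int i = 0; i < n; ++i)
            A[i + j*lda] = B[i + j*ldb] * d[j];
}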

As a completely different application, it is also useful to compute a triangular matrix, instead of a Hermitian matrix, when the user knows that one of the triangles will be zero. This could be used in larft to generate the T matrix for a block Householder reflector. (The input matrices are still rectangular, not triangular, we just have extra knowledge that the output is triangular.)

I agree the t in gemmt doesn't fit into the BLAS naming scheme at all. Does it mean transpose? Does it mean triangular? I personally like the herkx name, as it signifies that the output matrix is Hermitian, although that doesn't fit the rather odd use case above when the output matrix is triangular, not Hermitian. gemmt is also a bit more general than herkx in that it can specify both transA and transB, whereas herkx has only a single trans.

Also, consider if we had this routine in other formats like packed or RFP: obviously the gemmt name doesn't generalize well, whereas hprkx (packed) or hfrkx (RFP) would make sense.

mgates3 avatar Aug 10 '23 16:08 mgates3

Another use case is in the Polar Decomposition. The QDWH iteration converges to a unitary matrix U, and then computes a Hermitian matrix H = U^H A. Again, we have special knowledge that U^H A is Hermitian. This is distinct from the use cases of A B A^H, with B Hermitian.
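In gemmt terms, recovering H would be a single call along these lines (a sketch; the cblas_zgemmt prototype follows the MKL/OpenBLAS convention, where complex scalars are passed by pointer):

#include <complex.h>
#include <cblas.h>

/* Lower triangle of H = U^H * A; the exact product is Hermitian,
   so the other triangle is redundant. */
void hermitian_polar_factor(int n, const double complex *U, int ldu,
                            const double complex *A, int lda,
                            double complex *H, int ldh)
{
    const double complex one = 1.0, zero = 0.0;
    cblas_zgemmt(CblasColMajor, CblasLower, CblasConjTrans, CblasNoTrans,
                 n, n, &one, U, ldu, A, lda, &zero, H, ldh);
}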

@luszczek suggests herkmm instead of herkx.

mgates3 avatar Aug 11 '23 15:08 mgates3

@mgates3 very nice application! Yes, indeed, if for a given matrix A you know U, the unitary polar factor of A, then you can recover the Hermitian polar factor by doing U^H * A. That's a really neat application. Thanks for bringing this up. And indeed the QDWH iteration algorithm does need GEMMT: it converges to U, the unitary polar factor of A, and so, "after convergence," it computes the Hermitian factor with U^H * A, which is exactly GEMMT.

langou avatar Aug 12 '23 22:08 langou

We need to speak about names.

The PR named the routine GEMMT.

The names HERKX and GEMMT are already used for this functionality in some BLAS libraries.

Then HERKMM is kind of nice and was suggested by @luszczek.

Then I suggested something like HEGEGEMM.

But @mgates3 mentioned that this can be used for triangular matrix too, so maybe we could call this TRGEGEMM.

If you have a strong opinion, write it below.

langou avatar Aug 12 '23 22:08 langou

The triangular application is rather weird and probably rare. How often is A*B a triangular matrix? So I wouldn't name it based on triangular. Interpreting the output as Hermitian (or symmetric) is much more common.

I'm not sure why you're proposing repeating the ge in hegegemm. Is it to specify the type of each matrix A, B, C? We've never done that before, and if we did, it's unclear what the order should be: hegege or gegehe (since C is Hermitian).

hegemm would be okay.

I prefer some variant of herk because that's what it most looks like: the output matrix C is Hermitian, whereas in hemm the input matrix A or B is Hermitian.

mgates3 avatar Aug 14 '23 04:08 mgates3

Hi @grisuthedragon. Would you be able to merge with or rebase onto the current master? The script for the GitHub Actions needed to be updated. Thanks!

weslleyspereira avatar Aug 21 '23 21:08 weslleyspereira

Hi @grisuthedragon. Would you be able to merge with or rebase onto the current master? The script for the GitHub Actions needed to be updated. Thanks!

Done.

grisuthedragon avatar Aug 22 '23 07:08 grisuthedragon

So finally, it seems that only the naming is the problem. I see that, from the point of view of the typical BLAS naming scheme, the GEMMT name is confusing, so I have no problem renaming it.

As far as I can collect from the discussion, there are the following options:

Already in use:

  • xGEMMT, bad from the naming-scheme point of view, but already in MKL, OpenBLAS, ArmPL, ESSL, BLIS
  • xSYRKX/xHERKX, a good name from the BLAS point of view, already in cuBLAS and rocBLAS

Potential names by functionality:

  • xSYGEMM/xHEGEMM
  • xSYRKMM/xHERKMM
  • xSYGEGEMM/xHEGEGEMM

Since the functionality already exists in some implementations, I would go for one of the existing names.

grisuthedragon avatar Aug 30 '23 12:08 grisuthedragon

Please keep in mind that LAPACK is a long-running software library with an established naming scheme. LAPACK also introduces new naming conventions as they're needed for clarity. This new routine is an opportunity to do exactly that: establish the long-needed naming scheme for more complex functionality going forward. Copy-pasting the name from somewhere else is not exactly my idea of good user-facing practice.

The existing names, already present in some of the libraries, can follow an easy path of deprecation that is now established in most vendor software. Also, linker aliasing could be used so the API entry points don't have to be duplicated.

I definitely like the new names that show how to incorporate an existing name with a new application context with wide appeal and a good indication of what the routine does internally. I would encourage using one of them, which will give us new naming space for new routines in the future.

luszczek avatar Aug 30 '23 14:08 luszczek

@sknepper Discussion of interest.

mgates3 avatar Aug 30 '23 16:08 mgates3

If a library already provides this functionality, it could keep providing it under the current name that it uses. And then it would be nice if it could also add the name that comes out of this discussion. Two names for the same functionality in one library is not ideal, but that would solve the backward compatibility issue.

langou avatar Aug 31 '23 15:08 langou

I propose the names GESYRK and GEHERK. GEMMT is like a generalized SYRK / HERK.

amontoison avatar Nov 26 '23 11:11 amontoison

I would vote for either

  • geherk — generalized version of herk
  • hegemm — Hermitian version of gemm (and the sy symmetric versions of these.)

One item for consideration which may influence that choice is whether it takes one trans operation as in herk, or two transA and transB operations as in gemm.

The syrk / herk-like and gemm-like interfaces support these operations, which applications would definitely use:

  • C = A B^T + C // syrk( trans=N ), gemm( transA=N, transB=T )
  • C = A^T B + C // syrk( trans=T ), gemm( transA=T, transB=N )
  • C = A B^H + C // herk( trans=N ), gemm( transA=N, transB=C )
  • C = A^H B + C // herk( trans=C ), gemm( transA=C, transB=N )

The gemm-like interface additionally supports these operations:

  • C = A B + C // gemm( transA=N, transB=N )
  • C = A^T B^T + C // gemm( transA=T, transB=T )
  • C = A^H B^H + C // gemm( transA=C, transB=C )
  • C = A^T B^H + C // gemm( transA=T, transB=C )
  • C = A^H B^T + C // gemm( transA=C, transB=T )

It's unclear to me whether applications would use those latter 5 operations, and especially the last 2 operations (mixed T, C). That's a lot of extra cases to implement, optimize, and test.

With [sdcz]gemmt, there's just one routine that can handle both Hermitian and symmetric cases with appropriate choice of transA and transB. If we have [cz]hegemm and [sdcz]sygemm, what is the distinction between them? (Remembering there are complex-symmetric matrices in addition to complex-Hermitian.) Is the only difference which trans[AB] are supported by each?

Does [cz]hegemm / [cz]geherk guarantee that diagonal elements are real? herk guarantees this.

For [cz]hegemm / [cz]geherk, are alpha and beta required to be real? [cz]herk requires real alpha and beta; [cz]syrk has complex alpha and beta. Currently, [cz]gemmt has complex alpha and beta. Interestingly, cublasZherkx has complex alpha but real beta.

I'm not in favor of herkmm as it mixes 2 operations (rk and mm), so doesn't fit well into the BLAS naming scheme.

Nor hegegemm; there's already a routine that takes a Hermitian matrix and 2 general matrices, in that order; it's called hemm. More generally, BLAS hasn't specified the types of all 3 matrices in the name, so that seems like a larger departure from past practice.

mgates3 avatar Dec 04 '23 14:12 mgates3