OpenBLAS
Request GEMMT API
Intel MKL provides an additional GEMM variant, GEMMT, that updates only the upper or lower triangular part of the result matrix. This would be a great addition to OpenBLAS.
https://software.intel.com/en-us/node/590047
You mean - to wrap _symm with 2 transforms?
I believe that the difference is more than wrapping SYMM with two transforms, but maybe I am misunderstanding your question.
SYMM is described as C := alpha*A*B + beta*C https://software.intel.com/en-us/node/468488
GEMMT is described as C := alpha*op(A)*op(B) + beta*C https://software.intel.com/en-us/node/590047
with different limitations on A, B, and C. For GEMMT, the upper triangular or lower triangular part of C is overwritten by the respective part of the result.
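To make the distinction concrete, here is a minimal, unoptimized sketch of what GEMMT computes, assuming column-major storage and showing only the no-transpose, lower-triangular case; the function name is just illustrative, not anything from OpenBLAS or MKL:

```c
#include <stddef.h>

/* Reference sketch (not a real implementation):
 * C := alpha*op(A)*op(B) + beta*C, updating only the lower triangle of C.
 * Shown for the no-transpose case with column-major storage:
 * A is n x k, B is k x n, C is n x n; only entries with i >= j are touched. */
static void gemmt_lower_nn_ref(size_t n, size_t k, double alpha,
                               const double *A, size_t lda,
                               const double *B, size_t ldb,
                               double beta, double *C, size_t ldc)
{
    for (size_t j = 0; j < n; ++j) {
        for (size_t i = j; i < n; ++i) {          /* lower triangle only */
            double acc = 0.0;
            for (size_t p = 0; p < k; ++p)
                acc += A[i + p * lda] * B[p + j * ldb];
            C[i + j * ldc] = alpha * acc + beta * C[i + j * ldc];
        }
    }
}
```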
I wrote DGEMMT in Julia and in the no-transpose case it became a plain DSYMM call... It is really a kick-ass job they did with LLVM.
This would be nice! :)
Hi.
I'm curious if anyone's still tuned-in for this request.
To restate the problem, GEMMT computes only a part of GEMM.
It's different from SYMM in my opinion, as neither of the two operands A and B has symmetry restrictions.
Rather, it's more similar to SYR2K, but with only the A*transpose(B) term.
GEMMT is beyond the BLAS standard, but I guess its implementation could be very close to SYRK/HERK?
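For context, without GEMMT a caller has to compute the full product and then keep only one triangle, wasting roughly half the flops. A rough sketch of that fallback, using the standard cblas_dgemm that OpenBLAS already provides (the helper name and the column-major, no-transpose restriction are just for illustration):

```c
#include <stdlib.h>
#include <cblas.h>

/* Hypothetical fallback when no gemmt is available: compute the full n x n
 * product with cblas_dgemm into a scratch buffer, then fold only the lower
 * triangle back into C. Column-major, no-transpose case; error handling
 * omitted for brevity. */
static void gemmt_lower_nn_fallback(int n, int k, double alpha,
                                    const double *A, int lda,
                                    const double *B, int ldb,
                                    double beta, double *C, int ldc)
{
    double *T = malloc((size_t)n * (size_t)n * sizeof *T);
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                n, n, k, alpha, A, lda, B, ldb, 0.0, T, n);
    for (int j = 0; j < n; ++j)          /* keep only the lower triangle */
        for (int i = j; i < n; ++i)
            C[i + j * ldc] = T[(size_t)i + (size_t)j * n] + beta * C[i + j * ldc];
    free(T);
}
```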
Not implemented yet, but not forgotten either. I agree that it does look more like syr2k than symm.
Fwiw, MUMPS solver ( http://mumps.enseeiht.fr/doc/userguide_5.4.1.pdf ) would benefit a lot: "We strongly recommend to use this ability if your BLAS library enables it"
I'd like to second that: implementing the GEMMT feature would be great :).
ReLAPACK provides this: https://github.com/HPAC/ReLAPACK/blob/master/src/dgemmt.c
Thanks for the pointer. I still believe this feature might be offered by the BLAS itself (as it is by other BLAS implementations on the market).
Not the reference one: https://github.com/Reference-LAPACK/lapack/search?q=dgemmt shows no hits.
Agreed, however when looking at MKL and BLIS you can see it supported there, as it enables extra performance on various workloads (the MUMPS direct sparse solver, for instance); it seems like low-hanging fruit.
Also, I understand why you gave the pointer to ReLAPACK, thanks for sharing it.
Lots of low hanging fruit but never enough pickers (happens in real-world orchards too). ReLAPACK is included in OpenBLAS as a build-time option, but the gemmt there is not built by default (even in the original ReLAPACK source, see its config.h). I have not gotten around to checking if Peise's algorithm there actually works and is efficient.
fair point @martin-frbg; I'll stop arguing as I cannot devote time to help :)
Looks like this is the last blocker for 0.3.21, should this have been closed by #3548?
There are comments above comparing gemmt to syr2k. I think it is more similar to syrkx provided by cuBLAS and rocBLAS. See the links below:
- https://docs.nvidia.com/cuda/cublas/#cublas-t-syrkx
- https://docs.amd.com/bundle/rocBLAS-User-Guide---rocBLAS-documentation/page/API_Reference_Guide_80.html#rocblas-xsyrkx-batched-strided-batched
syrkx allows op(A)*op(B)^T with the same op applied to both operands, whereas gemmt allows opa(A)*opb(B) with independent ops. If one thinks in terms of gemm, gemmt allows NN, NT, TN, and TT; syrkx only allows NT or TN.
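To illustrate the relationship, here is a sketch of a syrkx-style operation expressed as a restricted gemmt call. It assumes an MKL-style cblas_dgemmt prototype (layout, uplo, transa, transb, n, k, alpha, A, lda, B, ldb, beta, C, ldc); check the installed cblas.h for the exact signature, and the wrapper name is made up:

```c
#include <cblas.h>

/* Sketch: C := alpha*op(A)*op(B)^T + beta*C on one triangle (syrkx-style),
 * expressed via gemmt. syrkx applies the same op to both operands, so only
 * the NT and TN combinations are reachable; gemmt takes transa and transb
 * independently. Prototype of cblas_dgemmt assumed MKL-compatible. */
static void dsyrkx_via_gemmt(CBLAS_UPLO uplo, CBLAS_TRANSPOSE trans,
                             int n, int k, double alpha,
                             const double *A, int lda,
                             const double *B, int ldb,
                             double beta, double *C, int ldc)
{
    /* trans == NoTrans -> A*B^T (NT); trans == Trans -> A^T*B (TN) */
    CBLAS_TRANSPOSE transb = (trans == CblasNoTrans) ? CblasTrans : CblasNoTrans;
    cblas_dgemmt(CblasColMajor, uplo, trans, transb,
                 n, k, alpha, A, lda, B, ldb, beta, C, ldc);
}
```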
Thanks. GEMMT was added in #3796, to be released with 0.3.22; apparently I neglected to add the backlink to this issue in the PR.