lapack in-place (scaled) matrix transposition: imatcopy

Intel MKL has a useful function for (scaled) in-place transposition:

https://www.intel.com/content/www/us/en/docs/onemkl/developer-reference-fortran/2024-1/mkl-imatcopy.html

I raised the issue of having this function as a utility in LAPACK proper, and Julian (@langou) expressed support for that. I'm making a note of it here so we don't lose track of this.

May 21 '24 15:05 rileyjmurray

I had a look at in-place matrix transposes at some point, but they are far from trivial to implement.

If the matrix is square, it is just a question of swapping A(i,j) and A(j,i). A simple recursive implementation is even reasonably efficient.

If the matrix is not square, the cycle lengths can be longer, so we almost certainly require some memory (not just for an efficient implementation, but for a reference too).

May 22 '24 07:05 thijssteel

Indeed cache-oblivious transpose is really hard to get it right and quite challenging in terms of low-level decisions. But I would be really happy if we get it. I do think it won't be as performant as the copy-transpose but there is still a lot to reap compared to a double loop.

May 22 '24 07:05 ilayn

copy is an odd name and description for an in-place operation. From Intel's docs, "A transposition operation can be a normal matrix copy, a transposition, a conjugate transposition, or just a conjugation. The operation is defined as follows:

AB := alpha*op(AB)."

(emphasis mine)

Since the source and destination are both AB, I gather that "a normal matrix copy" (trans=N) is basically changing the stride in place, from lda to ldb? (And applying scaling.)

Gustavson has several papers on in-place transpose, e.g., https://doi.org/10.1145/2168773.2168775. Something akin to this in implemented in PLASMA.

May 22 '24 15:05 mgates3

@mgates3, while I agree that imatcopy is not a great name, it's a fairly common one:

Intel (MKL/OneAPI): https://spec.oneapi.io/versions/1.2-rev-1/elements/oneMKL/source/domains/blas/imatcopy.html
AMD (AOCL/BLIS): https://github.com/search?q=repo%3Aamd%2Fblis%20imatcopy&type=code
Arm (APL): https://developer.arm.com/documentation/101004/2404/BLAS-Basic-Linear-Algebra-Subprograms/BLAS-Extensions/dimatcopy?lang=en

In situations like these I think it's preferable to go with the crowd, and just have good documentation about what can be done with the function.

@thijssteel and @ilayn: it'd be fine with me to use some workspace. When I said in-place, I really just meant "don't force the user to make a full copy."

May 24 '24 16:05 rileyjmurray

@rileyjmurray Respectfully, I disagree. If Reference BLAS / LAPACK is a (de facto) standard, it should be reasonably self-consistent. imatcopy and omatcopy don't follow the BLAS naming scheme in any way, though the arguments seem conforming. If we propose a standard name, it should be trivial for libraries that have imatcopy to add a new name.

This goes to my point on gemmt and batch gemm, that we as a community need a mechanism to propose and adopt new standard routines.

May 24 '24 18:05 mgates3

@mgates3, fair enough! I don't actually have a strong opinion on what the function should be called.

May 24 '24 18:05 rileyjmurray

https://www.netlib.org/blas/blast-forum/blas-report.pdf suggests to use xGE_TRANS for in-place transpose operation and xGE_COPY for out-place transpose operations.

Feb 10 '25 02:02 foxtran

It doesn't appear that those ge_trans and ge_copy were ever implemented. They aren't in the reference BLAS and LAPACK. However, LAPACKE does implement an out-of-place LAPACKE_zge_trans, among other transpose functions. These implementations seem pretty basic, not sophisticated high-performance implementations.

Feb 10 '25 08:02 mgates3