lapack
lapack copied to clipboard
Correctness and efficiency of LAPACKE_xLACPY when matrix_layout is LAPACK_ROW_MAJOR
Description
Going by the discussion here (which I'm summarizing below), it seems that lapacke_slacpy and friends are (i) copying more than twice as much when asked to copy only the lower or upper triangular parts (ii) copying over the uninitialized malloc-ed region back to b.
https://github.com/Reference-LAPACK/lapack/blob/28f7e8309608b92aaec2e2556d4b25d758ccada9/LAPACKE/src/lapacke_slacpy_work.c#L70-L76
This results in the following code working as expected while using LAPACK_COL_MAJOR but not working as expected while using LAPACK_ROW_MAJOR.
Code, compile using gcc test-lacpy.c -llapacke -o test-lacpy:
#include <lapacke.h>
#include <stdio.h>
#define size 9
int main(){
float a[size] = {1,2,3,4,5,6,7,8,9};
float b[size];
for(int i=0; i<size; i++) b[i] = 100;
// Works as expected with LAPACK_COL_MAJOR but not with LAPACK_ROW_MAJOR
LAPACKE_slacpy(LAPACK_COL_MAJOR, 'U', 3, 3, a, 3, b, 3);
for(int i=0; i<3; i++){
for(int j=0; j<3; j++){
printf("%f ", b[i*3+j]);
}
printf("\n");
}
return 0;
}
When compiled with LAPACK_COL_MAJOR, this produces the output:
1.000000 100.000000 100.000000
4.000000 5.000000 100.000000
7.000000 8.000000 9.000000
But when compiled with LAPACK_ROW_MAJOR, this produces the unexpected output (expectation is 100 instead of 0):
1.000000 2.000000 3.000000
0.000000 5.000000 6.000000
0.000000 0.000000 9.000000
Might a native C solution be preferable for this particular function/family?
Checklist
- [x] I've included a minimal example to reproduce the issue
- [ ] I'd be willing to make a PR to solve this issue (EDIT: I'd be willing to make a PR once someone confirms the plan of action.)
PS: Credits to stassats for pointing out the bug!
That's an interesting corner case. Strictly speaking, the code is correct
since B = A in all upper triangular entries, but the "copy over" is likely an expected side effect.
I would support treating B as an in,out variable. Something like:
/* Transpose input matrices */
LAPACKE_sge_trans( matrix_layout, m, n, a, lda, a_t, lda_t );
+ if (LAPACKE_lsame(uplo, 'L') || LAPACKE_lsame(uplo, 'U')) {
+ LAPACKE_sge_trans( matrix_layout, m, n, b, ldb, b_t, ldb_t );
+ }
/* Call LAPACK function and adjust info */
LAPACK_slacpy( &uplo, &m, &n, a_t, &lda_t, b_t, &ldb_t );
In the cases when uplo is L or U, copying should involve copying of half the array. However, going down the path of transposing the inputs and then the outputs, would involve memory allocation, as well as transposing the 2 input arrays, and 1 output array, effectively needing 3.5 array copying to copy what should essentially have required copying just half the array.
This is why I wonder if a native C solution might be preferable for this particular function for the ROW_MAJOR layout: write the function using if-else statements and for-loops and never call the malloc, LAPACKE_sge_trans, LAPACKE_slacpy functions at all.
It also seems possible to call the slacpy function directly with appropriately changed arguments, without the intermediate transposes.