
LAPACKE library configuration

Open ianmccul opened this issue 1 year ago • 14 comments

It looks like cytnx assumes that LAPACKE will be available with OpenBLAS. While OpenBLAS does include LAPACKE bindings, it appears that Ubuntu doesn't include LAPACKE in libopenblas.so; instead there is a separate liblapacke-dev package that provides liblapacke.so.

ChatGPT suggested the following change:

diff --git a/CytnxBKNDCMakeLists.cmake b/CytnxBKNDCMakeLists.cmake
index fc631672..0254ad7d 100644
--- a/CytnxBKNDCMakeLists.cmake
+++ b/CytnxBKNDCMakeLists.cmake
@@ -11,24 +11,51 @@ endif() #use_mkl
 ######################################################################
 ### Find BLAS and LAPACK
 ######################################################################
+
 if( NOT (DEFINED BLAS_LIBRARIES AND DEFINED LAPACK_LIBRARIES))
   if (USE_MKL)
-    #set(BLA_VENDOR Intel10_64ilp)
+    # MKL-specific setup (if using Intel MKL)
     set(BLA_VENDOR Intel10_64_dyn)
     find_package( BLAS REQUIRED)
     find_package( LAPACK REQUIRED)
-    #find_package(MKL REQUIRED)
     target_link_libraries(cytnx PUBLIC ${LAPACK_LIBRARIES})
     message( STATUS "LAPACK found: ${LAPACK_LIBRARIES}" )
   else()
+    # OpenBLAS setup
     set(BLA_VENDOR OpenBLAS)
     find_package( BLAS REQUIRED)
     find_package( LAPACK REQUIRED)
-    target_link_libraries(cytnx PUBLIC ${LAPACK_LIBRARIES})
+
+    # Test if LAPACKE is available in OpenBLAS (check for LAPACKE symbols)
+    # OpenBLAS should include LAPACKE, so we just link it and check if LAPACKE symbols exist
+
+    target_link_libraries(cytnx PUBLIC ${BLAS_LIBRARIES})
+
+    # Check if LAPACKE symbols are found in the linked BLAS library
+    # (execute_process pipes successive COMMANDs together, so nm | grep must
+    #  be written as two COMMAND clauses rather than a shell pipeline)
+    execute_process(
+      COMMAND nm -D ${BLAS_LIBRARIES}
+      COMMAND grep -q LAPACKE_zheev
+      RESULT_VARIABLE lapacke_check_result
+    )
+
+    if(NOT lapacke_check_result EQUAL 0)
+      # If LAPACKE symbols are not found in OpenBLAS, search for liblapacke separately
+      find_library(LAPACKE_LIB NAMES lapacke PATHS /usr/lib/x86_64-linux-gnu)
+
+      if(LAPACKE_LIB)
+        target_link_libraries(cytnx PUBLIC ${LAPACKE_LIB})
+        message(STATUS "LAPACKE found: ${LAPACKE_LIB}")
+      else()
+        message(WARNING "LAPACKE not found. Make sure it's installed if needed.")
+      endif()
+    else()
+      message(STATUS "LAPACKE symbols found in OpenBLAS.")
+    endif()
+
     message( STATUS "LAPACK found: ${LAPACK_LIBRARIES}" )
   endif()
 
 else()
+  # If LAPACK_LIBRARIES and BLAS_LIBRARIES are already defined, just link them
   set(LAPACK_LIBRARIES  ${BLAS_LIBRARIES}  ${LAPACK_LIBRARIES})
   message( STATUS "LAPACK found: ${LAPACK_LIBRARIES}")
   target_link_libraries(cytnx PUBLIC ${LAPACK_LIBRARIES})

However, this looks a bit dodgy to me: running the nm command to see whether the appropriate symbol is included in the BLAS library. It would probably be better to use something like FindLAPACKE.cmake from https://github.com/isl-org/Open3D/blob/main/3rdparty/cmake/FindLAPACKE.cmake . That file is all over GitHub; I am not sure if there is a 'canonical' version. But I think there is little advantage to using LAPACKE vs plain LAPACK -- the handling of row/column major is slightly trickier, but it only has to be done once in the wrapper, and it isn't really a problem.

ianmccul avatar Nov 22 '24 07:11 ianmccul

Yes, I am aware of this issue. (See also https://github.com/Cytnx-dev/Cytnx/issues/473#issuecomment-2351646127)

I don't think there is a 'canonical' version of FindLAPACKE. One simple way (which I have seen in some projects) is to ask CMake to find 'lapacke.h'. However, it might find a 'lapacke.h' that does not come with OpenBLAS.

My original thinking is: officially, we only support MKL and OpenBLAS, so we could consider including OpenBLAS as a submodule. If you compile OpenBLAS, the source code includes all the necessary header files. The downside is that it will increase the compile time. However, I cannot find any good/canonical solution for locating the necessary header files.

pcchen avatar Nov 22 '24 08:11 pcchen

My suggestion is to remove the dependence on lapacke.h. Just use the generic BLAS functions. If a specific BLAS library is configured that has some extensions, then make use of it, but otherwise any BLAS (even reference BLAS) should "just work".
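
For what it's worth, here is a minimal sketch (not Cytnx code; it assumes the usual trailing-underscore Fortran name mangling, and the heev name is just illustrative) of what calling plain LAPACK looks like with no lapacke.h at all:

#include <algorithm>
#include <complex>
#include <vector>

// Fortran LAPACK symbol, assuming the common trailing-underscore mangling.
extern "C" void zheev_(const char* jobz, const char* uplo, const int* n,
                       std::complex<double>* a, const int* lda, double* w,
                       std::complex<double>* work, const int* lwork,
                       double* rwork, int* info);

// Eigen-decomposition of a Hermitian matrix stored column-major in 'a'.
// On exit, 'a' holds the eigenvectors and 'w' the ascending eigenvalues.
int heev(int n, std::complex<double>* a, int lda, double* w) {
   int info = 0;
   int lwork = -1;  // first call: workspace-size query
   std::complex<double> wkopt;
   std::vector<double> rwork(std::max(1, 3 * n - 2));
   zheev_("V", "U", &n, a, &lda, w, &wkopt, &lwork, rwork.data(), &info);
   lwork = static_cast<int>(wkopt.real());
   std::vector<std::complex<double>> work(std::max(1, lwork));
   zheev_("V", "U", &n, a, &lda, w, work.data(), &lwork, rwork.data(), &info);
   return info;  // 0 on success
}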

ianmccul avatar Nov 22 '24 08:11 ianmccul

Somehow I have never managed to compile successfully when I only install the reference BLAS/LAPACK...

pcchen avatar Nov 22 '24 09:11 pcchen

Reference LAPACK won't work because the code expects to use lapacke.h and the LAPACKE wrappers.

Implementing wrappers that use the original LAPACK functions isn't difficult though.

ianmccul avatar Nov 22 '24 09:11 ianmccul

Well, when I say "isn't difficult", I actually mean "it is a real pain, but you only need to do it once". So wrap the basic BLAS/LAPACK calls in a sensible interface, and use the wrapper, rather than calling LAPACKE directly. This is a bit of work to write the wrappers, but once they are done it is easier to use than LAPACKE.

I have wrappers for BLAS 1/2/3 and most of LAPACK (probably everything cytnx uses, and more), although it would need some reformatting for cytnx coding style. Also needs some discussion as to what exactly the interface should look like.

ianmccul avatar Nov 22 '24 09:11 ianmccul

What is the reason that it will be easier to use than LAPACKE?

pcchen avatar Nov 22 '24 09:11 pcchen

LAPACKE is a thin wrapper around LAPACK that handles interchange of C <-> Fortran row/column major ordering. It is still quite difficult to use. The Cytnx lapack functions require that the tensor is in contiguous row-major format, with the leading dimension of the matrix equal to $N$. This shouldn't be necessary, but I can see why it is done that way, because it simplifies calling the LAPACKE functions.

Wrapper functions, even if you use LAPACKE, can do a much better job, because you can isolate the code that handles the stride pattern of the data, handle the leading dimension properly, and e.g. automatically switch between row/column major if the indices have been permuted (currently cytnx will make a copy and transpose the tensor in that case). They can also be generic, so that you have a wrapper, say gemm(), that works for float, double, complex<float>, and complex<double>, and then much of the linear algebra code can use a single template function to get all combinations (4x code reduction!).

You only want to write a wrapper once for each BLAS/LAPACK function, so whether that wrapper uses LAPACKE vs plain LAPACK does not really matter. It might as well just use plain LAPACK. (You just need to get into a Fortran mindset when you write the wrapper!)
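
To make this concrete, here is a sketch of what such wrappers could look like (illustrative names only, not an actual Cytnx interface): overloads of gemm() for each scalar type, all forwarding to the plain Fortran symbols, again assuming the trailing-underscore mangling.

extern "C" {
void sgemm_(const char* transa, const char* transb, const int* m, const int* n,
            const int* k, const float* alpha, const float* a, const int* lda,
            const float* b, const int* ldb, const float* beta, float* c, const int* ldc);
void dgemm_(const char* transa, const char* transb, const int* m, const int* n,
            const int* k, const double* alpha, const double* a, const int* lda,
            const double* b, const int* ldb, const double* beta, double* c, const int* ldc);
}

namespace BLAS {
// Column-major C = alpha*op(A)*op(B) + beta*C, exactly as in Fortran.
inline void gemm(char ta, char tb, int m, int n, int k, float alpha,
                 const float* a, int lda, const float* b, int ldb,
                 float beta, float* c, int ldc)
{ sgemm_(&ta, &tb, &m, &n, &k, &alpha, a, &lda, b, &ldb, &beta, c, &ldc); }

inline void gemm(char ta, char tb, int m, int n, int k, double alpha,
                 const double* a, int lda, const double* b, int ldb,
                 double beta, double* c, int ldc)
{ dgemm_(&ta, &tb, &m, &n, &k, &alpha, a, &lda, b, &ldb, &beta, c, &ldc); }
// (cgemm_/zgemm_ overloads for std::complex follow the same pattern.)
} // namespace BLAS

// One template now covers every scalar type that has a gemm overload:
template <typename T>
void matmul(int m, int n, int k, const T* a, int lda,
            const T* b, int ldb, T* c, int ldc)
{ BLAS::gemm('N', 'N', m, n, k, T(1), a, lda, b, ldb, T(0), c, ldc); }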

ianmccul avatar Nov 22 '24 09:11 ianmccul

As Ors Legeza described in his talk, to get good performance on HPC with symmetry-adapted codes (where the individual blocks are small), you need to use every trick you can: coalescing data along a common dimension (so that an individual block might have a leading dimension >> N), avoiding explicit transposes (so automatically calling LAPACK with the transposed T/N flags swapped), and making use of non-zero strides (strided/batched GEMM, etc.). Handling this at the site of every call to a LAPACK function is insane; it needs some intelligent wrappers.
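
As one example of the swapped-flags trick, here is a hedged, self-contained sketch (the function name is just illustrative): a row-major matrix with leading dimension ld is exactly a column-major view of its transpose, so a row-major product C = A*B is one column-major dgemm call computing C^T = B^T * A^T, with no data movement.

extern "C" void dgemm_(const char* transa, const char* transb, const int* m,
                       const int* n, const int* k, const double* alpha,
                       const double* a, const int* lda, const double* b,
                       const int* ldb, const double* beta, double* c, const int* ldc);

// Row-major C(m x n) = A(m x k) * B(k x n) with a single column-major dgemm:
// swap the operand order so the call computes C^T = B^T * A^T.
void matmul_row_major(int m, int n, int k,
                      const double* a, int lda,   // lda >= k
                      const double* b, int ldb,   // ldb >= n
                      double* c, int ldc) {       // ldc >= n
   const double one = 1.0, zero = 0.0;
   dgemm_("N", "N", &n, &m, &k, &one, b, &ldb, a, &lda, &zero, c, &ldc);
}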

ianmccul avatar Nov 22 '24 09:11 ianmccul

See also https://github.com/kokkos/stdBLAS

I have been following the development of this library a bit -- it is very interesting, and also proposed for C++26. I would be a bit surprised if it is accepted -- it is really big, and I can't imagine that the compiler vendors will want to implement it (maybe Clang can use the reference implementation(?), but GNU would re-implement it; so would Microsoft, but they are rich). I am also a bit hesitant to use it myself, mainly because I want more control over the underlying BLAS library calls and integration with cuBLAS etc. There is a separate proposal for some C++ linear algebra classes (although I think the intention is that the linear algebra would use stdBLAS, but maybe not in the initial implementation).

But the mdspan that stdBLAS uses is very useful (and part of C++23!). I am not using it now, but I want to. mdspan is a 'view' of a block of memory (like std::span in C++20), and it is flexible enough that it can map exactly onto the storage layouts supported by BLAS, and anything else you can imagine. Wrappers taking an mdspan can check that the layouts of the arguments are compatible, allocate some temporary storage if not, and forward to the actual BLAS/LAPACK function. It can also support strided batched layouts that would map onto cuBLAS gemmStridedBatched, and probably generic discontiguous batched layouts that map to gemmBatched if you wanted (but there are probably better approaches to that).
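
As a sketch of the idea (using the kokkos reference mdspan; the header and std::experimental namespace below are the names that implementation has traditionally provided, so treat them as assumptions, and none of this is existing cytnx code), a wrapper can classify a 2-d view for a Fortran BLAS call like this:

#include <cstddef>
#include <experimental/mdspan>  // kokkos/mdspan reference implementation

namespace stdex = std::experimental;

using dmat = stdex::mdspan<double, stdex::dextents<std::size_t, 2>,
                           stdex::layout_stride>;

// Returns 'N' if the view is column-major (columns contiguous), 'T' if it is
// row-major (rows contiguous, so pass it as its transpose), or 0 if neither
// dimension is contiguous and the wrapper must copy to a temporary first.
inline char blas_trans(const dmat& a, int& ld) {
   if (a.stride(0) == 1) { ld = static_cast<int>(a.stride(1)); return 'N'; }
   if (a.stride(1) == 1) { ld = static_cast<int>(a.stride(0)); return 'T'; }
   return 0;
}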

ianmccul avatar Nov 22 '24 11:11 ianmccul

In the old uni10, we hand-coded the C wrappers to interact with the LAPACK Fortran interface directly. The issue was that this required careful checking of the wrappers. I also recall some issues with the complex type.

A temporary solution might be to include lapacke.h and related files from netlib in the include directories: https://www.netlib.org/lapack/

yingjerkao avatar Nov 22 '24 15:11 yingjerkao

We also need to be careful here. LAPACKE was the general solution on conda, and we are moving away from conda now (the reason is apparent: the whole community is moving away). I'd love to see an alternative.

Then my next questions are twofold:

  1. These new C++ features are nice, but we still want to be compatible with at least C++20. Or, if we don't care, let's just use the latest C++, since no one except us will use it anyway.
  2. Another issue: if there is better stuff out there then I'd love to just use it, but is stdBLAS actually "std"? It's a very good project, but is it actually in the standard? Personally, I've been eager to see something like that as linalg in std, or at least something like numpy become a standard in C++, but so far from what I've heard there is nothing there yet.

kaihsin avatar Nov 23 '24 06:11 kaihsin

> LAPACKE is a thin wrapper around LAPACK that handles interchange of C <-> Fortran row/column major ordering. It is still quite difficult to use. The Cytnx lapack functions require that the tensor is in contiguous row-major format, with the leading dimension of the matrix equal to N. This shouldn't be necessary, but I can see why it is done that way, because it simplifies calling the LAPACKE functions.
>
> Wrapper functions, even if you use LAPACKE, can do a much better job, because you can isolate the code that handles the stride pattern of the data, handle the leading dimension properly, and e.g. automatically switch between row/column major if the indices have been permuted (currently cytnx will make a copy and transpose the tensor in that case). They can also be generic, so that you have a wrapper, say gemm(), that works for float, double, complex<float>, and complex<double>, and then much of the linear algebra code can use a single template function to get all combinations (4x code reduction!).
>
> You only want to write a wrapper once for each BLAS/LAPACK function, so whether that wrapper uses LAPACKE vs plain LAPACK does not really matter. It might as well just use plain LAPACK. (You just need to get into a Fortran mindset when you write the wrapper!)

Btw, this is how numpy does it now; they have an abstraction for switching between these conventions. FYI

kaihsin avatar Nov 23 '24 06:11 kaihsin

> We also need to be careful here. LAPACKE was the general solution on conda, and we are moving away from conda now (the reason is apparent: the whole community is moving away). I'd love to see an alternative.
>
> Then my next questions are twofold:
>
> 1. These new C++ features are nice, but we still want to be compatible with at least C++20. Or, if we don't care, let's just use the latest C++, since no one except us will use it anyway.

I think C++17 is fine; it is too early to switch to C++20. I do not propose using stdBLAS at all -- I think it is probably a good solution for 'ordinary' programs, but not for an HPC library that wants full control over the execution environment. See also the discussion on using C++17/20 parallel algorithms at https://github.com/Cytnx-dev/Cytnx/issues/511#issuecomment-2495437334 for my opinion on that.

I do think that mdspan is absolutely the way to go. I do not propose to use C++23 std::mdspan (which is not yet in the gcc stdlib, even in C++23 mode), but instead use https://github.com/kokkos/mdspan , which is header-only. A Tensor is then basically some storage, together with an mdspan that describes the data layout.
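
Roughly (a sketch of the shape of the idea, not proposed cytnx code; same std::experimental namespace assumption as above):

#include <cstddef>
#include <experimental/mdspan>  // kokkos/mdspan, header-only
#include <vector>

namespace stdex = std::experimental;

// A rank-R tensor is just owned storage plus an mdspan view that records the
// extents and (possibly permuted/strided) layout of that storage.
template <typename T, std::size_t R>
struct Tensor {
   std::vector<T> storage;
   stdex::mdspan<T, stdex::dextents<std::size_t, R>, stdex::layout_stride> view;
};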

> 2. Another issue: if there is better stuff out there then I'd love to just use it, but is stdBLAS actually "std"? It's a very good project, but is it actually in the standard? Personally, I've been eager to see something like that as linalg in std, or at least something like numpy become a standard in C++, but so far from what I've heard there is nothing there yet.

It is a proposal for C++26, still very much experimental in status. There is a separate proposal for a matrix class, https://github.com/cplusplus/papers/issues/169 , which hasn't been updated for a while; I am not sure where it is going. But some of the design is a good model to follow.

ianmccul avatar Nov 23 '24 13:11 ianmccul

> Btw, this is how numpy does it now; they have an abstraction for switching between these conventions. FYI

My own wrappers do this too, e.g. from https://github.com/mptoolkit/mptoolkit/blob/12d09b4aeaf67f059e0e2d97efb77b2b25971fdc/linearalgebra/matrixproductblas_impl.cc#L57-L122

If the storage layout isn't contiguous in one of the dimensions, then it makes a copy of the matrix. This is slightly ugly because there are 8 different combinations of the left-hand side and the two right-hand sides. But the other dimension doesn't have to be contiguous, so you can multiply ranges and slices of matrices (a slice will work without a copy if it is in the leading dimension, but it needs to copy for a slice in the trailing dimension, because then it isn't contiguous). When it calls dgemm(), it just has to pick the right mode for the layout of the matrices.

   // (excerpt) choose the dgemm argument order based on the stored layout of lhs
   if (blas_trans_col(lhs) == 'T')
   {
      BLAS::dgemm(blas_trans_row(m2), blas_trans_row(m1), size2(m2), size1(m1), size2(m1),
                  1, data(m2), leading_dimension(m2),
                  data(m1), leading_dimension(m1),
                  0, data(lhs), leading_dimension(lhs));
   }
   else
   {
      BLAS::dgemm(blas_trans_col(m1), blas_trans_col(m2), size1(m1), size2(m2), size2(m1),
                  1, data(m1), leading_dimension(m1),
                  data(m2), leading_dimension(m2),
                  0, data(lhs), leading_dimension(lhs));
   }

You can write code like

int main() {
   Matrix<double> M1(10,10);
   Matrix<double> M2(5,5);

   Matrix<double> M3(10,5);
   M3(range(2,7),all) = M1(range(0,5), range(0,5)) * trans(M2);   // dgemm, no temporaries, no copies
}

ianmccul avatar Nov 23 '24 13:11 ianmccul