Trilinos icon indicating copy to clipboard operation
Trilinos copied to clipboard

Tpetra: compilation errors in build with MKL enabled

Open maartenarnst opened this issue 2 years ago • 8 comments

@trilinos/tpetra @trilinos/kokkos-kernels

Description

What went wrong?

We are seeing compilation errors in a Trilinos build with MKL enabled. The compilation errors seem to follow from changes that were introduced in the recent kokkos snapshot in PR #10962.

The error messages look like

In file included from ~/Trilinos/packages/kokkos-kernels/src/sparse/impl/KokkosSparse_spgemm_mkl_impl.hpp:49,
from ~/Trilinos/packages/kokkos-kernels/src/sparse/impl/KokkosSparse_spgemm_numeric_spec.hpp:59,
from ~/Trilinos/packages/kokkos-kernels/src/sparse/KokkosSparse_spgemm_numeric.hpp:48,
from ~/home/costmo-user/Trilinos/packages/kokkos-kernels/src/sparse/KokkosSparse_spgemm.hpp:47,
from ~/Trilinos/packages/tpetra/core/ext/TpetraExt_MatrixMatrix_def.hpp:66,
from ~/Trilinos/packages/tpetra/core/ext/TpetraExt_MatrixMatrix.cpp:49:

~/Trilinos/packages/kokkos-kernels/src/sparse/KokkosSparse_Utils_mkl.hpp: 
In instantiation of 'class KokkosSparse::Impl::MKLSparseMatrix<long long int>':

~/Trilinos/packages/kokkos-kernels/src/sparse/KokkosSparse_Utils_mkl.hpp:126:58: error: static assertion failed: Scalar type used in MKLSparseMatrix<value_type> is NOT supported by MKL
    126 |   static_assert(mkl_is_supported_value_type<value_type>::value,

Do you have an idea what might fix things?

It seems Tpetra wants to instantiate several classes using int and long long int as scalar type. This appears to cause problems here in sparse GEMM calls when MKL is enabled. It seems that kokkos-kernels is deciding whether or not to direct sparse GEMM calls to MKL by using runtime conditionals in src/sparse/impl/KokkosSparse_spgemm_numeric_spec.hpp. But, PR #10962 changed how function calls are directed to MKL and introduced a compile-time assert to check that the scalar type is supported by MKL (which int and long long int, seemingly, are not).

One way of solving this issue might be to replace the compile-time assert with runtime conditionals. Changing KokkosSparse_Utils_mkl.hpp as in the following patch solved the compilation errors in our build:

From d79bd6181a5449d4f558276165af4766f671349c Mon Sep 17 00:00:00 2001
From: Maarten Arnst <[email protected]>
Date: Sat, 17 Sep 2022 11:13:34 +0200
Subject: [PATCH] Change compile-time assert into run-time check in
 MKLSparseMatrix.

---
 .../src/sparse/KokkosSparse_Utils_mkl.hpp        | 16 ++++++++++------
 1 file changed, 10 insertions(+), 6 deletions(-)

diff --git a/packages/kokkos-kernels/src/sparse/KokkosSparse_Utils_mkl.hpp b/packages/kokkos-kernels/src/sparse/KokkosSparse_Utils_mkl.hpp
index b9eb3a9bd25..4abeeca88c1 100644
--- a/packages/kokkos-kernels/src/sparse/KokkosSparse_Utils_mkl.hpp
+++ b/packages/kokkos-kernels/src/sparse/KokkosSparse_Utils_mkl.hpp
@@ -123,16 +123,16 @@ template <typename value_type>
 class MKLSparseMatrix {
   sparse_matrix_t mtx;
 
-  static_assert(mkl_is_supported_value_type<value_type>::value,
-                "Scalar type used in MKLSparseMatrix<value_type> is NOT "
-                "supported by MKL");
-
  public:
   inline MKLSparseMatrix(sparse_matrix_t mtx_) : mtx(mtx_) {}
 
   // Constructs MKL sparse matrix from KK sparse views (m rows x n cols)
   inline MKLSparseMatrix(const MKL_INT num_rows, const MKL_INT num_cols,
-                         MKL_INT *xadj, MKL_INT *adj, value_type *values);
+                         MKL_INT *xadj, MKL_INT *adj, value_type *values) {
+    throw std::runtime_error(
+        "Scalar type used in MKLSparseMatrix<value_type> is NOT "
+                "supported by MKL");                       
+  }
 
   // Allows using MKLSparseMatrix directly in MKL calls
   inline operator sparse_matrix_t() const { return mtx; }
@@ -140,7 +140,11 @@ class MKLSparseMatrix {
   // Exports MKL sparse matrix contents into KK views
   inline void export_data(MKL_INT &num_rows, MKL_INT &num_cols,
                           MKL_INT *&rows_start, MKL_INT *&columns,
-                          value_type *&values);
+                          value_type *&values) {
+    throw std::runtime_error(
+        "Scalar type used in MKLSparseMatrix<value_type> is NOT "
+                "supported by MKL");   
+  }
 
   inline void destroy() {
     KOKKOSKERNELS_MKL_SAFE_CALL(mkl_sparse_destroy(mtx));
-- 
2.25.1

maartenarnst avatar Sep 18 '22 09:09 maartenarnst

Can you post your configure?

csiefer2 avatar Sep 18 '22 16:09 csiefer2

Hi @csiefer2. Thanks for looking into this!

This is our configure:

CMakePresets.json.txt

It looks like GitHub doesn't allow .json files to be uploaded. So that's why I added .txt to the filename.

maartenarnst avatar Sep 19 '22 05:09 maartenarnst

Hi @csiefer2. Thanks again for solving the companion issue about the Kokkos initialisation. Just wanted to ask if you should already have had a chance to look into this MKL issue? Thank you very much in advance!

maartenarnst avatar Sep 29 '22 15:09 maartenarnst

@maartenarnst We don't test the OneAPI intel compilers, so I'm not at all surprised it doesn't work. I have a system with oneAPI installed, so I'll add MKL to that and see if I can get it to work.

csiefer2 avatar Oct 03 '22 17:10 csiefer2

Doesn't the class need a specialization for the int types? Seems this isn't a oneAPI issue, it's just enabling MKL Sparse matrices?

https://github.com/trilinos/Trilinos/blob/develop/packages/kokkos-kernels/src/sparse/KokkosSparse_Utils_mkl.hpp

jjellio avatar Oct 03 '22 18:10 jjellio

@jjellio One of the ATDM tests uses MKL with Intel 19, but not with the KokkosKernels MKL interfaces, so that could be the thing.

Building now...

csiefer2 avatar Oct 03 '22 20:10 csiefer2

@maartenarnst I can reproduce now and it looks like a "something configure should have noticed and told you that you can't build this combination of options" kind of problem.

I'll try to get the @trilinos/kokkos-kernels folks in on this.

@lucbv agreed to take a look.

csiefer2 avatar Oct 03 '22 21:10 csiefer2

Thanks, to all of you, for looking into this!

The patch in the original post may go into the direction that @jjellio is suggesting. The patch is basically providing a default implementation for MKLSparseMatrix. This default implementation throws runtime errors (replacing the static assert that is currently in the code). The specialisations for all the scalar types that MKL supports are left unchanged.

maartenarnst avatar Oct 04 '22 18:10 maartenarnst

Just wanted to follow up on this issue. @lucbv, have you already had a chance to take a look at it? It seems the issue was also encountered in #11189. Thanks.

maartenarnst avatar Nov 02 '22 10:11 maartenarnst

Hi, I am also facing the exact same issue with MKL and Intel oneAPI compilers. However I can add that it was working fine as of 74f5468. Is there another workaround in the meantime?

nrsharan avatar Nov 04 '22 09:11 nrsharan

Adding @brian-kelley

ndellingwood avatar Nov 15 '22 20:11 ndellingwood

@lucbv have you had a chance to look at this? It is starting to become critical for @hkthorn.

ccober6 avatar Nov 15 '22 21:11 ccober6

@maartenarnst I replicated this issue and verified that your changes fix it. They look good to me; I opened #11278 with them.

@ndellingwood I put you as a reviewer on that. If you agree it looks good, I will open an equivalent PR against KokkosKernels develop so this can go in 3.7.01.

brian-kelley avatar Nov 15 '22 23:11 brian-kelley

@brian-kelley thanks for the fast PR! Approved with automerge label applied

ndellingwood avatar Nov 16 '22 00:11 ndellingwood

It appears that a few PRs have been merged related to this. Can this issue be closed or are there other required pieces?

ccober6 avatar Nov 16 '22 21:11 ccober6

@ccober6 Xyce testing of this change in the Trilinos development branch happens once Charon blesses SHA-1 tag through their Trilinos development branch testing. I will add a comment when the chain of events produces a clean Xyce build/test against Trilinos development so this can be closed. Is that okay?

hkthorn avatar Nov 16 '22 21:11 hkthorn

@hkthorn Yes, let's make sure it works for you. What is the time frame until the Xyce build/test?

ccober6 avatar Nov 16 '22 21:11 ccober6

@ccober6 Our Trilinos development testing is up and running again. Thanks!!!

hkthorn avatar Nov 21 '22 17:11 hkthorn

@hkthorn Yeah!! Glad this is working!

ccober6 avatar Nov 21 '22 17:11 ccober6