kokkos-kernels icon indicating copy to clipboard operation
kokkos-kernels copied to clipboard

Nightly cuda/12.0, cuda/11.8 unit test failures

Open ndellingwood opened this issue 2 years ago • 16 comments

Sub-tests are failing in cuda/12.0 builds with the batched_dla_cuda and batched_gemm_cuda unit tests with error message cudaDeviceSynchronize() error( cudaErrorMisalignedAddress): misaligned address

batched_dla_cuda

00:37:42 3: [ RUN      ] Cuda.batched_scalar_serial_inverselu_dcomplex
00:37:42 3: cudaDeviceSynchronize() error( cudaErrorMisalignedAddress): misaligned address /home/jenkins/jenkins-new/workspace/KokkosKernels_KokkosDev2_CUDA120_GCC92_cpp17/kokkos/core/src/Cuda/Kokkos_Cuda_Instance.cpp:132
00:37:42 3: Backtrace:
00:37:42 3:                         [0x675133]
00:37:42 3:                         [0x66d398]
00:37:42 3:                         [0x66d3cb]
00:37:42 3:                         [0x67b0fd]
00:37:42 3:                         [0x67bc91]
00:37:42 3:                         [0x651a65]
00:37:42 3:                         [0x5fd33a]
00:37:42 3:                         [0x43c787]
00:37:42 3:                         [0x64f34d]
00:37:42 3:                         [0x6432f3]
00:37:42 3:                         [0x6437a5]
00:37:42 3:                         [0x64398e]
00:37:42 3:                         [0x649a56]
00:37:42 3:                         [0x649cdb]
00:37:42 3:                         [0x413232]
00:37:42 3: __libc_start_main [0x7f7d99a18555]
00:37:42 3:                         [0x41acad]

batched_gemm_cuda

00:37:42 4: [ RUN      ] Cuda.batched_scalar_serial_gemm_nt_nt_dcomplex_dcomplex
00:37:42 4: cudaDeviceSynchronize() error( cudaErrorMisalignedAddress): misaligned address /home/jenkins/jenkins-new/workspace/KokkosKernels_KokkosDev2_CUDA120_GCC92_cpp17/kokkos/core/src/Cuda/Kokkos_Cuda_Instance.cpp:132
00:37:42 4: Backtrace:
00:37:42 4:                         [0x7c22d3]
00:37:42 4:                         [0x7ba538]
00:37:42 4:                         [0x7ba56b]
00:37:42 4:                         [0x7c829d]
00:37:42 4:                         [0x7c8e31]
00:37:42 4:                         [0x79eb35]
00:37:42 4:                         [0x45e0b1]
00:37:42 4:                         [0x6fda73]
00:37:42 4:                         [0x6fe1d6]
00:37:42 4:                         [0x79c41d]
00:37:42 4:                         [0x7903c3]
00:37:42 4:                         [0x790875]
00:37:42 4:                         [0x790a5e]
00:37:42 4:                         [0x796b26]
00:37:42 4:                         [0x796dab]
00:37:42 4:                         [0x40e032]
00:37:42 4: __libc_start_main [0x7fb3be79b555]
00:37:42 4:                         [0x4170dd]

Reproducer (kokkos-dev-2):

source /projects/sems/modulefiles/utils/sems-archive-modules-init.sh ; module use /home/projects/x86-64/modulefiles/local
module purge
module load sems-archive-env sems-archive-cmake/3.17.1 sems-archive-gcc/9.2.0 cuda/12.0

$KOKKOSKERNELS_PATH/cm_generate_makefile.bash --with-devices=Cuda,Serial --arch=Volta70 --compiler=$KOKKOS_PATH/bin/nvcc_wrapper --cxxflags="-O3 -Wall -Wunused-parameter -Wshadow -pedantic -Werror -Wsign-compare -Wtype-limits -Wuninitialized " --cxxstandard="17" --with-scalars='double,complex_double' --with-ordinals=int --with-offsets=int,size_t --with-layouts=LayoutLeft --with-cuda-options=enable_lambda   --no-examples

ndellingwood avatar Feb 01 '23 22:02 ndellingwood

@lucbv: Do you have any notes on this so I can pickup from where you left off or do you want to pair up?

e10harvey avatar Aug 10 '23 16:08 e10harvey

Notes:

  • Binaries in issue1663 build directory.
  • Use https://docs.nvidia.com/compute-sanitizer/ComputeSanitizer/index.html#memcheck-tool
  • If misalignment originates from a dcomplex view, verify that KOKKOS_ENABLE_COMPLEX_ALIGN is defined in kokkos core.

e10harvey avatar Aug 23 '23 17:08 e10harvey

Relevant snippet from memcheck:

========= Invalid __local__ read of size 16 bytes
=========     at 0xdeadbeef in void Kokkos::Impl::cuda_parallel_launch_local_memory<Kokkos::Impl::ParallelFor<Test::SerialInverseLU::Functor_BatchedSerialGemm<Kokkos::Cuda, Kokkos::View<Kokkos::complex<double> ***, Kokkos::LayoutLeft, Kokkos::Cuda>, Kokkos::complex<double>, Test::SerialInverseLU::ParamTag<KokkosBlas::Trans::NoTranspose, KokkosBlas::Trans::NoTranspose>, KokkosBlas::Algo::Level3::Blocked>, Kokkos::RangePolicy<Kokkos::Cuda, Test::SerialInverseLU::ParamTag<KokkosBlas::Trans::NoTranspose, KokkosBlas::Trans::NoTranspose>>, Kokkos::Cuda>>(T1

Note that all inverselu invalid reads come from the Blocked algo type.

e10harvey avatar Aug 23 '23 18:08 e10harvey

Note: Cuda/12 wants all addresses 16-byte aligned but, in the BatchedSerialGemm Blocked implementation, we de-reference a address that is 8-byte aligned.

TODO: Print out pointer scalar types and their size as well as the starting addresses of views/subviews.

e10harvey avatar Aug 23 '23 18:08 e10harvey

After more debugging I have determined that the misalignment is stemming from Functor_BatchedSerialGemm in Test_Batched_SerialInverseLU.hpp of an address outside the control of the parallel_for caller.

e10harvey avatar Aug 28 '23 15:08 e10harvey

Given that the functor in question does not use any addresses that are violating 16-byte alignment nor do locals (&_alpha or &_beta) violate 16-byte alignment, I believe this is either a Kokkos Core or a compiler bug. Regardless of where the bug stems from, we should ask someone from Cuda or Kokkos Core to investigate further.

e10harvey avatar Aug 28 '23 19:08 e10harvey

Here are more triaging results. Note that local memory can only be allocated by the compiler.

  1. Christian and I tried moving Scalar _alpha, _beta above the declaration of the _a, _b, _c locals in the functor class definition:
template <typename DeviceType, typename ViewType, typename ScalarType,
          typename ParamTagType, typename AlgoTagType>
struct Functor_BatchedSerialGemm {
  ScalarType _alpha, _beta;
  ViewType _a, _b, _c;

This change resulted in passing tests in cuda/12.0.

e10harvey avatar Aug 28 '23 22:08 e10harvey

The (register allocation bug?) still persists in cuda/12.2.

KokkosKernels HEAD SHA: 6c06bd024bbcb48b1ca6bef165bd13e73a3c3b44 Kokkos HEAD SHA: 7e299b4e25c42528e105379c3aa9a318056545ba

Local changes in KokkosKernels: kk_local_changes.txt

Local change in Kokkos: none.

module load sems-archive-env sems-archive-cmake/3.17.1 gcc/11 nvhpc/23.7
make -j16 KokkosKernels_batched_dla_cuda
$ ./batched/dense/unit_test/KokkosKernels_batched_dla_cuda --gtest_filter='*Cuda.batched_scalar_serial_inverselu_dcomplex*'
Note: Google Test filter = *Cuda.batched_scalar_serial_inverselu_dcomplex*
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from Cuda
[ RUN      ] Cuda.batched_scalar_serial_inverselu_dcomplex
mark0
mark1
a0:0x7fd615f32480
a1:0x7fd615f32680
c0:0x7fd615f32a80
w:0x7fd615f32880
sizeof(AViewType::value_type):16
&_alpha:0x7ffc56298260
&_beta:0x7ffc56298270
mark2
i:0
mark0
mark1
a0:0x7fd615f32480
a1:0x7fd615f32680
c0:0x7fd615f32a80
w:0x7fd615f32880
sizeof(AViewType::value_type):16
&_alpha:0x7ffc56298260
&_beta:0x7ffc56298270
(CudaInternal::singleton().cuda_device_synchronize_wrapper()) error( cudaErrorMisalignedAddress): misaligned address /ascldap/users/eharvey/KOKKOS.base/kokkos/core/src/Cuda/Kokkos_Cuda_Instance.cpp:153
Backtrace:
[0x6a0e23] 
[0x69b148] 
[0x69b17b] 
[0x6a6ee7] 
[0x6a786d] 
[0x67f4a5] 
[0x65961a] 
[0x47829d] 
[0x67cf8d] 
[0x672b98] 
[0x67332d] 
[0x673544] 
[0x6773d4] 
[0x672316] 
[0x4193c2] 
[0x7fd63cc7d555] __libc_start_main
[0x420b6d] 
Aborted (core dumped)
  1. Here is some additional sizeof and aligof information using cuda/12.2:
$ ./batched/dense/unit_test/KokkosKernels_batched_dla_cuda --gtest_filter='*Cuda.batched_scalar_serial_inverselu_dcomplex*'
Note: Google Test filter = *Cuda.batched_scalar_serial_inverselu_dcomplex*
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from Cuda
[ RUN      ] Cuda.batched_scalar_serial_inverselu_dcomplex
mark0
mark1
a0:0x7fc68df32480
a1:0x7fc68df32680
c0:0x7fc68df32a80
w:0x7fc68df32880
sizeof(AViewType::value_type):16
sizeof(ViewType):40
alignof(decltype(*this)):16
alignof(decltype(ViewType)):8
alignof(decltype(ScalarType)):16
&_alpha:0x7ffdfcd472f0
&_beta:0x7ffdfcd47300
mark2
i:0
mark0
mark1
a0:0x7fc68df32480
a1:0x7fc68df32680
c0:0x7fc68df32a80
w:0x7fc68df32880
sizeof(AViewType::value_type):16
sizeof(ViewType):40
alignof(decltype(*this)):16
alignof(decltype(ViewType)):8
alignof(decltype(ScalarType)):16
&_alpha:0x7ffdfcd472f0
&_beta:0x7ffdfcd47300
In Operator: alignof(decltype(*this)):16
In Operator: alignof(decltype(ViewType)):8
In Operator: alignof(decltype(ScalarType)):16
mark2
i:1
mark0
mark1
a0:0x7fc68df32480
a1:0x7fc68df36680
c0:0x7fc68df3ea80
w:0x7fc68df3a880
sizeof(AViewType::value_type):16
sizeof(ViewType):40
alignof(decltype(*this)):16
alignof(decltype(ViewType)):8
alignof(decltype(ScalarType)):16
&_alpha:0x7ffdfcd472f0
&_beta:0x7ffdfcd47300
In Operator: alignof(decltype(*this)):16
In Operator: alignof(decltype(ViewType)):8
In Operator: alignof(decltype(ScalarType)):16
mark2
i:2
mark0
mark1
a0:0x7fc68df32480
a1:0x7fc68df42680
c0:0x7fc68df62a80
w:0x7fc68df52880
sizeof(AViewType::value_type):16
sizeof(ViewType):40
alignof(decltype(*this)):16
alignof(decltype(ViewType)):8
alignof(decltype(ScalarType)):16
&_alpha:0x7ffdfcd472f0
&_beta:0x7ffdfcd47300
In Operator: alignof(decltype(*this)):16
In Operator: alignof(decltype(ViewType)):8
In Operator: alignof(decltype(ScalarType)):16
mark2
i:3
mark0
mark1
a0:0x7fc68df32480
a1:0x7fc68df56680
c0:0x7fc68df9ea80
w:0x7fc68df7a880
sizeof(AViewType::value_type):16
sizeof(ViewType):40
alignof(decltype(*this)):16
alignof(decltype(ViewType)):8
alignof(decltype(ScalarType)):16
&_alpha:0x7ffdfcd472f0
&_beta:0x7ffdfcd47300
In Operator: alignof(decltype(*this)):16
In Operator: alignof(decltype(ViewType)):8
In Operator: alignof(decltype(ScalarType)):16
mark2
i:4
mark0
mark1
a0:0x7fc68df32480
a1:0x7fc68df72680
c0:0x7fc67d400080
w:0x7fc68dfb2880
sizeof(AViewType::value_type):16
sizeof(ViewType):40
alignof(decltype(*this)):16
alignof(decltype(ViewType)):8
alignof(decltype(ScalarType)):16
&_alpha:0x7ffdfcd472f0
&_beta:0x7ffdfcd47300
In Operator: alignof(decltype(*this)):16
In Operator: alignof(decltype(ViewType)):8
In Operator: alignof(decltype(ScalarType)):16
mark2
i:5
mark0
mark1
a0:0x7fc68df32480
a1:0x7fc68df96680
c0:0x7fc67d464280
w:0x7fc67d400080
sizeof(AViewType::value_type):16
sizeof(ViewType):40
alignof(decltype(*this)):16
alignof(decltype(ViewType)):8
alignof(decltype(ScalarType)):16
&_alpha:0x7ffdfcd472f0
&_beta:0x7ffdfcd47300
In Operator: alignof(decltype(*this)):16
In Operator: alignof(decltype(ViewType)):8
In Operator: alignof(decltype(ScalarType)):16
mark2
i:6
mark0
mark1
a0:0x7fc68df32480
a1:0x7fc67d400080
c0:0x7fc67d520480
w:0x7fc67d490280
sizeof(AViewType::value_type):16
sizeof(ViewType):40
alignof(decltype(*this)):16
alignof(decltype(ViewType)):8
alignof(decltype(ScalarType)):16
&_alpha:0x7ffdfcd472f0
&_beta:0x7ffdfcd47300
In Operator: alignof(decltype(*this)):16
In Operator: alignof(decltype(ViewType)):8
In Operator: alignof(decltype(ScalarType)):16
mark2
i:7
mark0
mark1
a0:0x7fc68df32480
a1:0x7fc67d400080
c0:0x7fc67dc00080
w:0x7fc67d4c4280
sizeof(AViewType::value_type):16
sizeof(ViewType):40
alignof(decltype(*this)):16
alignof(decltype(ViewType)):8
alignof(decltype(ScalarType)):16
&_alpha:0x7ffdfcd472f0
&_beta:0x7ffdfcd47300
In Operator: alignof(decltype(*this)):16
In Operator: alignof(decltype(ViewType)):8
In Operator: alignof(decltype(ScalarType)):16
mark2
i:8
mark0
mark1
a0:0x7fc67d400080
a1:0x7fc67dc00080
c0:0x7fc67e000080
w:0x7fc67de00080
sizeof(AViewType::value_type):16
sizeof(ViewType):40
alignof(decltype(*this)):16
alignof(decltype(ViewType)):8
alignof(decltype(ScalarType)):16
&_alpha:0x7ffdfcd472f0
&_beta:0x7ffdfcd47300
In Operator: alignof(decltype(*this)):16
In Operator: alignof(decltype(ViewType)):8
In Operator: alignof(decltype(ScalarType)):16
mark2
i:9
mark0
mark1
a0:0x7fc67d400080
a1:0x7fc67dc00080
c0:0x7fc67e000080
w:0x7fc67de00080
sizeof(AViewType::value_type):16
sizeof(ViewType):40
alignof(decltype(*this)):16
alignof(decltype(ViewType)):8
alignof(decltype(ScalarType)):16
&_alpha:0x7ffdfcd472f0
&_beta:0x7ffdfcd47300
In Operator: alignof(decltype(*this)):16
In Operator: alignof(decltype(ViewType)):8
In Operator: alignof(decltype(ScalarType)):16
mark2
mark0
mark1
a0:0x7fc68df32480
a1:0x7fc68df32680
c0:0x7fc68df32a80
w:0x7fc68df32880
sizeof(AViewType::value_type):16
sizeof(ViewType):40
alignof(decltype(*this)):16
alignof(decltype(ViewType)):8
alignof(decltype(ScalarType)):16
&_alpha:0x7ffdfcd472f0
&_beta:0x7ffdfcd47300
mark2
i:0
mark0
mark1
a0:0x7fc68df32480
a1:0x7fc68df32680
c0:0x7fc68df32a80
w:0x7fc68df32880
sizeof(AViewType::value_type):16
sizeof(ViewType):40
alignof(decltype(*this)):16
alignof(decltype(ViewType)):8
alignof(decltype(ScalarType)):16
&_alpha:0x7ffdfcd472f0
&_beta:0x7ffdfcd47300
In Operator: alignof(decltype(*this)):16
In Operator: alignof(decltype(ViewType)):8
In Operator: alignof(decltype(ScalarType)):16
mark2
i:1
mark0
mark1
a0:0x7fc68df32480
a1:0x7fc68df36680
c0:0x7fc68df3ea80
w:0x7fc68df3a880
sizeof(AViewType::value_type):16
sizeof(ViewType):40
alignof(decltype(*this)):16
alignof(decltype(ViewType)):8
alignof(decltype(ScalarType)):16
&_alpha:0x7ffdfcd472f0
&_beta:0x7ffdfcd47300
In Operator: alignof(decltype(*this)):16
In Operator: alignof(decltype(ViewType)):8
In Operator: alignof(decltype(ScalarType)):16
mark2
i:2
mark0
mark1
a0:0x7fc68df32480
a1:0x7fc68df42680
c0:0x7fc68df62a80
w:0x7fc68df52880
sizeof(AViewType::value_type):16
sizeof(ViewType):40
alignof(decltype(*this)):16
alignof(decltype(ViewType)):8
alignof(decltype(ScalarType)):16
&_alpha:0x7ffdfcd472f0
&_beta:0x7ffdfcd47300
In Operator: alignof(decltype(*this)):16
In Operator: alignof(decltype(ViewType)):8
In Operator: alignof(decltype(ScalarType)):16
mark2
i:3
mark0
mark1
a0:0x7fc68df32480
a1:0x7fc68df56680
c0:0x7fc68df9ea80
w:0x7fc68df7a880
sizeof(AViewType::value_type):16
sizeof(ViewType):40
alignof(decltype(*this)):16
alignof(decltype(ViewType)):8
alignof(decltype(ScalarType)):16
&_alpha:0x7ffdfcd472f0
&_beta:0x7ffdfcd47300
In Operator: alignof(decltype(*this)):16
In Operator: alignof(decltype(ViewType)):8
In Operator: alignof(decltype(ScalarType)):16
mark2
i:4
mark0
mark1
a0:0x7fc68df32480
a1:0x7fc68df72680
c0:0x7fc67d400080
w:0x7fc68dfb2880
sizeof(AViewType::value_type):16
sizeof(ViewType):40
alignof(decltype(*this)):16
alignof(decltype(ViewType)):8
alignof(decltype(ScalarType)):16
&_alpha:0x7ffdfcd472f0
&_beta:0x7ffdfcd47300
In Operator: alignof(decltype(*this)):16
In Operator: alignof(decltype(ViewType)):8
In Operator: alignof(decltype(ScalarType)):16
mark2
i:5
mark0
mark1
a0:0x7fc68df32480
a1:0x7fc68df96680
c0:0x7fc67d464280
w:0x7fc67d400080
sizeof(AViewType::value_type):16
sizeof(ViewType):40
alignof(decltype(*this)):16
alignof(decltype(ViewType)):8
alignof(decltype(ScalarType)):16
&_alpha:0x7ffdfcd472f0
&_beta:0x7ffdfcd47300
In Operator: alignof(decltype(*this)):16
In Operator: alignof(decltype(ViewType)):8
In Operator: alignof(decltype(ScalarType)):16
mark2
i:6
mark0
mark1
a0:0x7fc68df32480
a1:0x7fc67d400080
c0:0x7fc67d520480
w:0x7fc67d490280
sizeof(AViewType::value_type):16
sizeof(ViewType):40
alignof(decltype(*this)):16
alignof(decltype(ViewType)):8
alignof(decltype(ScalarType)):16
&_alpha:0x7ffdfcd472f0
&_beta:0x7ffdfcd47300
In Operator: alignof(decltype(*this)):16
In Operator: alignof(decltype(ViewType)):8
In Operator: alignof(decltype(ScalarType)):16
mark2
i:7
mark0
mark1
a0:0x7fc68df32480
a1:0x7fc67d400080
c0:0x7fc67dc00080
w:0x7fc67d4c4280
sizeof(AViewType::value_type):16
sizeof(ViewType):40
alignof(decltype(*this)):16
alignof(decltype(ViewType)):8
alignof(decltype(ScalarType)):16
&_alpha:0x7ffdfcd472f0
&_beta:0x7ffdfcd47300
In Operator: alignof(decltype(*this)):16
In Operator: alignof(decltype(ViewType)):8
In Operator: alignof(decltype(ScalarType)):16
mark2
i:8
mark0
mark1
a0:0x7fc67d400080
a1:0x7fc67dc00080
c0:0x7fc67e000080
w:0x7fc67de00080
sizeof(AViewType::value_type):16
sizeof(ViewType):40
alignof(decltype(*this)):16
alignof(decltype(ViewType)):8
alignof(decltype(ScalarType)):16
&_alpha:0x7ffdfcd472f0
&_beta:0x7ffdfcd47300
In Operator: alignof(decltype(*this)):16
In Operator: alignof(decltype(ViewType)):8
In Operator: alignof(decltype(ScalarType)):16
mark2
i:9
mark0
mark1
a0:0x7fc67d400080
a1:0x7fc67dc00080
c0:0x7fc67e000080
w:0x7fc67de00080
sizeof(AViewType::value_type):16
sizeof(ViewType):40
alignof(decltype(*this)):16
alignof(decltype(ViewType)):8
alignof(decltype(ScalarType)):16
&_alpha:0x7ffdfcd472f0
&_beta:0x7ffdfcd47300
In Operator: alignof(decltype(*this)):16
In Operator: alignof(decltype(ViewType)):8
In Operator: alignof(decltype(ScalarType)):16
mark2
[       OK ] Cuda.batched_scalar_serial_inverselu_dcomplex (116 ms)
[----------] 1 test from Cuda (116 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (116 ms total)
[  PASSED  ] 1 test.

NOTE: You have to comment out the following prints in the operator to trigger misalignment:

  KOKKOS_INLINE_FUNCTION
  void operator()(const ParamTagType &, const int k) const {
    auto aa = Kokkos::subview(_a, k, Kokkos::ALL(), Kokkos::ALL());
    auto bb = Kokkos::subview(_b, k, Kokkos::ALL(), Kokkos::ALL());
    auto cc = Kokkos::subview(_c, k, Kokkos::ALL(), Kokkos::ALL());

    /* if (k == 0) {
      printf("In Operator: alignof(decltype(*this)):%lu\n", alignof(decltype(*this)));
      printf("In Operator: alignof(decltype(ViewType)):%lu\n", alignof(ViewType));
      printf("In Operator: alignof(decltype(ScalarType)):%lu\n", alignof(ScalarType));
    } */
$ ./batched/dense/unit_test/KokkosKernels_batched_dla_cuda --gtest_filter='*Cuda.batched_scalar_serial_inverselu_dcomplex*'
Note: Google Test filter = *Cuda.batched_scalar_serial_inverselu_dcomplex*
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from Cuda
[ RUN      ] Cuda.batched_scalar_serial_inverselu_dcomplex
mark0
mark1
a0:0x7ff6a7f32480
a1:0x7ff6a7f32680
c0:0x7ff6a7f32a80
w:0x7ff6a7f32880
sizeof(AViewType::value_type):16
sizeof(ViewType):40
alignof(decltype(*this)):16
alignof(decltype(ViewType)):8
alignof(decltype(ScalarType)):16
&_alpha:0x7ffd8a1ea7b0
&_beta:0x7ffd8a1ea7c0
mark2
i:0
mark0
mark1
a0:0x7ff6a7f32480
a1:0x7ff6a7f32680
c0:0x7ff6a7f32a80
w:0x7ff6a7f32880
sizeof(AViewType::value_type):16
sizeof(ViewType):40
alignof(decltype(*this)):16
alignof(decltype(ViewType)):8
alignof(decltype(ScalarType)):16
&_alpha:0x7ffd8a1ea7b0
&_beta:0x7ffd8a1ea7c0
(CudaInternal::singleton().cuda_device_synchronize_wrapper()) error( cudaErrorMisalignedAddress): misaligned address /ascldap/users/eharvey/KOKKOS.base/kokkos/core/src/Cuda/Kokkos_Cuda_Instance.cpp:153
Backtrace:
[0x6a11b3] 
[0x69b4d8] 
[0x69b50b] 
[0x6a7277] 
[0x6a7bfd] 
[0x67f835] 
[0x65996e] 
[0x47828d] 
[0x67d31d] 
[0x672f28] 
[0x6736bd] 
[0x6738d4] 
[0x677764] 
[0x6726a6] 
[0x419382] 
[0x7ff6d284e555] __libc_start_main
[0x420b2d] 
Aborted (core dumped)

e10harvey avatar Aug 28 '23 22:08 e10harvey

Hello, I am looking into this bug, and came across something I found strange. If you keep all the source for the test the same, but take out one Kokkos::abort, then it seems to not hit this error message. Does anyone have an idea why that would be?

change the abort here to just return 0; or comment it out entirely.

  if (!(m <= 2 && n <= 2))
   Kokkos::abort(
        "InnerGemmFixC<2,2>::serial_invoke, assert failure (m<=2 && n<=2)");

to

  if (!(m <= 2 && n <= 2)) return 0;
//    Kokkos::abort(
//        "InnerGemmFixC<2,2>::serial_invoke, assert failure (m<=2 && n<=2)");

And on my machine I get no error. Because of the lack of abort, am I just missing a cudaCheckLastError call or something like that? I cant tell yet if the Kokkos::abort is an issue here, or its causing me to miss the trigger for the bug, or its not printing the Cuda error. Though when I searched through the src for cuda_abort, it looks like it just prints the message you give it. @crtrott for vis

matt-stack avatar Sep 27 '23 20:09 matt-stack

Just to update, these two tests fail with cd8f77c1c61c45bd8071bc7870b55bd045a727c9 when enabling complex_double types in builds with c++20 enabled as well using for example cuda/12.0.0 + gcc/11.3.0

ndellingwood avatar Dec 15 '23 22:12 ndellingwood

If I configure with the option -DKokkos_ENABLE_COMPLEX_ALIGN=OFF then the tests posted above pass Adding @crtrott @dalg24 @masterleinad to the loop

ndellingwood avatar Dec 15 '23 23:12 ndellingwood

The same tests fail with cuda/11.8.0 when testing with cusparse and magma tpls enabled

ndellingwood avatar Apr 13 '24 00:04 ndellingwood

Updating the issue to confirm the same tests still fail with cuda/11.8.0, cuda/12.0 +/- c++20 on Weaver (Volta70+Power9) with SHA 32aa75a8f20ca88df64bde421c335b9fa6f68397

Configuration (Weaver, cuda/12.0 w/ c++20):

bsub -Is -n 1 -q rhel8 -gpu "num=1" bash

source /etc/profile.d/modules.sh
module load cmake git gcc/11.3.0 cuda/12.0.0

${KOKKOSKERNELS_PATH}/cm_generate_makefile.bash --with-cuda --with-serial --compiler=${KOKKOS_PATH}/bin/nvcc_wrapper --arch=Volta70,Power9 --with-cuda-options=enable_lambda --kokkos-path=${KOKKOS_PATH} --kokkoskernels-path=${KOKKOSKERNELS_PATH} --with-scalars='double,complex_double' --with-ordinals=int --with-offsets=int,size_t --cxxstandard=20

Test failures:

16:17:09 The following tests FAILED:
16:17:09 	  3 - batched_dla_cuda (Subprocess aborted)
16:17:09 	  4 - batched_gemm_cuda (Subprocess aborted)

ndellingwood avatar Jun 04 '24 16:06 ndellingwood

The tests above passed on kokkos-dev-2 with sems-cuda/12.4 + sems-gcc/13.2.0

ndellingwood avatar Jun 04 '24 18:06 ndellingwood

@ndellingwood so with cuda 12.4 we have the batched_dla_cuda and batched_gemm_cuda working correctly? Anything else failing on that platform?

lucbv avatar Jun 04 '24 18:06 lucbv

@ndellingwood so with cuda 12.4 we have the batched_dla_cuda and batched_gemm_cuda working correctly? Anything else failing on that platform?

@lucbv on kokkos-dev-2 the configuration here (with Power9 dropped), using sems-cuda/12.4, the tests passed 100%

ndellingwood avatar Jun 04 '24 19:06 ndellingwood