ddc KOKKOS_FUNCTION-annotated for_each and transform

Closes #172

Dec 15 '24 13:12 blegouix

@tpadioleau can you take a look at the two issues I mention in the first message ? The second one just seems to be an annoying bug with the support of [[maybe_unused]] in hipcc but the first one is more worrying, it seems the loop https://github.com/CExA-project/ddc/pull/708/files#diff-46765d278566dca059db5ef8ac1fc5e486964e46a1f7303302c7b7d80b9b4ab0R47 does not iterate correctly with nvcc

Dec 16 '24 13:12 blegouix

I don't know if i will have time to have a closer look this week. You could look for the differences with the implementation in the first PR, i think the tests were passing

Dec 16 '24 16:12 tpadioleau

I think I have the same implementation as the closed branch. But you don't have 2D test for nested for_each, and I think the bug with nvcc appears only for 2D+ cases.

Dec 16 '24 16:12 blegouix

and I think the bug with nvcc appears only for 2D+ cases.

Did you test ?

Dec 16 '24 19:12 tpadioleau

I just tested it and you were right to ask because the 1D case does not work to. I tried you branch and I don't get the bug. This looks very weird to me, either I am missing something obvious, either something broke somewhere since March. If you have any idea I'd take it, otherwise I will take my time for this because it does not look easy to debug.

Dec 17 '24 10:12 blegouix

I have found another bug. This is not visible in the tests (should I add a test which reproduces it ?). I have a use-case where the annotated_transform_reduce returns 0 with CUDA (it should not). Here is a fix:

result = reduce(
        result,
        annotated_transform_reduce_serial(
                domain,
                neutral,
                reduce,
                transform,
                dcoords...,
                ii));

Must be replaced with:

T tmp;
Kokkos::atomic_store(
      &tmp,
      annotated_transform_reduce_serial(
                domain,
                neutral,
                reduce,
                transform,
                dcoords...,
                ii));
Kokkos::atomic_store(&result, reduce(result, tmp));

(I don't understand what is the problem as result is a local per-thread variable though)

https://github.com/CExA-project/ddc/blob/73ee089dc3e9682512328773eb59c293a18e042a/include/ddc/transform_reduce.hpp#L84

Jan 01 '25 15:01 blegouix

All of that is weird, can you provide more information about your environment ?

the hardware, cpu and gpu
the version of your compiler
the version of the libraries, cmake, cuda, kokkos...
the complete cmake command line
anything you think relevant

Also do all other DDC tests pass ? Do kokkos tests pass ?

It feels like the compiler optimized away some function calls

Jan 01 '25 15:01 tpadioleau

Ubuntu 24
Intel(R) Xeon(R) Gold 6226R + V100S
nvcc 12.6 + g++13.3
cmake 3.28, cuda 12.6, the Kokkos is the DDC submodule up-to-date with the main DDC branch.
cmake -DCMAKE_CXX_COMPILER=$KOKKOS_BIN/nvcc_wrapper -DCMAKE_BUILD_TYPE=Debug -DKokkos_ENABLE_CUDA=ON -DKokkos_ARCH_VOLTA70=ON -DCMAKE_CXX_FLAGS="-Wfatal-errors" ..

All DDC tests pass. I don't know how to compile the Kokkos tests.

I will try quite soon on a 1070Ti with mostly the same software configuration. I will try to dig in compute-sanitizer also. Maybe if you can see if my first problem reproduces on Ruche or Persée by going in this branch and compiling with or without the static_cast<int> it may be helpful ? I agree all this is weird and i'm feeling nvcc is capricious, but it is possible that I am missing something.

Jan 01 '25 20:01 blegouix

From the table https://docs.nvidia.com/cuda/cuda-installation-guide-linux/#system-requirements you are one minor version above the maximum tested host compiler version, i.e. 13.2.

Can you try with a different compiler ?

Jan 01 '25 21:01 tpadioleau

@tpadioleau despite efforts I made no progress on this branch and I have no more time to spend on it. Test is failling with CUDA. Replacing Element ii = begin[I] with ìnt ii = begin[I] at https://github.com/blegouix/similie/blob/ee2b3034e5ac270519cd9fd98a190468733c1c76/examples/2d_vector_laplacian.cpp#L121 is enough to fix but makes no sense.

Jan 31 '25 12:01 blegouix

I should have time to look at it in february, thanks for the work so far

Jan 31 '25 17:01 tpadioleau

Hello,

I have updated the branch and tested with CUDA12.9+gcc13. Same behavior.

I have also tested to use a functor in place of lambda, same behavior.

So, no progress :/ Looks like a CUDA bug to me.

Aug 25 '25 19:08 blegouix

@blegouix I will be finishing the work for this PR. I haven't been able to reproduce the bug you are reporting, even on a GTX1070, so it will be merged mostly as is.

Nov 20 '25 10:11 PaulGannay

@PaulGannay ok thank you very much!

Nov 20 '25 12:11 blegouix

@blegouix FYI, we have identified a behavior that looks like a bug. It appears in debug in loops. It has been reported to Nvidia.

Dec 05 '25 12:12 tpadioleau

Ok thx for the feedback, I am not surprised, the issues I encountered were very suspect. I was not sure if it was a CUDA or a Kokkos bug but something was clearly wrong.

Dec 05 '25 14:12 blegouix

KOKKOS_FUNCTION-annotated for_each and transform_reduce