KOKKOS_FUNCTION-annotated for_each and transform_reduce
Closes #172
@tpadioleau can you take a look at the two issues I mention in the first message? The second one just seems to be an annoying bug with the support of `[[maybe_unused]]` in hipcc, but the first one is more worrying: it seems the loop https://github.com/CExA-project/ddc/pull/708/files#diff-46765d278566dca059db5ef8ac1fc5e486964e46a1f7303302c7b7d80b9b4ab0R47 does not iterate correctly with nvcc.
I don't know if I will have time to take a closer look this week. You could look for the differences with the implementation in the first PR; I think the tests were passing.
I think I have the same implementation as the closed branch. But you don't have a 2D test for nested for_each, and I think the bug with nvcc appears only in 2D+ cases.
Did you test?
I just tested it and you were right to ask, because the 1D case does not work either. I tried your branch and I don't get the bug. This looks very weird to me: either I am missing something obvious, or something broke somewhere since March. If you have any idea I'd take it; otherwise I will take my time on this, because it does not look easy to debug.
I have found another bug. It is not visible in the tests (should I add a test which reproduces it?). I have a use case where the annotated_transform_reduce returns 0 with CUDA when it should not. Here is a fix:
```cpp
result = reduce(
        result,
        annotated_transform_reduce_serial(
                domain,
                neutral,
                reduce,
                transform,
                dcoords...,
                ii));
```
must be replaced with:
```cpp
T tmp;
Kokkos::atomic_store(
        &tmp,
        annotated_transform_reduce_serial(
                domain,
                neutral,
                reduce,
                transform,
                dcoords...,
                ii));
Kokkos::atomic_store(&result, reduce(result, tmp));
```
(I don't understand what the problem is, though, as `result` is a local per-thread variable.)
https://github.com/CExA-project/ddc/blob/73ee089dc3e9682512328773eb59c293a18e042a/include/ddc/transform_reduce.hpp#L84
All of this is weird. Can you provide more information about your environment?
- the hardware, cpu and gpu
- the version of your compiler
- the versions of the libraries: cmake, cuda, kokkos...
- the complete cmake command line
- anything you think relevant
Also, do all the other DDC tests pass? Do the Kokkos tests pass?
It feels like the compiler optimized away some function calls.
- Ubuntu 24
- Intel(R) Xeon(R) Gold 6226R + V100S
- nvcc 12.6 + g++ 13.3
- cmake 3.28, cuda 12.6; Kokkos is the DDC submodule, up to date with the main DDC branch
- `cmake -DCMAKE_CXX_COMPILER=$KOKKOS_BIN/nvcc_wrapper -DCMAKE_BUILD_TYPE=Debug -DKokkos_ENABLE_CUDA=ON -DKokkos_ARCH_VOLTA70=ON -DCMAKE_CXX_FLAGS="-Wfatal-errors" ..`
All DDC tests pass. I don't know how to compile the Kokkos tests.
I will try soon on a 1070 Ti with mostly the same software configuration, and I will also try to dig in with compute-sanitizer. It might be helpful if you could check whether my first problem reproduces on Ruche or Persée, by checking out this branch and compiling with and without the `static_cast<int>`. I agree all this is weird and I feel nvcc is being capricious, but it is possible that I am missing something.
According to the table at https://docs.nvidia.com/cuda/cuda-installation-guide-linux/#system-requirements, you are one minor version above the maximum tested host compiler version, i.e. 13.2.
Can you try with a different compiler ?
@tpadioleau despite my efforts I made no progress on this branch, and I have no more time to spend on it. The test is failing with CUDA. Replacing `Element ii = begin[I]` with `int ii = begin[I]` at https://github.com/blegouix/similie/blob/ee2b3034e5ac270519cd9fd98a190468733c1c76/examples/2d_vector_laplacian.cpp#L121 is enough to fix it, but that makes no sense.
I should have time to look at it in February. Thanks for the work so far.
Hello,
I have updated the branch and tested with CUDA 12.9 + gcc 13. Same behavior.
I have also tried using a functor in place of the lambda: same behavior.
So, no progress :/ It looks like a CUDA bug to me.
@blegouix I will be finishing the work for this PR. I haven't been able to reproduce the bug you are reporting, even on a GTX 1070, so it will be merged mostly as is.
@PaulGannay ok thank you very much!
@blegouix FYI, we have identified a behavior that looks like a bug. It appears in debug builds, in loops. It has been reported to Nvidia.
Ok, thanks for the feedback. I am not surprised; the issues I encountered were very suspicious. I was not sure whether it was a CUDA or a Kokkos bug, but something was clearly wrong.