ddc icon indicating copy to clipboard operation
ddc copied to clipboard

KOKKOS_FUNCTION-annotated for_each and transform_reduce

Open blegouix opened this issue 1 year ago • 14 comments

Closes #172

blegouix avatar Dec 15 '24 13:12 blegouix

@tpadioleau can you take a look at the two issues I mention in the first message ? The second one just seems to be an annoying bug with the support of [[maybe_unused]] in hipcc but the first one is more worrying, it seems the loop https://github.com/CExA-project/ddc/pull/708/files#diff-46765d278566dca059db5ef8ac1fc5e486964e46a1f7303302c7b7d80b9b4ab0R47 does not iterate correctly with nvcc

blegouix avatar Dec 16 '24 13:12 blegouix

I don't know if i will have time to have a closer look this week. You could look for the differences with the implementation in the first PR, i think the tests were passing

tpadioleau avatar Dec 16 '24 16:12 tpadioleau

I think I have the same implementation as the closed branch. But you don't have 2D test for nested for_each, and I think the bug with nvcc appears only for 2D+ cases.

blegouix avatar Dec 16 '24 16:12 blegouix

and I think the bug with nvcc appears only for 2D+ cases.

Did you test ?

tpadioleau avatar Dec 16 '24 19:12 tpadioleau

I just tested it and you were right to ask because the 1D case does not work to. I tried you branch and I don't get the bug. This looks very weird to me, either I am missing something obvious, either something broke somewhere since March. If you have any idea I'd take it, otherwise I will take my time for this because it does not look easy to debug.

blegouix avatar Dec 17 '24 10:12 blegouix

I have found another bug. This is not visible in the tests (should I add a test which reproduces it ?). I have a use-case where the annotated_transform_reduce returns 0 with CUDA (it should not). Here is a fix:

result = reduce(
        result,
        annotated_transform_reduce_serial(
                domain,
                neutral,
                reduce,
                transform,
                dcoords...,
                ii));

Must be replaced with:

T tmp;
Kokkos::atomic_store(
      &tmp,
      annotated_transform_reduce_serial(
                domain,
                neutral,
                reduce,
                transform,
                dcoords...,
                ii));
Kokkos::atomic_store(&result, reduce(result, tmp));

(I don't understand what is the problem as result is a local per-thread variable though)

https://github.com/CExA-project/ddc/blob/73ee089dc3e9682512328773eb59c293a18e042a/include/ddc/transform_reduce.hpp#L84

blegouix avatar Jan 01 '25 15:01 blegouix

All of that is weird, can you provide more information about your environment ?

  • the hardware, cpu and gpu
  • the version of your compiler
  • the version of the libraries, cmake, cuda, kokkos...
  • the complete cmake command line
  • anything you think relevant

Also do all other DDC tests pass ? Do kokkos tests pass ?

It feels like the compiler optimized away some function calls

tpadioleau avatar Jan 01 '25 15:01 tpadioleau

  • Ubuntu 24
  • Intel(R) Xeon(R) Gold 6226R + V100S
  • nvcc 12.6 + g++13.3
  • cmake 3.28, cuda 12.6, the Kokkos is the DDC submodule up-to-date with the main DDC branch.
  • cmake -DCMAKE_CXX_COMPILER=$KOKKOS_BIN/nvcc_wrapper -DCMAKE_BUILD_TYPE=Debug -DKokkos_ENABLE_CUDA=ON -DKokkos_ARCH_VOLTA70=ON -DCMAKE_CXX_FLAGS="-Wfatal-errors" ..

All DDC tests pass. I don't know how to compile the Kokkos tests.

I will try quite soon on a 1070Ti with mostly the same software configuration. I will try to dig in compute-sanitizer also. Maybe if you can see if my first problem reproduces on Ruche or Persée by going in this branch and compiling with or without the static_cast<int> it may be helpful ? I agree all this is weird and i'm feeling nvcc is capricious, but it is possible that I am missing something.

blegouix avatar Jan 01 '25 20:01 blegouix

From the table https://docs.nvidia.com/cuda/cuda-installation-guide-linux/#system-requirements you are one minor version above the maximum tested host compiler version, i.e. 13.2.

Can you try with a different compiler ?

tpadioleau avatar Jan 01 '25 21:01 tpadioleau

@tpadioleau despite efforts I made no progress on this branch and I have no more time to spend on it. Test is failling with CUDA. Replacing Element ii = begin[I] with ìnt ii = begin[I] at https://github.com/blegouix/similie/blob/ee2b3034e5ac270519cd9fd98a190468733c1c76/examples/2d_vector_laplacian.cpp#L121 is enough to fix but makes no sense.

blegouix avatar Jan 31 '25 12:01 blegouix

I should have time to look at it in february, thanks for the work so far

tpadioleau avatar Jan 31 '25 17:01 tpadioleau

Hello,

I have updated the branch and tested with CUDA12.9+gcc13. Same behavior.

I have also tested to use a functor in place of lambda, same behavior.

image

So, no progress :/ Looks like a CUDA bug to me.

blegouix avatar Aug 25 '25 19:08 blegouix

@blegouix I will be finishing the work for this PR. I haven't been able to reproduce the bug you are reporting, even on a GTX1070, so it will be merged mostly as is.

PaulGannay avatar Nov 20 '25 10:11 PaulGannay

@PaulGannay ok thank you very much!

blegouix avatar Nov 20 '25 12:11 blegouix

@blegouix FYI, we have identified a behavior that looks like a bug. It appears in debug in loops. It has been reported to Nvidia.

tpadioleau avatar Dec 05 '25 12:12 tpadioleau

Ok thx for the feedback, I am not surprised, the issues I encountered were very suspect. I was not sure if it was a CUDA or a Kokkos bug but something was clearly wrong.

blegouix avatar Dec 05 '25 14:12 blegouix