Joachim
I will merge this one once the previous PR has been merged.
Everything is working as (not) intended.
No, not random, but if you encounter a value that is not in order, then the value should be eliminated.
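A minimal sketch of that rule, assuming "in order" means non-decreasing relative to the last value that was kept (that reading, and the name `drop_out_of_order`, are my assumptions for illustration only):

```cpp
#include <vector>

// Hypothetical helper: walks the input once and keeps a value only if it
// does not break the non-decreasing order of the values kept so far.
// Any value that is out of order is eliminated.
std::vector<int> drop_out_of_order(const std::vector<int>& input) {
    std::vector<int> kept;
    kept.reserve(input.size());
    for (int v : input) {
        // Assumption: equal values count as "in order".
        if (kept.empty() || v >= kept.back()) {
            kept.push_back(v);
        }
    }
    return kept;
}
```

With this reading, an input of `{1, 3, 2, 4}` keeps `1, 3, 4` and drops the `2`. If "in order" is meant to be strict, the comparison would be `>` instead of `>=`.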
I see what you have done here. However, I don't see how we can include this in the repo.
So should we propose to the Python committee that they rename `__builtins__` to `__PIGS__`?
For the case without the fence (in my first simplified example), each GPU's utilisation is near 100%. It is only when I introduce the fence that each GPU drops to...
```
Compiler:
  KOKKOS_COMPILER_GNU: 1130
  KOKKOS_COMPILER_NVCC: 1180
Architecture:
  CPU architecture: none
  Default Device: N6Kokkos4CudaE
  GPU architecture: AMPERE80
  platform: 64bit
Atomics:
Vectorization:
  KOKKOS_ENABLE_PRAGMA_IVDEP: no
  KOKKOS_ENABLE_PRAGMA_LOOPCOUNT: no
  KOKKOS_ENABLE_PRAGMA_UNROLL: no
  KOKKOS_ENABLE_PRAGMA_VECTOR: no
Memory:
  KOKKOS_ENABLE_HBWSPACE:...
```
Yes, I cropped the first line, sorry, and removed the duplicates from the printout (4 concurrent MPI ranks). Here are the GPU processes from `nvidia-smi`: ``` +-----------------------------------------------------------------------------+ | Processes: | | GPU...
I confirm that each rank selects a different GPU. Yes, I know that the kernel is tiny. As you said, on 1 GPU I see a difference of about 2.2s (32.2s vs...
I have devised another test, which doesn't require explicit fencing, and which shows the same behavior. I just added a parallel reduction after the first kernel. ```C++ #include #include #include...