Daniel Arndt
In addition to the points raised above, `Unroll` seems like a misnomer to me here. What we are really doing is giving every thread more work items. For some...
> I am not sure if unroll is a misnomer. From my understanding, the compiler is unrolling the loop since it has access to the loop length at compile time even without...
> These names are fine with me. I prefer `StaticBatchSize`, but the others work too. What do you think, @masterleinad?

Maybe. I'm just curious what you want to do...
> My thinking is we would still change the number of workers that are involved in the `parallel_for`, similar to the way we did in CUDA (Essentially changing the hardware...
> Regarding your second comment, what would be the workaround for the overflow error? Something like
> ```C++
> for (Member i = 0; ((i < static_cast(work_stride * batch_size)) && (i <...
> ```
https://github.com/kokkos/kokkos/issues/3044 is related.
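For reference, one generic way to keep a strided loop from overflowing its index type is to compare against the remaining distance before advancing. This is only a sketch of that idea, not the loop proposed in the quoted comment and not what Kokkos does; `count_strided` is a hypothetical name, and it assumes `0 <= begin <= end` and `stride > 0`.

```cpp
#include <cstdint>

// Hypothetical sketch: counts the indices begin, begin + stride, ... that are
// smaller than end, without ever forming an index beyond end.
// Assumes 0 <= begin <= end and stride > 0.
template <class Member>
std::uint64_t count_strided(Member begin, Member end, Member stride) {
  std::uint64_t visited = 0;
  for (Member i = begin; i < end;) {
    ++visited;  // the loop body would run here for index i
    // Check the remaining distance instead of computing i + stride, which
    // could exceed the largest value representable in Member.
    if (end - i <= stride) break;
    i += stride;
  }
  return visited;
}
```

With `Member = std::uint8_t`, `end = 250`, and `stride = 100`, a naive `i += stride` wraps around after `i = 200` and keeps satisfying `i < end`, whereas the guarded version stops after visiting 0, 100, and 200.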
Can you explain for every synchronization barrier added why they are necessary?
Since the HIP implementation for the most part is a copy of the Cuda implementation, I would expect that we need the same barriers for both backends.
```diff
diff --git a/core/src/HIP/Kokkos_HIP_ParallelScan_Range.hpp b/core/src/HIP/Kokkos_HIP_ParallelScan_Range.hpp
index ce9b35b0d..93f23dd4a 100644
--- a/core/src/HIP/Kokkos_HIP_ParallelScan_Range.hpp
+++ b/core/src/HIP/Kokkos_HIP_ParallelScan_Range.hpp
@@ -156,9 +156,6 @@ class ParallelScanHIPBase {
          iwork_base < range.end(); iwork_base += blockDim.y) {
       const typename Policy::member_type iwork...
```
Also note that the unit test doesn't use the result anyway.