
Buffered Parallel For

Open sslattery opened this issue 5 years ago • 5 comments

It was indicated by the XGC team that they need to run more particles than fit in the available GPU memory. Using CUDA UVM allows the particle count to increase, but the page-load overhead of the resulting swapping does not allow for good performance. We propose a buffering strategy that overlaps computation and data fetching in a new parallel for construct. Some requirements:

  • User declares the memory space the data will be provided in
  • User declares the execution space in which the computation will be performed. This is compared against the memory space, and if they differ (e.g. CPU memory and GPU computation), a buffering strategy is deployed
  • User declares the maximum number of tuples (particles) allowed to be allocated in the execution space. This should be a number that doesn't overflow memory in the compute space
  • User optionally provides the number of buffers used to break up computation and data movement
  • This will be employed in a new buffered_parallel_for/buffered_simd_parallel_for concept that implements a fetch/compute/write strategy between the buffers (a rough usage sketch follows this list)
  • This should work for both Kokkos::RangePolicy as well as Cabana::SimdPolicy - we will handle the begin/end loops over partially filled SoAs
  • NOTE: The vector length of the input AoSoA must match that of the AoSoA that is performant in the execution space
  • NOTE: This will require the implementation of an AoSoA/Slice subview to be performant
  • NOTE: The design of this should conceptually be similar to Kokkos::ScatterView - create an object that manages the memory a user will access (i.e. an AoSoA, Slice, or Kokkos View) and then give users access to the active compute buffer in their functors.
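As a rough illustration of the fetch/compute/write pattern described above, here is a minimal sketch using only existing Kokkos calls (Kokkos::View, Kokkos::subview, Kokkos::deep_copy, Kokkos::parallel_for). The view names, chunk size, and functor body are placeholders for this example, and a real buffered_parallel_for would additionally overlap the copies with computation using multiple buffers, which this sequential loop does not attempt.

```cpp
// Hypothetical staging loop illustrating the fetch/compute/write idea.
// All names here (x_host, x_buf, buffer_size) are made up for the example.
#include <Kokkos_Core.hpp>

#include <algorithm>
#include <cstddef>
#include <utility>

int main( int argc, char* argv[] )
{
    Kokkos::initialize( argc, argv );
    {
        // Particle data lives in host memory and is larger than what we
        // want resident in the execution space at any one time.
        const std::size_t num_particle = 1 << 20;
        const std::size_t buffer_size = 1 << 16; // max tuples in the execution space

        Kokkos::View<double*, Kokkos::LayoutRight, Kokkos::HostSpace> x_host(
            "x_host", num_particle );
        Kokkos::deep_copy( x_host, 1.0 );

        // Staging buffer allocated in the default execution space's memory.
        Kokkos::View<double*, Kokkos::LayoutRight> x_buf( "x_buf", buffer_size );

        // Fetch/compute/write over fixed-size chunks. A real
        // buffered_parallel_for would overlap these stages with several
        // buffers and asynchronous copies; this loop is purely sequential.
        for ( std::size_t begin = 0; begin < num_particle; begin += buffer_size )
        {
            const std::size_t end = std::min( begin + buffer_size, num_particle );
            const std::size_t n = end - begin;

            auto host_chunk =
                Kokkos::subview( x_host, std::make_pair( begin, end ) );
            auto buf_chunk =
                Kokkos::subview( x_buf, std::make_pair( std::size_t( 0 ), n ) );

            // Fetch: host -> execution space buffer.
            Kokkos::deep_copy( buf_chunk, host_chunk );

            // Compute: the user functor only sees the active buffer.
            Kokkos::parallel_for(
                "compute_chunk", Kokkos::RangePolicy<>( 0, n ),
                KOKKOS_LAMBDA( const int i ) { buf_chunk( i ) *= 2.0; } );
            Kokkos::fence();

            // Write: execution space buffer -> host.
            Kokkos::deep_copy( host_chunk, buf_chunk );
        }
    }
    Kokkos::finalize();
    return 0;
}
```

The point of the proposed construct is to hide exactly this staging loop behind a single call, so the user functor only ever touches the active buffer in the execution space.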

sslattery avatar Apr 24 '19 18:04 sslattery

FYI, though I think this is an important future feature, it is not urgent. After some improvements to the code (really just that my original version was using a lot more memory than needed), the planned production runs on Summit are possible with the existing code.

ascheinb avatar Apr 28 '19 14:04 ascheinb

Per Aaron's comments, Bob's current priorities, and shifting priorities with NVIDIA, this will not be a priority in the coming month or two.

sslattery avatar May 02 '19 19:05 sslattery

@stanmoore1 relevant for Sparta?

sslattery avatar May 16 '19 20:05 sslattery

Yes, SPARTA has the same issues as you mentioned for XGC. We can wait for this as well though; I'm mostly just trying to scope out the issue. Thanks

stanmoore1 avatar May 16 '19 20:05 stanmoore1

I'm working on a sample API and implementation for this here

Currently it does not work for GPUs, but it is looking good for CPUs. More updates to come in the future.

rfbird avatar Aug 29 '19 20:08 rfbird