
Buffered Parallel For

Open sslattery opened this issue 5 years ago • 5 comments

It was indicated by the XGC team that they need to run more particles than fit in the available GPU memory. Using CUDA UVM allows the particle count to increase, but the page-load overhead of the resulting swapping does not allow for good performance. We propose a buffering strategy that overlaps computation and data fetching in a new parallel for construct. Some requirements:

  • User declares the memory space the data will be provided in
  • User declares the execution space in which the computation will be performed. This is compared against the memory space, and if they differ (e.g. CPU memory and GPU computation), a buffering strategy is deployed
  • User declares the maximum number of tuples (particles) allowed to be allocated in the execution space. This should be a number that doesn't overflow memory in the compute space
  • User optionally provides the number of buffers used to break up computation and data movement
  • This will be employed in a new buffered_parallel_for/buffered_simd_parallel_for concept that implements a fetch/compute/write strategy between the buffers (a rough usage sketch follows this list)
  • This should work for both Kokkos::RangePolicy as well as Cabana::SimdPolicy - we will handle the begin/end loops over partially filled SoAs
  • NOTE: The vector length of the input AoSoA must match that of the AoSoA that is performant in the execution space
  • NOTE: This will require the implementation of an AoSoA/Slice subview to be performant
  • NOTE: The design of this should conceptually be similar to Kokkos::ScatterView - create an object that manages the memory a user will access (i.e. an AoSoA, Slice, or Kokkos View) and then give users access to the active compute buffer in their functors.
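As a rough illustration of the fetch/compute/write pattern described above, here is a minimal sketch using only existing Kokkos calls (Kokkos::View, Kokkos::subview, Kokkos::deep_copy, Kokkos::parallel_for). The view names, chunk size, and functor body are placeholders for this example, and a real buffered_parallel_for would additionally overlap the copies with computation using multiple buffers, which this sequential loop does not attempt.

```cpp
// Hypothetical staging loop illustrating the fetch/compute/write idea.
// All names here (x_host, x_buf, buffer_size) are made up for the example.
#include <Kokkos_Core.hpp>

#include <algorithm>
#include <cstddef>
#include <utility>

int main( int argc, char* argv[] )
{
    Kokkos::initialize( argc, argv );
    {
        // Particle data lives in host memory and is larger than what we
        // want resident in the execution space at any one time.
        const std::size_t num_particle = 1 << 20;
        const std::size_t buffer_size = 1 << 16; // max tuples in the execution space

        Kokkos::View<double*, Kokkos::LayoutRight, Kokkos::HostSpace> x_host(
            "x_host", num_particle );
        Kokkos::deep_copy( x_host, 1.0 );

        // Staging buffer allocated in the default execution space's memory.
        Kokkos::View<double*, Kokkos::LayoutRight> x_buf( "x_buf", buffer_size );

        // Fetch/compute/write over fixed-size chunks. A real
        // buffered_parallel_for would overlap these stages with several
        // buffers and asynchronous copies; this loop is purely sequential.
        for ( std::size_t begin = 0; begin < num_particle; begin += buffer_size )
        {
            const std::size_t end = std::min( begin + buffer_size, num_particle );
            const std::size_t n = end - begin;

            auto host_chunk =
                Kokkos::subview( x_host, std::make_pair( begin, end ) );
            auto buf_chunk =
                Kokkos::subview( x_buf, std::make_pair( std::size_t( 0 ), n ) );

            // Fetch: host -> execution space buffer.
            Kokkos::deep_copy( buf_chunk, host_chunk );

            // Compute: the user functor only sees the active buffer.
            Kokkos::parallel_for(
                "compute_chunk", Kokkos::RangePolicy<>( 0, n ),
                KOKKOS_LAMBDA( const int i ) { buf_chunk( i ) *= 2.0; } );
            Kokkos::fence();

            // Write: execution space buffer -> host.
            Kokkos::deep_copy( host_chunk, buf_chunk );
        }
    }
    Kokkos::finalize();
    return 0;
}
```

The point of the proposed construct is to hide exactly this staging loop behind a single call, so the user functor only ever touches the active buffer in the execution space.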

sslattery avatar Apr 24 '19 18:04 sslattery

FYI, though I think this is an important future feature, it is not urgent. After some improvements to the code (really just that my original version was using a lot more memory than needed), the planned production runs on Summit are possible with the existing code.

ascheinb avatar Apr 28 '19 14:04 ascheinb

Per Aaron's comments, Bob's current priorities, and shifting priorities with NVIDIA, this will not be a priority in the coming month or two.

sslattery avatar May 02 '19 19:05 sslattery

@stanmoore1 relevant for Sparta?

sslattery avatar May 16 '19 20:05 sslattery

Yes, SPARTA has the same issues as you mentioned for XGC. We can wait for this as well though; I'm mostly just trying to scope out the issue. Thanks

stanmoore1 avatar May 16 '19 20:05 stanmoore1

I'm working on a sample API and implementation for this here

Currently it does not work for GPUs, but it is looking good for CPUs. More updates to come in the future.

rfbird avatar Aug 29 '19 20:08 rfbird