Cabana
Buffered Parallel For
It was indicated by the XGC team that they need to run more particles than fit in the available GPU memory. Using CUDA UVM allows for more particles, but the overhead of page swapping does not allow for acceptable performance. We propose a buffering strategy that overlaps computation and data fetching in a new parallel for construct. Some requirements:
- User declares the memory space the data will be provided in
- User declares the execution space in which the computation will be performed. This is compared against the memory space, and if they are different (e.g. CPU memory and GPU computation), a buffering strategy is deployed
- User declares the maximum number of tuples (particles) allowed to be allocated in the execution space. This should be a number that doesn't overflow memory in the compute space
- User optionally provides the number of buffers used to break up computation and data movement
- This will be employed in a new `buffered_parallel_for`/`buffered_simd_parallel_for` concept which will implement a fetch/compute/write strategy between the buffers (see the sketch after this list)
- This should work for both `Kokkos::RangePolicy` as well as `Cabana::SimdPolicy` - we will handle the begin/end loops over partially filled SoAs
- NOTE: The vector length of the input AoSoA must match that of the AoSoA that is performant in the execution space
- NOTE: This will require the implementation of an AoSoA/Slice subview to be performant
- NOTE: The design of this should conceptually be similar to `Kokkos::ScatterView` - create an object that manages the memory a user will access (i.e. AoSoA, Slice, Kokkos View) and then give users access to the active compute buffer in their functors.
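For illustration, here is a minimal sketch of the fetch/compute/write idea written directly against plain Kokkos views. This is not the proposed Cabana API; the names, sizes, and placeholder kernel are hypothetical. The full data set lives in host memory and a fixed-size device buffer is cycled over chunks of it. A real `buffered_parallel_for` would overlap the stages with multiple buffers and asynchronous copies rather than running them back to back.

```cpp
#include <Kokkos_Core.hpp>

#include <algorithm>
#include <cstddef>

int main( int argc, char* argv[] )
{
    Kokkos::initialize( argc, argv );
    {
        // Hypothetical sizes: the full data set lives in host memory; only
        // buffer_size tuples are ever resident in the execution space.
        const std::size_t num_particles = 1 << 20;
        const std::size_t buffer_size = 1 << 16;

        // Particle data resident in the (large) host memory space.
        Kokkos::View<double*, Kokkos::HostSpace> x_host( "x_host", num_particles );

        // Fixed-size staging buffer in the default execution space's memory.
        Kokkos::View<double*, Kokkos::DefaultExecutionSpace::memory_space> x_buf(
            "x_buf", buffer_size );

        // Fetch/compute/write loop over chunks. This version is synchronous
        // for clarity; the proposed construct would use multiple buffers so
        // that fetching the next chunk overlaps computing the current one.
        for ( std::size_t begin = 0; begin < num_particles; begin += buffer_size )
        {
            const std::size_t end = std::min( begin + buffer_size, num_particles );
            const std::size_t n = end - begin;

            auto h_chunk = Kokkos::subview( x_host, std::make_pair( begin, end ) );
            auto d_chunk =
                Kokkos::subview( x_buf, std::make_pair( std::size_t( 0 ), n ) );

            // Fetch: host -> device.
            Kokkos::deep_copy( d_chunk, h_chunk );

            // Compute: the user functor only ever sees the active buffer,
            // indexed from zero (placeholder kernel).
            Kokkos::parallel_for(
                "compute_chunk", Kokkos::RangePolicy<>( 0, n ),
                KOKKOS_LAMBDA( const int i ) { d_chunk( i ) += 1.0; } );

            // Write: device -> host.
            Kokkos::deep_copy( h_chunk, d_chunk );
        }
    }
    Kokkos::finalize();
    return 0;
}
```

The proposed construct would hide this pattern behind an object that owns the buffers (similar in spirit to `Kokkos::ScatterView`, as noted above), so user functors only ever index into the active compute buffer.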
FYI, though I think this is an important future feature, it is not urgent. After some improvements to the code (really just that my original version was using a lot more memory than needed), the planned production runs on Summit are possible with the existing code.
Per Aaron's comments, Bob's current priorities, and shifting priorities with NVIDIA, this will not be a priority in the coming month or two.
@stanmoore1 relevant for Sparta?
Yes, SPARTA has the same issues as you mentioned for XGC. We can wait for this as well though--I am more trying to scope out the issue. Thanks
I'm working on a sample API and implementation for this here
Currently it does not work for GPUs, but is looking good for CPUs. More updates to come in the future