
Extending halo gather/scatter to allocate+utilize static MPI comms buffers

Open weinbe2 opened this issue 3 years ago • 6 comments

This draft PR demonstrates an interface to allocate and utilize static MPI comms buffers for the gather/scatter routines in Cabana_Halo.hpp, as opposed to reallocating a new buffer on each call. The interface contains functions that query the necessary buffer size given the current comms situation, allocate buffers (including support for a multiplicative "overallocation" factor), and check whether the current buffer is big enough; this may be an overcomplete set of functions.

The interface is backwards compatible in a way that preserves existing behavior: the existing gather/scatter routines now utilize the new routines "under the hood," dynamically allocating and freeing buffers with each call.

This PR is not merge ready, but it is functional. I am opening this draft PR as an opportunity to share the code and have a discussion on how to evolve it from here. Indeed, some parts of it are redundant (checks, patterns in the gather/scatter, routines that take buffers, etc.), and I ran out of steam with documentation in the later routines, all of which needs to be addressed.

A representative before/after, demonstrating a gather of atom coordinates:

Before:

Cabana::gather( *halo_all[phase], x );

After:

if (!gather_check_send_buffer(*halo_all[phase], x_send_buffer))
  x_send_buffer = Cabana::gather_preallocate_send_buffer(*halo_all[phase], x, 1.1);
if (!gather_check_recv_buffer(*halo_all[phase], x_recv_buffer))
  x_recv_buffer = Cabana::gather_preallocate_recv_buffer(*halo_all[phase], x, 1.1);

Cabana::gather( *halo_all[phase], x, x_send_buffer, x_recv_buffer);

In the "before" case, exactly sized send and receive buffers are allocated within gather each time it's called. In the latter case, on the other hand, we carry around a static send (x_send_buffer) and receive (x_recv_buffer) buffer, and follow a standard pattern:

  • Check whether the existing buffer is large enough (gather_check_[send/recv]_buffer); if not, allocate a new one (gather_preallocate_[send/recv]_buffer) with an optional overallocation factor, here 1.1, to avoid frequent reallocations.
  • Call a new variation of gather which takes in these send and receive buffers.

This pattern can be applied directly to the atom migration routines as well; I just hadn't gotten there yet. Since gather, scatter, and particle migration can all follow the same pattern of check the buffer, reallocate if necessary, then perform the exchange, there's some scope to unify how buffer (re-)allocation is done.

weinbe2 avatar Jul 05 '22 22:07 weinbe2

This is great! I went through a round of cleanup, but I still need to:

  • [x] Add this option to the comm benchmark
  • [x] Add this to the halo unit test
  • [x] Make sure there's nothing that would cause an issue for the same type of interfaces in particle migrate
  • [x] Run benchmarks

streeve avatar Jul 07 '22 21:07 streeve

Codecov Report

Merging #549 (b304012) into master (a43c919) will increase coverage by 0.0%. The diff coverage is 96.5%.

@@           Coverage Diff           @@
##           master    #549    +/-   ##
=======================================
  Coverage    94.9%   94.9%            
=======================================
  Files          47      47            
  Lines        5633    5734   +101     
=======================================
+ Hits         5346    5444    +98     
- Misses        287     290     +3     
Impacted Files                          Coverage Δ
core/src/Cabana_Halo.hpp                95.1% <95.8%> (-0.3%) ↓
core/src/Cabana_CommunicationPlan.hpp   98.5% <98.5%> (-0.1%) ↓
core/src/Cabana_Tuple.hpp               100.0% <0.0%> (ø)


codecov[bot] avatar Jul 07 '22 22:07 codecov[bot]

[Plot: summit_buffer_speedup] Here are the Summit results for gather (very similar for scatter) on a single rank. Each line is a different fraction of particles communicated. Very nice to see a 4-10x speedup on a single V100 and 1-2x on the CPU (with a clearer trend at the largest sizes, which can nearly hide the allocation cost). Multi-rank tests showed more variability, but were similar overall.

streeve avatar Jul 20 '22 21:07 streeve

@weinbe2 @sslattery this is a rewrite with new Gather/Scatter objects that handle the buffers for the user. Base classes should make the migrate versions easier (future PR). This is still missing variadic slice options, though.

streeve avatar Jul 26 '22 16:07 streeve

New performance results (slices & scatter as well): a little noisier, but the same overall trend. [Plot: gather_scatter_speedup]

streeve avatar Jul 29 '22 20:07 streeve

Well, I thought this was done, but I hit a segfault. It only occurs in the 1 million particle performance test, but it's something in the last 3 commits.

streeve avatar Aug 04 '22 22:08 streeve

@sslattery @weinbe2 The last issue I mentioned was just in the host/device benchmark case, due to missing mirrors. No change in the performance comparisons, and this is ready for final review.

streeve avatar Aug 18 '22 15:08 streeve