Cabana
Cabana copied to clipboard
LinkedCellList::build performance enhacements
1. Extra deep copy
In LinkedCellList::build
, after computing offsets, the cell counters are reset Kokkos::deep_copy( counts, 0 )
But this is an overhead for the sake later usage of Kokkos::atomic_fetch_add
atomic_fetch_sub
Instead of the above, counts
should not be reset, but instead it should be reused with the help of Kokkos::atomic_fetch_sub
:
int c = Kokkos::atomic_fetch_sub( &counts( cell_id ), 1 ) - 1;
permute( offsets( cell_id ) + c ) = p;
EDIT: ignore this as I see below you need counts for BinningData
2. Always allocating memory
In unsteady simulations often construction of Views has significant impact. I advise counts
, offsets
, and permute
to be member variables that are only resized when needed.
3. Atomic exchange
Linked lists can be built using atomic exchange [1], and maybe instead current scheme (atomic per particle + scan + atomic per particle), one can get (atomic + scan per linked list + permute per cell). But maybe this is not a great idea since the memory access would be more random.
Btw great project! I'll start making pull requests instead of issues asap when I get some time to contribute. Cheers, Josip
@j8asic Thanks for the input! We really appreciate you checking out the code and providing feedback. I think you have some good ideas on performance here. We have a performance test that uses the linked cell list in the context of neighbor lists here: https://github.com/ECP-copa/Cabana/blob/master/core/performance_test/benchmark/md_neighbor_perf_test.cpp
We can use this to test your ideas. I think your memory allocation idea would be the easiest. Are you rebuilding the cell list every timestep or some subset of timesteps? Is this for molecular dynamics, PIC, or some other algorithm?
Thanks!
Thanks @sslattery, I still need to read the code (which is neatly written!) to get the full idea, and I see we're doing things similarly (Kokkos as well). The main difference is I deprecated building neighbour lists - I just reiterate linked-list neighbours on interaction and sort the data once in a while (SoA). Simulating incompressible motion, neighbourhoods change each time step but not intensively, so I tried to keep simplicity and cache hits. Does Cabana offer controlling of reordering data? Also distributor is similar but with 1D slicing. I would love to see benchmarks on how the AoSoA reordering and neighbour lists play role in a larger-scale simulation. Am willing to transfer my code and contribute. Do you have any communication channels not to contaminate GH issues with discussions?
Primary improvement already made - other ideas worth pursing at some point