Optimised Laplace solver using Shared-memory MPI?
Edit: The original proposal in this top comment seems impractical (see discussion below), but shared-memory MPI might be useful for a Laplace solver optimisation https://github.com/boutproject/BOUT-dev/issues/3164#issuecomment-3244475961.
For another project, I've been working a lot with shared-memory MPI. I think it might apply fairly straightforwardly to BOUT++ and possibly provide some useful speed-up, with relatively localised code changes (only at a low level in BOUT++, not changing any PhysicsModel code).
Proposal:
- Pick some groups of processes that would share memory - either a node or a subset of a node (e.g. a NUMA region, although my experience so far is that restricting shared-memory arrays to NUMA regions is surprisingly unimportant, so it might be worth defaulting to just full nodes?).
- Allocate the memory backing `Field3D` and `Field2D` using [`MPI_Win_allocate_shared`](https://docs.open-mpi.org/en/main/man-openmpi/man3/MPI_Win_allocate_shared.3.html). MPI will want to assign the local 'view' as non-overlapping slices of the full shared-memory array, but BOUT++ probably wants to modify this to also include the guard cells that are 'owned' by neighbouring processes - this should be possible but might require calling [`MPI_Win_shared_query`](https://docs.open-mpi.org/en/v5.0.4/man-openmpi/man3/MPI_Win_shared_query.3.html) after the `MPI_Win_allocate_shared`. The pattern I used in my other project was to allocate with all indices 'owned' by the root process of the shared-memory communicator, then use `MPI_Win_shared_query` to select the actual index ranges wanted by each process (see the sketch after this list).
- In `Mesh::communicate()`, guard cells that are in shared memory do not need to actually exchange data, but an `MPI_Barrier` is probably needed to synchronise the MPI ranks on the shared-memory communicator to avoid race conditions.
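As a concrete illustration of this allocate-on-root / query pattern, here is a minimal self-contained sketch (not BOUT++ code; the per-node communicator, sizes and variable names are all just for illustration):

```cpp
#include <mpi.h>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);

  // Split MPI_COMM_WORLD into per-node (shared-memory) communicators.
  MPI_Comm node_comm;
  MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL,
                      &node_comm);
  int node_rank, node_size;
  MPI_Comm_rank(node_comm, &node_rank);
  MPI_Comm_size(node_comm, &node_size);

  // Illustrative size: each rank 'owns' nlocal points of a node-wide array.
  const MPI_Aint nlocal = 1000;
  // Allocate the whole block on the node root; other ranks contribute 0 bytes.
  const MPI_Aint my_bytes =
      (node_rank == 0) ? nlocal * node_size * MPI_Aint(sizeof(double)) : 0;

  double* base = nullptr;
  MPI_Win win;
  MPI_Win_allocate_shared(my_bytes, sizeof(double), MPI_INFO_NULL, node_comm,
                          &base, &win);

  // Query the root's segment so every rank gets a pointer to the full array.
  MPI_Aint seg_size;
  int disp_unit;
  double* node_array = nullptr;
  MPI_Win_shared_query(win, 0, &seg_size, &disp_unit, &node_array);

  // This rank's 'owned' slice; a neighbour's points (e.g. what would be
  // guard cells) are just other offsets into node_array.
  double* my_slice = node_array + node_rank * nlocal;
  for (MPI_Aint i = 0; i < nlocal; ++i) {
    my_slice[i] = double(node_rank);
  }

  // Synchronise before any rank reads another rank's points.
  MPI_Barrier(node_comm);

  MPI_Win_free(&win);
  MPI_Comm_free(&node_comm);
  MPI_Finalize();
  return 0;
}
```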
I think the MPI shared memory would be a fairly straightforward addition. At present processes should never modify guard cells (not counting boundary cells, which would not be shared between different processes anyway), so as long as Mesh::communicate() synchronises the shared-memory communicators, there should be no danger of race conditions.
This kind of shared-memory should (I think) be composable with OpenMP threads, so should be just an extra optimisation and not restrict any other features.
The main initial benefit would be to remove the need for copies to fill guard cells within an HPC node. This may not be the rate-limiting cost - it might well be hidden by the time spent in inter-node MPI communication (although, for example, 2D transport simulations that fit on a single node might see a bigger benefit). It would also open up other optimisations. For example, in FFT-based Laplace solvers, if the x-direction is local to a single node you could parallelise the tridiagonal x-direction solves over z instead of x without needing any communication (it would also be possible to generalise to allow the x-domain to extend over multiple nodes with the same algorithm as the current distributed-memory MPI implementation; not sure how much more complicated it would be to include that on a first pass).
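For what the z-parallelised tridiagonal solves could look like once the full x-range is visible in one shared block, a hedged sketch - the [z][x] layout, round-robin z distribution and all names here are assumptions, not the existing cyclic solver's scheme:

```cpp
#include <vector>

// Serial Thomas algorithm for one tridiagonal system of size n
// (sub-diagonal a, diagonal b, super-diagonal c).
void thomas_solve(int n, const double* a, const double* b, const double* c,
                  const double* rhs, double* x) {
  std::vector<double> cp(n), dp(n);
  cp[0] = c[0] / b[0];
  dp[0] = rhs[0] / b[0];
  for (int i = 1; i < n; ++i) {
    const double m = b[i] - a[i] * cp[i - 1];
    cp[i] = c[i] / m;
    dp[i] = (rhs[i] - a[i] * dp[i - 1]) / m;
  }
  x[n - 1] = dp[n - 1];
  for (int i = n - 2; i >= 0; --i) {
    x[i] = dp[i] - cp[i] * x[i + 1];
  }
}

// With the coefficients and rhs laid out [z][x] in a node-wide shared block,
// each node-local rank takes a disjoint set of z-indices and solves the full
// x-range serially - no communication during the solve, only barriers
// before and after it.
void solve_my_z_indices(int node_rank, int node_size, int nx, int nz,
                        const double* a, const double* b, const double* c,
                        const double* rhs, double* x) {
  for (int jz = node_rank; jz < nz; jz += node_size) {
    const int offset = jz * nx;
    thomas_solve(nx, a + offset, b + offset, c + offset, rhs + offset,
                 x + offset);
  }
}
```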
There is a risk of creating race conditions, which can be very annoying to debug (e.g. intermittently incorrect results, and occasional segfaults), but I think the risk is low (as mentioned above).
I won't have any chance to work on this, but would guess that it's maybe an O(1) month project for an RSE (plus extra if you wanted to add shared-memory optimisation for Laplace solvers, etc.).
That sounds interesting, but is it possible to map the different parts of a Field3D so that it appears as a contiguous array in memory when it is not? Otherwise this will require the consistent use of BOUT_FOR rather than triple nested for loops.
I think @ZedThree will soon work on this, as it is also required to allow splitting in the z-direction. With that, it should require only modifications to the regions and the communication pattern? However, I think there will be several spots in the code where there are assumptions that will no longer hold if the storage pattern is different ...
@dschwoerer - that is a good point about what is a contiguous array in memory. My proposal would work in 1D, but probably doesn't in 2D. Need to think harder, but seems likely this is a killer for a simple implementation :cry:
After some more thought, @dschwoerer's point is a very good one. In my other project I just wanted access to the full shared-memory block on every process, and managed the accesses by hand. I assumed that MPI would have some mechanism to support multi-dimensional arrays, but it seems that it does not. I think that means that using MPI shared memory would be much more complicated than I first thought. You would have to do something like extract multiple pointers to different contiguous chunks of the global array for different x-indices (and different y-indices if a distributed z-direction is implemented one day).
Maybe there are workarounds, but I suspect they would be a bit restrictive, so probably conflict with the idea of BOUT++ as a general-purpose framework. For example, if the main thing we wanted to optimise was the Laplace inversion then we could:
- Assume that the z-direction is not distributed
- Add shared-memory parallelism only in the x-direction. This makes the shared-memory setup one-dimensional - the y-direction communication and guard cells stay the same as at present. The shared-memory block would hold a globally contiguous array, with each process within the block owning a contiguous slice.
- This setup would be enough (I think) to parallelise FFT-based Laplace solves over z instead of x, but it's much more single-purpose than what I originally thought might be possible. I suspect that would make it too brittle to really be a good idea to implement - there is a risk that future upgrades/enhancements/features would break the assumptions required for this to work, which could be a headache. So I'm now coming down to the idea that this would be premature optimisation, at least if put in at the low level I originally proposed.
Maybe there could be more use in a more restricted implementation? Make a new Laplace solver implementation that is a copy of cyclic, but allocates its own shared-memory buffers that include as many ranks as possible in the x-direction. Then copy in the rhs from the Field3D/FieldPerp at the start of the solve, and copy out the result at the end. The shared memory would not have to infect the whole code (so race conditions could only happen within this Laplace solver, which is probably a good thing!), and the solver could just be disabled if the z-direction is distributed or other things are incompatible. I think the main benefit would be for 3D simulations, by allowing parallelisation over z, but it could also help 2D simulations a bit, as we could also parallelise over the local y-grid points. Will change the title of this issue to reflect this updated proposal...
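A rough sketch of that shape - the class name, buffer layout, and the placeholder where the actual (e.g. cyclic-style) solve would go are all invented here, just to show how the shared memory stays contained inside the solver:

```cpp
#include <mpi.h>
#include <algorithm>
#include <vector>

class SharedMemLaplaceSketch {
public:
  SharedMemLaplaceSketch(MPI_Comm world, int nlocal) : nlocal_(nlocal) {
    // Communicator of ranks that can share memory (at most one node).
    MPI_Comm_split_type(world, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL,
                        &node_comm_);
    MPI_Comm_rank(node_comm_, &node_rank_);
    MPI_Comm_size(node_comm_, &node_size_);

    // One node-wide buffer, allocated on the node root, queried by everyone.
    const MPI_Aint bytes =
        (node_rank_ == 0) ? MPI_Aint(nlocal_) * node_size_ * sizeof(double) : 0;
    double* base = nullptr;
    MPI_Win_allocate_shared(bytes, sizeof(double), MPI_INFO_NULL, node_comm_,
                            &base, &win_);
    MPI_Aint size;
    int disp;
    MPI_Win_shared_query(win_, 0, &size, &disp, &buffer_);
  }

  ~SharedMemLaplaceSketch() {
    MPI_Win_free(&win_);
    MPI_Comm_free(&node_comm_);
  }

  // rhs/result stand in for this rank's part of a FieldPerp.
  void solve(const std::vector<double>& rhs, std::vector<double>& result) {
    // Copy in: each rank writes only its own slice of the shared buffer.
    double* my_slice = buffer_ + MPI_Aint(node_rank_) * nlocal_;
    std::copy(rhs.begin(), rhs.end(), my_slice);
    MPI_Barrier(node_comm_); // everyone's rhs is visible before solving

    // ... actual solve would go here, e.g. tridiagonal x-solves with the
    // z (and local y) indices shared out between the node-local ranks ...

    MPI_Barrier(node_comm_); // solve finished before anyone copies out
    result.assign(my_slice, my_slice + nlocal_);
  }

private:
  int nlocal_, node_rank_, node_size_;
  MPI_Comm node_comm_;
  MPI_Win win_;
  double* buffer_ = nullptr;
};
```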
Some very brief thoughts that may or may not be relevant:
- it's possible to create custom datatypes in MPI that can deal with the non-contiguous guard cells (see the sketch after this list)
- MPI very likely uses shared memory under the hood for transfers within a NUMA region, although some copying is still required to/from the buffer, so potential for speed-up may be limited to removing these copies
- as @dschwoerer says, one goal of the IPP project is to get `BOUT_FOR` everywhere so that we can add z-guards and parallelise in all directions
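On the first point, a minimal sketch of what such a derived datatype could look like for y-guard cells - the [x][y][z] layout with z fastest and all the sizes here are assumptions for illustration, not BOUT++'s actual layout:

```cpp
#include <mpi.h>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);

  const int nx = 4, ny = 8, nz = 16; // illustrative local sizes
  const int myg = 2;                 // number of y-guard points

  // Assuming a row-major [x][y][z] layout, the lower y-guard region is, for
  // each of the nx x-indices, a contiguous run of myg*nz doubles, with
  // consecutive x-indices ny*nz doubles apart.
  MPI_Datatype yguard_type;
  MPI_Type_vector(/*count=*/nx, /*blocklength=*/myg * nz, /*stride=*/ny * nz,
                  MPI_DOUBLE, &yguard_type);
  MPI_Type_commit(&yguard_type);

  // The whole non-contiguous guard region can then be sent/received in one
  // call, e.g. MPI_Send(data, 1, yguard_type, neighbour, tag, comm), with no
  // manual packing into a temporary buffer.

  MPI_Type_free(&yguard_type);
  MPI_Finalize();
  return 0;
}
```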
Another fun thing I've been wanting to experiment with in terms of custom allocators is to allocate all the evolving variables in one big block that can be shared with the time solver so we skip all copying to/from the solver.
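Toy illustration of that idea (names invented here): if the evolving variables are views into one contiguous allocation, the time solver can treat that allocation directly as its state vector and the copy in/out disappears.

```cpp
#include <cstddef>
#include <vector>

// Non-owning view of one evolving variable inside the big block.
struct FieldView {
  double* data;
  std::size_t size;
};

int main() {
  const std::size_t npoints = 1024; // points per evolving variable
  const std::size_t nvars = 3;      // e.g. density, vorticity, pressure

  // One big allocation that the time solver could use directly as its state.
  std::vector<double> state(nvars * npoints, 0.0);

  // Each evolving variable is a view into a slice of the same block, so the
  // physics code and the solver read/write the same memory - nothing to copy.
  std::vector<FieldView> fields;
  for (std::size_t v = 0; v < nvars; ++v) {
    fields.push_back({state.data() + v * npoints, npoints});
  }
  return 0;
}
```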
Interesting points @ZedThree. Probably needs some performance analysis of the cyclic Laplace solves first to figure out if the cost of the MPI calls is enough to be worth trying to optimise!