Oceananigans.jl
[WIP] Extend MultiRegion to NonhydrostaticModel
Now that `MultiRegion` is finally merged, we can implement the single-node multi-GPU paradigm in the `NonhydrostaticModel` as well.
cc @tomchor
The work can be divided into three tasks:
- [x] Adapt the `NonhydrostaticModel` to accept a `MultiRegionGrid`, i.e., wrap local function calls in `@apply_regionally` and extend global methods in `multi_region_models.jl` (a minimal sketch of this wrapping follows below).
- [ ] Expose the parallelism in the `RungeKutta3` timestepper and in the `update_state!` method. This is achieved by lumping together local function calls (all possible kernel calls such as calculating tendencies, RK substeps, etc.) in outer functions and wrapping everything in `@apply_regionally`.
- [ ] Implement a multi-GPU pressure solver. This can be achieved in a couple of different ways: (1) transpose local memory and perform the FFT one direction at a time (as we do now in the `Distributed` module through PencilArrays); (2) exploit the multi-GPU capabilities of CUDA through the cufftxt library, which can perform single-node distributed FFTs on up to 16 GPUs; (3) allocate storage and plan in unified memory and perform the FFT on only one GPU. Ideally we would implement (3) only if we are desperate. The best solution would be to go with method (2), as (1) incurs hefty memory transfer costs (I am not sure how cufftxt implements the multi-GPU FFT, though).
The first two tasks are quite trivial, so I think the bulk of the work will be on implementing the pressure solver.
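To make the first task concrete, here is a minimal sketch (not the actual implementation) of what wrapping a local call looks like. `set_to_one!` is a hypothetical stand-in for a real kernel launcher; the import path of `@apply_regionally` and the assumption that `CenterField` already accepts a `MultiRegionGrid` may need adjusting for your Oceananigans version.

```julia
using Oceananigans
using Oceananigans.Utils: @apply_regionally   # import path is an assumption

# Build a two-region grid partitioned along x (CPU by default).
grid = RectilinearGrid(size = (16, 16, 16), extent = (1, 1, 1))
mrg  = MultiRegionGrid(grid, partition = XPartition(2))

c = CenterField(mrg)

# A "local" operation: it only touches data owned by a single region,
# so it can simply be applied region by region.
set_to_one!(field) = (fill!(parent(field), 1); nothing)

# The macro extracts each region's slice of `c` and calls the function on it.
@apply_regionally set_to_one!(c)
```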
> Expose the parallelism in the `RungeKutta3` timestepper and in the `update_state!` method. This is achieved by lumping together local function calls (all possible kernel calls such as calculating tendencies, RK substeps, etc.) in outer functions and wrapping everything in `@apply_regionally`.
This is not strictly necessary, right? Just if we want to also support `RungeKutta3`.
You're right, but it is quite easy to do so.
> Implement a multi-GPU pressure solver. This can be achieved in a couple of different ways: (1) transpose local memory and perform the FFT one direction at a time (as we do now in the `Distributed` module through PencilArrays); (2) exploit the multi-GPU capabilities of CUDA through the cufftxt library, which can perform single-node distributed FFTs on up to 16 GPUs; (3) allocate storage and plan in unified memory and perform the FFT on only one GPU. Ideally we would implement (3) only if we are desperate. The best solution would be to go with method (2), as (1) incurs hefty memory transfer costs (I am not sure how cufftxt implements the multi-GPU FFT, though).
I think (but am not 100% sure) that PencilArrays is tied to MPI. So I guess for (1) we are either "borrowing" (but not using directly) the transpose algorithm from PencilArrays, or we are extending the code so that it works "without MPI" -- and also with GPUs, which is work in progress: https://github.com/jipolanco/PencilFFTs.jl/pull/37
Another reason to focus on (2) is that we can use PencilFFTs for the distributed (but not `MultiRegion`) nonhydrostatic model. So even if PencilFFTs had support for `CuArray` now (and if `Distributed` were performant for multi-GPU --- both of which may not be too close), using cufftxt could still potentially be motivated by performance reasons.
In other words, if we develop a capability with cufftxt, then in the future we can also support multi-GPU via PencilFFTs and `Distributed`, and we have two options (one perhaps performant on a single node with multiple GPUs, and another to use when you need GPUs spread across multiple nodes).
Here's some description of cufftxt:
https://docs.nvidia.com/cuda/cufft/index.html#multiple-GPU-cufft-transforms
A new idea for extending the pressure solver to `MultiRegion` is to divide the FFT computations into "local" and "non-local" directions. The FFT in local directions can easily be performed by wrapping the transform in `@apply_regionally`. For non-local directions, if `storage` and `plan` are `unified_array`s, the non-local FFT can be performed by permuting the partitioning of the `MultiRegionGrid` without having to transpose memory (that will happen under the hood thanks to unified memory).
This strategy would not be easily extensible to generally subdivided regions and will play well only with one-direction partitioning, but given the current limitations of the FFT solver (only regular grids), I think it is a good strategy to get something working.
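To illustrate the partition-permutation idea with something runnable, here is a CPU-only sketch using FFTW and plain `Array`s (none of the names below come from Oceananigans): the full `storage` lives in one array, "regions" are just views of it, and switching from an x-partition to a y-partition re-divides the same memory instead of transposing it. On the GPU the analogous thing would use a unified-memory `CuArray` so that CUDA handles any migration between devices.

```julia
using FFTW

# One shared storage array; the "regions" below are views into it.
Nx, Ny, Nz = 16, 16, 8
storage = rand(ComplexF64, Nx, Ny, Nz)
reference = fft(storage)                      # full 3D transform, for checking

# x-partitioned regions: each owns an x-slab and transforms its local y and z.
for xr in (1:Nx÷2, Nx÷2+1:Nx)
    fft!(view(storage, xr, :, :), (2, 3))
end

# Re-partition in y: x is now local to each region and can be transformed.
for yr in (1:Ny÷2, Ny÷2+1:Ny)
    fft!(view(storage, :, yr, :), 1)
end

storage ≈ reference                           # true: same result, no explicit transpose
```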
> For non-local directions, if `storage` and `plan` are `unified_array`s, the non-local FFT can be performed by permuting the partitioning of the `MultiRegionGrid` without having to transpose memory (that will happen under the hood thanks to unified memory).
Just `storage` is an array; `plan` is a `CUFFT` object, not an array.
If we use cufftxt, would this happen by default? I.e., with cufftxt we have to build a unified `storage` (and maybe unified eigenvalues). Then, provided we can fill up `storage` correctly and empty it correctly at the end, the thing that's left is to "just do" the FFT (transposes etc. handled under the hood).
Does broadcasting work with unified memory arrays?
Yep, broadcasting works. My thought was that `plan` can be hacked to store unified memory. I still have to look at the data structure to see how to do it.
cufftxt basically works in the same way (local FFT directions distributed among the workers, then transpose and non-local FFT). I am not sure they use unified memory, but they for sure use transposes: https://on-demand.gputechconf.com/gtc/2014/presentations/S4788-rapid-multi-gpu-programming-cuda-libraries.pdf
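For reference, here is a hedged CUDA.jl sketch of the "unified storage" part of this: allocating the scratch array in unified (managed) memory, broadcasting over it, and building an in-place cuFFT plan for it through the standard AbstractFFTs interface. The buffer type that requests unified memory is an assumption and has been renamed across CUDA.jl versions, so treat it as illustrative rather than definitive.

```julia
using CUDA
using AbstractFFTs: plan_fft!

Nx, Ny, Nz = 64, 64, 64

# Unified (managed) memory storage; the buffer type name is version dependent
# (Mem.UnifiedBuffer corresponds to the CUDA.jl versions of that era).
storage = CuArray{ComplexF64, 3, CUDA.Mem.UnifiedBuffer}(undef, Nx, Ny, Nz)

storage .= 0                    # broadcasting works on unified-memory CuArrays

# In-place plan along the third (vertical) direction; the plan itself is a
# cuFFT plan object, not an array.
plan = plan_fft!(storage, 3)
plan * storage                  # execute the transform on the unified storage
```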
> Yep, broadcasting works. My thought was that `plan` can be hacked to store unified memory. I still have to look at the data structure to see how to do it. cufftxt basically works in the same way (local FFT directions distributed among the workers, then transpose and non-local FFT). I am not sure they use unified memory, but they for sure use transposes: https://on-demand.gputechconf.com/gtc/2014/presentations/S4788-rapid-multi-gpu-programming-cuda-libraries.pdf
Mmm, ok. Is this proposal basically a way to avoid cufftxt? I think what you outlined is roughly how PencilFFTs works:
- FFT along local direction (dim=1)
- Simultaneously communicate and permute data to (2, 1, 3) (or is it (2, 3, 1)?)
- FFT along local direction (dim=2)
- Simultaneously communicate and permute data to (3, 2, 1)
- FFT along dim=3.
At the end, the data has permutation (3, 2, 1). The backwards transform then reverses this process. `solver.storage` is actually a tuple of 3 preallocated arrays to support this algorithm.
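Here is a plain-Julia illustration of that sequence (single process, FFTW, no communication), just to pin down the permutations; each `permutedims` argument is relative to the current layout, and the end state is the (3, 2, 1), i.e. (z, y, x), permutation mentioned above.

```julia
using FFTW

Nx, Ny, Nz = 8, 8, 8
data = rand(ComplexF64, Nx, Ny, Nz)       # layout (x, y, z)

a = fft(data, 1)                          # 1. FFT along the local direction (x)
a = permutedims(a, (2, 1, 3))             # 2. "communicate + permute" → (y, x, z)
a = fft(a, 1)                             # 3. FFT along y (now local)
a = permutedims(a, (3, 1, 2))             # 4. permute → (z, y, x)
a = fft(a, 1)                             # 5. FFT along z

a ≈ permutedims(fft(data), (3, 2, 1))     # true: full 3D transform, permuted
```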
For the tridiagonal solver I think we want to use the same algorithm, except that we skip step 1 (i.e., the first step is to communicate and permute data with no transform). Once the two other transforms are complete, we have data in the configuration (x, y, z) where z is local, and we can do a tridiagonal solve in eigenfunction space. Then we transform back, obtain data back in the (z, y, x) permutation with z local, and copy it into the pressure.
We have to extend the tridiagonal solver to accommodate this kind of permutation for distributed CPU, so if we have an algorithm like the one above we can then also use it for `MultiRegionGrid` solves on the GPU (a tiny illustration of the vertical solve follows below).
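Just to illustrate that last step in isolation (nothing Oceananigans-specific; the matrix below is a generic placeholder, not the actual Poisson operator in eigenfunction space): once the transformed right-hand side sits in (z, y, x) order with z local, the vertical solve is an independent tridiagonal solve down each column.

```julia
using LinearAlgebra

Nz, Ny, Nx = 8, 4, 4
rhs = rand(Nz, Ny, Nx)        # transformed right-hand side, (z, y, x) layout
A = Tridiagonal(fill(1.0, Nz - 1), fill(-2.0, Nz), fill(1.0, Nz - 1))

ϕ = similar(rhs)
for i in 1:Nx, j in 1:Ny
    ϕ[:, j, i] = A \ rhs[:, j, i]     # one independent solve per (y, x) column
end
```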
Exactly, in practice a transpose is communication + permutation, but with `unified_memory` we avoid the communication part (I mean, it is there, but it's handled by CUDA).
> Exactly, in practice a transpose is communication + permutation, but with `unified_memory` we avoid the communication part (I mean, it is there, but it's handled by CUDA).
So maybe we need three unified memory arrays for `solver.storage`, each permuted with respect to one another?
I was thinking it'd be nice to avoid coding it ourselves by using cufftxt, but now that we're talking about it, it doesn't seem too difficult.
Is this PR superseded by #2795?
> Is this PR superseded by #2795?
I think so. Although somehow this one seems to have fewer conflicts...
Also in my opinion we should implement a distributed GPU nonhydrostatic model first. MultiRegion is a cherry on top for sure, but not essential.
Yeah, #2795 does supersede this PR, although #2795 has a custom implementation of transpose + FFT which we might want to avoid now that PencilFFTs supports CuArrays.
Stale and superseded by #2795; I'll close.