
[WIP] Extend MultiRegion to NonhydrostaticModel

Open simone-silvestri opened this issue 2 years ago • 10 comments

Now that MultiRegion is finally merged, we can implement the single-node multi-GPU paradigm in the NonhydrostaticModel as well.

cc @tomchor

The work can be divided into three tasks:

  • [x] Adapt the NonhydrostaticModel to accept a MultiRegionGrid, i.e., wrap local function calls in @apply_regionally and extend global methods in multi_region_models.jl (a minimal sketch of this pattern is shown below the list).
  • [ ] Expose the parallelism in the RungeKutta3 timestepper and in the update_state! method. This is achieved by lumping together local function calls (all possible kernel calls such as tendency calculations, RK substeps, etc.) in outer functions and wrapping everything in @apply_regionally.
  • [ ] Implement a multi-GPU pressure solver. This can be achieved in a few different ways: (1) transpose local memory and perform the FFT one direction at a time (as we do now in the Distributed module through PencilArrays); (2) exploit the multi-GPU capabilities of CUDA through the cuFFTXt library, which can perform a single-node distributed FFT on up to 16 GPUs; (3) allocate the storage and plan in unified memory and perform the FFT on only one GPU. Ideally we would implement (3) only if we are desperate. The best solution would be to go with method (2), as (1) incurs hefty memory transfer costs (I am not sure how cuFFTXt implements the multi-GPU FFT, though).

The first two tasks are quite trivial, so I think the bulk of the work will be on implementing the pressure solver.
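As a rough illustration of the first task, here is a minimal, framework-free sketch of what the @apply_regionally pattern does conceptually; the container and function names are illustrative, not the actual Oceananigans internals:

```julia
# Minimal, framework-free sketch of the @apply_regionally idea: apply the same
# "local" operation to every region of a multi-region container.
# (All names here are illustrative, not the actual Oceananigans internals.)
struct ToyMultiRegionField
    regions::Vector{Vector{Float64}}   # one array per region / device
end

apply_regionally!(f!, field::ToyMultiRegionField) = foreach(f!, field.regions)

# A local kernel that only knows about a single region:
local_kernel!(data) = (data .*= 2)

field = ToyMultiRegionField([rand(4), rand(4)])
apply_regionally!(local_kernel!, field)   # runs the local kernel on each region
```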

simone-silvestri avatar May 05 '22 14:05 simone-silvestri

Expose the parallelism in the RungeKutta3 timestepper and in the update_state! method. This is achieved by lumping together local function calls (all possible kernel calls such as tendency calculations, RK substeps, etc.) in outer functions and wrapping everything in @apply_regionally.

This is not strictly necessary, right? Only if we want to also support RungeKutta3.

glwagner avatar May 05 '22 15:05 glwagner

You're right, but it is quite easy to do.

simone-silvestri avatar May 05 '22 15:05 simone-silvestri

Implement a multi-GPU pressure solver. This can be achieved in a few different ways: (1) transpose local memory and perform the FFT one direction at a time (as we do now in the Distributed module through PencilArrays); (2) exploit the multi-GPU capabilities of CUDA through the cuFFTXt library, which can perform a single-node distributed FFT on up to 16 GPUs; (3) allocate the storage and plan in unified memory and perform the FFT on only one GPU. Ideally we would implement (3) only if we are desperate. The best solution would be to go with method (2), as (1) incurs hefty memory transfer costs (I am not sure how cuFFTXt implements the multi-GPU FFT, though).

I think (but am not 100% sure) that PencilArrays is tied to MPI. So I guess for (1) we are either "borrowing" (but not directly using) the transpose algorithm from PencilArrays, or we are extending the code so that it works "without MPI" -- and also with GPUs, which is work in progress: https://github.com/jipolanco/PencilFFTs.jl/pull/37

Another reason to focus on (2) is that we can use PencilFFTs for the distributed (but not MultiRegion) nonhydrostatic model. So even if PencilFFTs had CuArray support now (and if Distributed were performant for multi-GPU --- neither of which may be too close), using cuFFTXt could still be motivated by performance.

In other words, if we develop a capability with cuFFTXt, then in the future we can also support multi-GPU via PencilFFTs and Distributed, and we have two options (one perhaps performant on a single node with multiple GPUs, and another to use when the GPUs are spread across multiple nodes).

glwagner avatar May 05 '22 15:05 glwagner

Here's some description of cuFFTXt:

https://docs.nvidia.com/cuda/cufft/index.html#multiple-GPU-cufft-transforms
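For reference, the single-GPU building block that any of these approaches would distribute is CUDA.jl's cuFFT wrapper; a minimal sketch, assuming the CUDA.CUFFT submodule exposes the AbstractFFTs-style interface:

```julia
using CUDA
using CUDA.CUFFT   # fft/ifft methods for CuArrays, backed by cuFFT

A  = CuArray(rand(ComplexF32, 64, 64, 64))   # storage on a single GPU
Â  = fft(A)                                  # forward 3D transform on that GPU
A′ = ifft(Â)                                 # inverse transform; A′ ≈ A up to round-off
```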

glwagner avatar May 05 '22 15:05 glwagner

A new idea for extending the pressure solver to MultiRegion is to divide the FFT computations into "local" and "non-local" directions.

The FFT in the local directions can be performed easily by wrapping the transform in @apply_regionally. For the non-local directions, if the storage and plan are unified arrays (i.e., live in unified memory), the non-local FFT can be performed by permuting the partitioning of the MultiRegionGrid without explicitly transposing memory (the transfer happens under the hood thanks to unified memory).

This strategy would not easily extend to arbitrarily subdivided regions and will only play well with one-direction partitioning, but given the current limitations of the FFT solver (regular grids only), I think it is a good strategy to get something working.
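A minimal CPU sketch of the local/non-local split (FFTW and plain arrays stand in for the regional CuArrays, and the explicit permutation stands in for repartitioning the MultiRegionGrid):

```julia
using FFTW

# Sketch: transform the "local" directions first, then permute so the remaining
# ("non-local") direction becomes contiguous and transform it too.
A = rand(ComplexF64, 8, 8, 8)

Â = fft(A, (1, 2))                   # local directions (x, y), done region by region
B = permutedims(Â, (3, 2, 1))        # repartition so z becomes the local direction
B̂ = fft(B, 1)                        # non-local direction, now local after the permute
full = permutedims(B̂, (3, 2, 1))     # back to (x, y, z) ordering

@assert full ≈ fft(A)                # same result as a single 3D transform
```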

simone-silvestri avatar May 06 '22 17:05 simone-silvestri

For the non-local directions, if the storage and plan are unified arrays (i.e., live in unified memory), the non-local FFT can be performed by permuting the partitioning of the MultiRegionGrid without explicitly transposing memory (the transfer happens under the hood thanks to unified memory).

Only storage is an array; plan is a CUFFT object, not an array.

If we use cuFFTXt would this happen by default?

i.e., with cuFFTXt we have to build a unified storage (and maybe unified eigenvalues). Then, provided we can fill up storage correctly and empty it correctly at the end, the only thing left is to "just do" the FFT (transposes etc. handled under the hood).

Does broadcasting work with unified memory arrays?

glwagner avatar May 06 '22 18:05 glwagner

Yep, broadcasting works. My thought was that plan can be hacked to store unified memory. I still have to look at the data structure to see how to do it.
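For what it's worth, allocating unified-memory arrays and broadcasting over them looks roughly like this; the `unified = true` keyword to `cu` is an assumption about the CUDA.jl version in use:

```julia
using CUDA

# Assumed: a CUDA.jl version whose `cu` accepts `unified = true` to request
# unified (managed) memory instead of ordinary device memory.
a = cu(rand(16, 16); unified = true)
b = cu(rand(16, 16); unified = true)

c = a .+ 2 .* b        # broadcasting works on unified-memory CuArrays
```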

cuFFTXt basically works in the same way (the local FFT direction is distributed among the workers, then a transpose and the non-local FFT). I am not sure they use unified memory, but they for sure use transposes: https://on-demand.gputechconf.com/gtc/2014/presentations/S4788-rapid-multi-gpu-programming-cuda-libraries.pdf

simone-silvestri avatar May 06 '22 18:05 simone-silvestri

Yep, broadcasting works. My thought was that plan can be hacked to store unified memory. I still have to look at the data structure to see how to do it.

cuFFTXt basically works in the same way (the local FFT direction is distributed among the workers, then a transpose and the non-local FFT). I am not sure they use unified memory, but they for sure use transposes: https://on-demand.gputechconf.com/gtc/2014/presentations/S4788-rapid-multi-gpu-programming-cuda-libraries.pdf

Mmm, ok. Is this proposal basically a way to avoid cuFFTXt? I think what you outlined is roughly how PencilFFTs works:

  1. FFT along local direction (dim=1)
  2. Simultaneously communicate and permute data to (2, 1, 3) (or is it (2, 3, 1)?)
  3. FFT along local direction (dim=2)
  4. Simultaneously communicate and permute data to (3, 2, 1)
  5. FFT along dim=3.

At the end, the data has permutation (3, 2, 1). The backwards transform then reverses this process. solver.storage is actually a tuple of 3 preallocated arrays to support this algorithm.

For the tridiagonal solver I think we want to use the same algorithm, except that we skip step 1 (i.e., the first step is to communicate and permute the data with no transform). Once the other two transforms are complete we have data in the configuration (x, y, z) where z is local, and we can do a tridiagonal solve in eigenfunction space. Then we transform back, obtain the data back in the (z, y, x) permutation with z local, and copy it into the pressure.

We have to extend the tridiagonal solver to accomodate this kind of permutation for distributed CPU, so if we have an algorithm like the one above we can then also use it for MultiRegionGrid solves on the GPU.
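A minimal CPU sketch of that variant, with placeholder tridiagonal coefficients (the real solver would use the eigenvalues of the discrete horizontal Laplacian and the vertical grid spacings) and with the distribution/permutation machinery omitted:

```julia
using FFTW, LinearAlgebra

# Sketch of "FFT in the horizontal directions + tridiagonal solve along local z".
Nx, Ny, Nz = 8, 8, 8
rhs = rand(ComplexF64, Nx, Ny, Nz)

r̂ = fft(rhs, (1, 2))                 # horizontal transforms (permutes omitted here)

p̂ = similar(r̂)
for j in 1:Ny, i in 1:Nx
    # Placeholder coefficients; the real ones involve horizontal eigenvalues and Δz.
    d = fill(-4.0 - 0.01 * (i + j) + 0im, Nz)   # diagonal
    e = fill(1.0 + 0im, Nz - 1)                 # sub- and super-diagonals
    T = Tridiagonal(e, d, e)
    p̂[i, j, :] = T \ r̂[i, j, :]                 # vertical solve; z is local
end

pressure = real.(ifft(p̂, (1, 2)))    # back-transform the horizontal directions
```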

glwagner avatar May 06 '22 18:05 glwagner

Exactly. In practice a transpose is communication + permutation, but with unified memory we avoid the communication part (I mean, it is there, but it's handled by CUDA).

simone-silvestri avatar May 06 '22 18:05 simone-silvestri

Exactly. In practice a transpose is communication + permutation, but with unified memory we avoid the communication part (I mean, it is there, but it's handled by CUDA).

So maybe we need three unified memory arrays for solver.storage, each permuted with respect to one another?

I was thinking it'd be nice to avoid coding it ourselves by using cuFFTXt, but now that we're talking about it, it doesn't seem too difficult.
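For concreteness, a sketch of what such a three-buffer storage tuple could look like (plain CPU arrays here; the MultiRegion version would presumably allocate them as unified-memory CuArrays):

```julia
# Three preallocated buffers, one per data layout, mirroring the transposed-FFT stages.
Nx, Ny, Nz = 16, 16, 16
storage = (zeros(ComplexF64, Nx, Ny, Nz),   # (x, y, z) layout: x local
           zeros(ComplexF64, Ny, Nx, Nz),   # (y, x, z) layout: y local
           zeros(ComplexF64, Nz, Ny, Nx))   # (z, y, x) layout: z local
```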

glwagner avatar May 06 '22 18:05 glwagner

Is this PR superseded by #2795?

navidcy avatar Jul 01 '23 10:07 navidcy

Is this PR superseded by #2795?

I think so. Although somehow this one seems to have fewer conflicts...

tomchor avatar Jul 01 '23 17:07 tomchor

Also in my opinion we should implement a distributed GPU nonhydrostatic model first. MultiRegion is a cherry on top for sure, but not essential.

glwagner avatar Jul 05 '23 16:07 glwagner

Yeah, #2795 does supersede this PR, although #2795 has a custom implementation of transpose + FFT, which we might want to avoid now that PencilFFTs supports CuArrays.

simone-silvestri avatar Jul 11 '23 07:07 simone-silvestri

Stale and superseded by #2795; I'll close.

simone-silvestri avatar Sep 05 '23 12:09 simone-silvestri