
Distributed FFTs using Oceananigans' inhouse DiscreteTransforms

Open · simone-silvestri opened this issue 1 year ago · 25 comments

This PR removes the PencilFFTs library from Oceananigans and builds a distributed FFT solver using Oceananigans' in-house transforms. This allows us to run both periodic and bounded domains on GPUs. Stretched meshes are not supported at the moment (that will come in a later PR).

The transposition is performed through a custom transpose routine built for Oceananigans' fields that assumes:

  • the starting configuration is always a z-free configuration;
  • the transpose chain is z-free -> y-free -> x-free -> y-free -> z-free;
  • the y-direction is divisible by the number of ranks that divide the x-direction;
  • the z-direction is divisible by the number of ranks that divide the y-direction.

An additional assumption is that:

  • if TY is Bounded, TZ also needs to be Bounded
  • if TX is Bounded, TY also needs to be Bounded

All these assumptions can be relaxed in subsequent PRs.
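The two divisibility assumptions above can be expressed as a quick validity check. This is a hypothetical Python sketch, not part of the PR; `Rx` and `Ry` stand for the numbers of ranks dividing the x- and y-directions.

```python
# Hypothetical sketch of the divisibility assumptions listed above
# (illustrative only, not Oceananigans code).
def decomposition_is_valid(Ny, Nz, Rx, Ry):
    # y must be divisible by the ranks that divide x,
    # and z by the ranks that divide y
    return Ny % Rx == 0 and Nz % Ry == 0

assert decomposition_is_valid(Ny=256, Nz=128, Rx=4, Ry=2)      # satisfies both
assert not decomposition_is_valid(Ny=254, Nz=128, Rx=4, Ry=2)  # 254 % 4 != 0
```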

simone-silvestri avatar Sep 20 '23 06:09 simone-silvestri

Does this PR essentially discontinue support for distributed CPU transforms? (We need the permutations for compatibility with PencilArrays.)

Maybe a short description of the purpose and thinking behind the PR would help.

glwagner avatar Sep 20 '23 11:09 glwagner

No no, it's CPU- and GPU-compatible (it discontinues PencilArrays). I'll write a description soon.

simone-silvestri avatar Sep 20 '23 11:09 simone-silvestri

Ok. That makes sense, since CPU is not our goal; we can accept some loss of performance on CPU in order to simplify the code.

The other question is why we are not implementing this in PencilArrays / PencilFFTs. Having an independent implementation may not be the best practice (we want to be good open source community members), but could be justified, maybe.

glwagner avatar Sep 20 '23 11:09 glwagner

@matinraayai is working on making PencilArrays performant. This PR is exploratory and is a fallback that might not be merged if we find an efficient way to do GPU transposes with PencilArrays (requires reducing memory allocations and improving the efficiency of permute operations) and implement r2r Fourier transforms in PencilFFTs. For the moment those two elements are part of this PR.

This PR follows the (simple) configuration of the 2decomp library (https://github.com/2decomp-fft/2decomp-fft). The difference between PencilFFTs/PencilArrays and this PR (apart from bounded-domain FFTs) is that here, at the moment, we impose the stricter limitation that Ny has to be divisible by Rx and Ry, while Nz has to be divisible by Ry, where Rx and Ry are the number of ranks (divisions) in the x- and y-directions. Relaxing these requirements should not be too difficult.

simone-silvestri avatar Sep 20 '23 12:09 simone-silvestri

> @matinraayai is working on making PencilArrays performant. This PR is exploratory and is a fallback that might not be merged […] Relaxing the requirements should not be too difficult.

Nice, thanks for that explanation.

Why are we following 2decomp? PencilArrays has some benchmarking that shows it can compete with the fastest codes out there. I don't see anything similar for 2decomp, so I can't figure out what the motivation for following that strategy would be. I'm not sure if they are different, either.

Something we did not previously support (but which is implemented in https://github.com/CliMA/Oceananigans.jl/pull/2538) is an algorithm that can support any topology with vertically-stretched grids. What is the relationship between this PR and https://github.com/CliMA/Oceananigans.jl/pull/2538, and does this PR support vertically stretched grids?

One of the main limitations of PencilArrays from our perspective is that it could not distribute an array along the first dimension. Since we almost always would like to use vertically stretched grids (and for various reasons, we may want to also compute the hydrostatic pressure with a vertical integral), ocean LES are typically distributed in x and y. Therefore, in order to support 2D domain decompositions, we were faced with somehow changing the underlying memory layout to make the z-dimension the "first" dimension (from the PencilArrays standpoint).

However, if we are going to abandon PencilArrays and pursue our own independent implementation, then we are free to support x, y decompositions and keep z as the local dimension.

What approach does this PR take with respect to distributing domains in x, y?

glwagner avatar Sep 20 '23 19:09 glwagner

At the moment, this PR allows slab decomposition in x or y, and pencil decomposition in x and y. This PR implements these transposes:

z -> y
y -> x
x -> y 
y -> z

where each letter stands for the "free" direction (the direction that is completely local). We always start from the z configuration because we do not want to support z-decompositions (because of possible vertical integrals and implicit vertical diffusion). The FFT-based algorithm is:

z_transform! 
z -> y
y_transform!
y -> x
x_transform!
division by λ
x_transform!
x -> y 
y_transform!
y -> z
z_transform!

The whole procedure needs 4 transposes for pencil decomposition and 2 for slab decomposition. I am still working on implementing the Fourier tridiagonal solver, which can be done naively by increasing the number of transposes:

z -> y
y_transform!
y -> x
x_transform!
x -> y 
y -> z
tridiagonal_solve!
z -> y
y_transform!
y -> x
x_transform!
x -> y
y -> z

This requires 8 transposes, though, which would kill any possibility of scaling even if we completely fill the GPU memory. I would like to ensure that we always do a maximum of 4 transposes (2 forward and 2 backward). This means playing with memory in the background; I still have to do a bit of thinking to ensure it. Maybe I have to define other transposes like

x -> z
z -> x

which would remove two of them, but it is slightly more complicated. This is because, in the transposes above, we switch the partitioning between the incoming and outgoing free directions while the third direction remains untouched. The z-configuration and the x-configuration, on the other hand, also partition the y-direction differently, which makes the transpose between the x- and z-configurations non-trivial.
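The partitioning argument can be sketched by tabulating, for each configuration, which rank-grid dimension splits each direction. This is an illustrative Python sketch under the pencil-decomposition assumptions described in this thread (an Rx × Ry rank grid), not library code:

```python
# Sketch (not Oceananigans code): how each configuration partitions the grid
# for a pencil decomposition over an Rx x Ry rank grid.  In each
# configuration one direction is "free" (fully local) and the other two are
# split across the two rank-grid dimensions.
configs = {
    "z-free": {"x": "Rx", "y": "Ry", "z": "local"},
    "y-free": {"x": "Rx", "y": "local", "z": "Ry"},
    "x-free": {"x": "local", "y": "Rx", "z": "Ry"},
}

# Neighbouring configurations in the chain z -> y -> x swap the roles of
# only two directions, so the third direction's partitioning is untouched
# and the transpose is a pairwise exchange between ranks.
assert configs["z-free"]["x"] == configs["y-free"]["x"] == "Rx"
assert configs["y-free"]["z"] == configs["x-free"]["z"] == "Ry"

# A direct z <-> x transpose would additionally have to repartition the
# y-direction (Ry in z-free vs Rx in x-free), which is why it is non-trivial.
assert configs["z-free"]["y"] != configs["x-free"]["y"]
```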

simone-silvestri avatar Sep 22 '23 12:09 simone-silvestri

Should we put some preliminary benchmarking results here?

glwagner avatar Dec 13 '23 17:12 glwagner

Difference between the PencilFFTs implementation (first profile) and the results of this branch (second profile): [two profiler screenshots]

simone-silvestri avatar Mar 08 '24 18:03 simone-silvestri

@simone-silvestri We can just use this instead and improve it. PencilFFTs is more trouble than it's worth, IMO.

matinraayai avatar Mar 08 '24 19:03 matinraayai

@simone-silvestri which files contain the FFT implementation?

matinraayai avatar Mar 08 '24 19:03 matinraayai

If you want to give it a try, that's cool; I just tried reviving this branch, which was a bit stale, so there are probably going to be some errors to fix. There are mainly four files that deal with the distributed Poisson solver. The distributed FFT implementation is in DistributedComputations/distributed_fft_based_poisson_solver.jl, while the plan setup is in DistributedComputations/plan_distributed_transforms.jl. The transpose is defined in DistributedComputations/transpose_parallel_fields.jl and its setup is in DistributedComputations/parallel_fields.jl.

simone-silvestri avatar Mar 08 '24 19:03 simone-silvestri

At the moment the implementation covers only uniform meshes; we need to include an implementation for a stretched direction.

simone-silvestri avatar Mar 08 '24 19:03 simone-silvestri

The implementation in DistributedComputations/distributed_fft_based_poisson_solver.jl is a full Poisson solve, though. If you are interested only in a three-dimensional distributed FFT, that would be these lines https://github.com/CliMA/Oceananigans.jl/blob/44e354c0fda628db0715a31d17c5031b9e5e7ec8/src/DistributedComputations/distributed_fft_based_poisson_solver.jl#L141-L147

where the backward transform is done here https://github.com/CliMA/Oceananigans.jl/blob/44e354c0fda628db0715a31d17c5031b9e5e7ec8/src/DistributedComputations/distributed_fft_based_poisson_solver.jl#L161-L166

simone-silvestri avatar Mar 08 '24 19:03 simone-silvestri

The actual bottlenecks (as seen from the profile) are the transpose steps; the FFT steps are basically irrelevant. This file shows how to set up and perform the transposes: https://github.com/CliMA/Oceananigans.jl/blob/ss/distributed-fft/test/test_distributed_transpose.jl

simone-silvestri avatar Mar 08 '24 19:03 simone-silvestri

Sorry for jumping in, but with this PR, will we be able to, for example, easily save outputs in spectrum form?

iuryt avatar Jun 19 '24 14:06 iuryt

Do you mean saving outputs in Fourier space?

The objective of this PR is only to enable multi-GPU support for the nonhydrostatic model through MPI. Saving in Fourier space is independent of the single-/multi-GPU framework and could be worked on in an independent PR. I am not sure you would want such a feature, though; maybe a feature to automatically build power spectra and cospectra would be nicer.

simone-silvestri avatar Jun 19 '24 14:06 simone-silvestri

Thanks for clarifying. Yes, I was thinking about being able to perform FFTs in any dimension you choose and save outputs like this, putting the heavy load on the GPU.

iuryt avatar Jun 19 '24 15:06 iuryt

Ready to be reviewed. I could not make the ocean LES regression test work for the moment, but since we still have to build on top of this PR, this is a good stopping point that avoids blowing it up too much.

Next steps should be

  • include an example (to be run on the distributed pipeline)
  • include a tridiagonal solve
  • relax the constraints on the divisibility of the grid size by the number of ranks; this will require a bit of experimenting, but it can easily be done by modifying the buffers in TransposableField to be of variable size

simone-silvestri avatar Jun 21 '24 19:06 simone-silvestri

@glwagner @navidcy

simone-silvestri avatar Jun 21 '24 19:06 simone-silvestri

I can review this but it's going to take a little while to go through the 34 files so I need a few days.

glwagner avatar Jun 26 '24 16:06 glwagner

> This PR removes the PencilFFT library from Oceananigans and builds a distributed FFT solver using Oceananigans' inhouse transforms. […] All these assumptions can be relaxed in following PRs

Are these statements still accurate?

glwagner avatar Jun 28 '24 16:06 glwagner

We need at least a few comments about the specifics of the algorithm here... for example there are some index calculations like

xybuff.send[i + N[1] * (k-1 + N[3] * (j-1))] = yfield[i, j, k]

after staring at this I realize there is an index that is something like

idx = i + Nx * (k-1 + Nz * (j-1))

but I can't figure out where it comes from...

The inverse of this (perhaps) is also computed in unpacking.

Hopefully clarifying these will not be too difficult; it's just a matter of adding a little documentation.
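For what it's worth, the expression above looks like a 1-based linear index that packs the (i, j, k) entries with i varying fastest, then k, then j. A small Python sketch (illustrative only, with made-up dimensions) confirms that it visits every buffer slot exactly once:

```python
# Sketch (not Oceananigans code): verify that
#   idx = i + Nx * (k-1 + Nz * (j-1))
# is a bijection from 1-based (i, j, k) onto 1..Nx*Ny*Nz,
# with i fastest, then k, then j.
Nx, Ny, Nz = 4, 3, 2
buf = [None] * (Nx * Ny * Nz)
for j in range(1, Ny + 1):
    for k in range(1, Nz + 1):
        for i in range(1, Nx + 1):
            idx = i + Nx * (k - 1 + Nz * (j - 1))  # 1-based linear index
            buf[idx - 1] = (i, j, k)

# Every slot is filled exactly once, so the packing loses nothing.
assert all(v is not None for v in buf)
# The first Nx slots share j = k = 1: i is the fastest-varying index.
assert buf[:Nx] == [(i, 1, 1) for i in range(1, Nx + 1)]
```

Presumably the unpacking step applies the same formula in reverse, reading `buf` back with the identical loop order.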

glwagner avatar Jun 28 '24 17:06 glwagner

Where does this PR leave #2538? As I understand it we do not treat vertically stretched grids here (or a mixed FFT/tridiagonal solver). So this moves laterally compared to #2538 and perhaps the algorithm implemented there can be adapted once this is merged?

glwagner avatar Jun 28 '24 17:06 glwagner

> Where does this PR leave #2538? As I understand it we do not treat vertically stretched grids here (or a mixed FFT/tridiagonal solver). So this moves laterally compared to #2538 and perhaps the algorithm implemented there can be adapted once this is merged?

I was planning to implement the tridiagonal solve similarly to what we do for a single GPU, as such:

    transpose_z_to_y!(storage) # copy data from storage.zfield to storage.yfield
    solver.plan.forward.y!(parent(storage.yfield), buffer.y) 
    transpose_y_to_x!(storage) # copy data from storage.yfield to storage.xfield
    solver.plan.forward.x!(parent(storage.xfield), buffer.x)
    transpose_x_to_y!(storage) # copy data from storage.xfield to storage.yfield
    transpose_y_to_z!(storage) # copy data from storage.yfield to storage.zfield
  
    # Perform the implicit vertical solve here on storage.zfield...

    transpose_z_to_y!(storage)
    solver.plan.backward.y!(parent(storage.yfield), buffer.y)
    transpose_y_to_x!(storage) # copy data from storage.yfield to storage.xfield
    solver.plan.backward.x!(parent(storage.xfield), buffer.x)
    transpose_x_to_y!(storage) # copy data from storage.xfield to storage.yfield
    transpose_y_to_z!(storage) # copy data from storage.yfield to storage.zfield

This is not super great because it requires eight transposes, but I think it's the same thing that was happening in #2538.

simone-silvestri avatar Jun 28 '24 17:06 simone-silvestri

> Where does this PR leave #2538? […]
>
> I was planning to implement the tridiagonal solve similarly to what we do for single GPU […] This is not super great because it requires eight transposes, but I think it's the same thing that was happening in #2538

Can you just say whether you are planning to implement the algorithm in #2538? The way you describe it, I can't figure out if you are coming up with something new or not.

glwagner avatar Jun 28 '24 17:06 glwagner

I have added a bit of documentation for the transpose functions; now everything should be a bit clearer.

simone-silvestri avatar Jul 17 '24 14:07 simone-silvestri

suggest @navidcy takes a look

glwagner avatar Jul 20 '24 05:07 glwagner

OK, I will. Sorry for slack here!

navidcy avatar Jul 21 '24 01:07 navidcy

What do you think @navidcy, shall we merge?

simone-silvestri avatar Jul 26 '24 14:07 simone-silvestri

> No stretched mesh is supported at the moment (that will come in a later PR)

Will an info/error be thrown in this case?

navidcy avatar Jul 30 '24 21:07 navidcy