Oceananigans.jl
Distributed FFTs using Oceananigans' in-house DiscreteTransforms
This PR removes the PencilFFTs library from Oceananigans and builds a distributed FFT solver using Oceananigans' in-house transforms. This allows us to run both periodic and bounded domains on GPUs. No stretched mesh is supported at the moment (that will come in a later PR).
The transposition is performed through a custom transpose routine built for Oceananigans' fields that assumes:
- the starting configuration is always a z-free configuration.
- the transpose directions are z-free -> y-free -> x-free -> y-free -> z-free
- the y-direction is evenly divisible by the number of ranks that divide the x-direction
- the z-direction is evenly divisible by the number of ranks that divide the y-direction
An additional assumption is that:
- if TY is Bounded, also TZ needs to be Bounded
- if TX is Bounded, also TY needs to be Bounded
All these assumptions can be relaxed in subsequent PRs.
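The topology assumption says that Bounded directions must "nest" from z outward. A minimal sketch of that rule, with a hypothetical helper and plain strings standing in for the topology types (neither is Oceananigans API):

```python
# Illustrative check of the topology assumption above: if TY is Bounded
# then TZ must be Bounded, and if TX is Bounded then TY must be Bounded.
# The helper name and the string representation are hypothetical.

def topologies_are_valid(TX, TY, TZ):
    bounded = lambda t: t == "Bounded"
    if bounded(TY) and not bounded(TZ):
        return False
    if bounded(TX) and not bounded(TY):
        return False
    return True

# (Periodic, Periodic, Bounded) satisfies the assumption...
assert topologies_are_valid("Periodic", "Periodic", "Bounded")
# ...but a Bounded y over a Periodic z does not.
assert not topologies_are_valid("Periodic", "Bounded", "Periodic")
```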
Does this PR essentially discontinue support for distributed CPU transforms? (We need the permutations for compatibility with PencilArrays.)
Maybe a short description of the purpose and thinking behind the PR would help.
No no, it's CPU- and GPU-compatible (it discontinues PencilArrays). I'll write a description soon.
Ok. That makes sense, since CPU is not our goal; we can accept some loss of performance on CPU in order to simplify the code.
The other question is why we are not implementing this in PencilArrays / PencilFFTs. Having an independent implementation may not be the best practice (we want to be good open source community members), but could be justified, maybe.
@matinraayai is working on making PencilArrays performant. This PR is exploratory and is a fallback that might not be merged if we find an efficient way to do GPU transposes with PencilArrays (requires reducing memory allocations and improving the efficiency of permute operations) and implement r2r Fourier transforms in PencilFFTs. For the moment those two elements are part of this PR.
This PR follows the (simple) configuration of the 2decomp library https://github.com/2decomp-fft/2decomp-fft. The difference between PencilFFTs/PencilArrays and this PR (apart from bounded-domain FFTs) is that here (at the moment) we impose the stricter limitation that Ny has to be divisible by Rx and Ry, while Nz has to be divisible by Ry, where Rx and Ry are the number of ranks (divisions) in the x- and y-directions. Relaxing these requirements should not be too difficult.
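To see where both divisors on Ny come from, here is a sketch of the per-rank array shapes in each "free" configuration of an Rx x Ry pencil decomposition, assuming (consistent with the divisibility requirements above, but not stated explicitly in the PR) that the x-free configuration splits y across Rx and z across Ry. The helper is illustrative only:

```python
# Hypothetical per-rank local shapes for each "free" configuration.
# y is split by Ry in the z-free configuration and by Rx in the x-free
# configuration, hence Ny must be divisible by both.

def local_shape(Nx, Ny, Nz, Rx, Ry, free):
    if free == "z":   # z fully local; x split across Rx, y across Ry
        return (Nx // Rx, Ny // Ry, Nz)
    if free == "y":   # y fully local; x split across Rx, z across Ry
        return (Nx // Rx, Ny, Nz // Ry)
    if free == "x":   # x fully local; y split across Rx, z across Ry
        return (Nx, Ny // Rx, Nz // Ry)

Nx, Ny, Nz, Rx, Ry = 16, 12, 8, 4, 2
assert Ny % Rx == 0 and Ny % Ry == 0   # y split two different ways
assert Nz % Ry == 0                    # z split in the y- and x-free configs
# Every configuration holds the same number of points per rank:
n = (Nx * Ny * Nz) // (Rx * Ry)
for free in "xyz":
    s = local_shape(Nx, Ny, Nz, Rx, Ry, free)
    assert s[0] * s[1] * s[2] == n
```

The mismatch between how y is split in the z-free and x-free configurations is also why a direct x <-> z transpose is non-trivial, as discussed further down the thread.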
Nice, thanks for that explanation.
Why are we following 2decomp? PencilArrays has some benchmarking that shows it can compete with the fastest codes out there. I don't see anything similar for 2decomp, so I can't figure out what the motivation for following that strategy would be. I'm not sure if they are different, either.
Something we did not previously support (but which is implemented in https://github.com/CliMA/Oceananigans.jl/pull/2538) is an algorithm that can support any topology with vertically-stretched grids. What is the relationship between this PR and https://github.com/CliMA/Oceananigans.jl/pull/2538, and does this PR support vertically stretched grids?
One of the main limitations of PencilArrays from our perspective is that it could not distribute an array along the first dimension. Since we almost always would like to use vertically stretched grids (and for various reasons, we may want to also compute the hydrostatic pressure with a vertical integral), ocean LES are typically distributed in x and y. Therefore, in order to support 2D domain decompositions, we were faced with somehow changing the underlying memory layout to make the z-dimension the "first" dimension (from the PencilArrays standpoint).
However, if we are going to abandon PencilArrays and pursue our own independent implementation, then we are free to support x, y decompositions and keep z as the local dimension.
What approach does this PR take with respect to distributing domains in x, y?
At the moment, this PR allows slab decomposition in x and y, and pencil decomposition in x, y. This PR implements these transposes:
z -> y
y -> x
x -> y
y -> z
where the above letters stand for the "free" direction (the direction that is completely local). We always start from the z configuration because we do not want to support z-decompositions (because of possible vertical integrals and implicit vertical diffusion).
The FFT based algorithm is
z_transform!
z -> y
y_transform!
y -> x
x_transform!
division by λ
x_transform!
x -> y
y_transform!
y -> z
z_transform!
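The transform -> divide by λ -> inverse-transform structure above can be sketched in serial with NumPy for a triply periodic uniform grid. This is a single-process illustration only (no transposes); the eigenvalues are those of the second-order centered difference operator, and all names are illustrative:

```python
import numpy as np

def laplacian(phi, dx, dy, dz):
    """Second-order centered, triply periodic discrete Laplacian."""
    return ((np.roll(phi, -1, 0) - 2 * phi + np.roll(phi, 1, 0)) / dx**2
          + (np.roll(phi, -1, 1) - 2 * phi + np.roll(phi, 1, 1)) / dy**2
          + (np.roll(phi, -1, 2) - 2 * phi + np.roll(phi, 1, 2)) / dz**2)

def poisson_solve_fft(rhs, dx, dy, dz):
    """Forward transforms, division by the eigenvalues, backward transforms."""
    Nx, Ny, Nz = rhs.shape
    rhs_hat = np.fft.fftn(rhs)
    # Eigenvalues of the 1D second-difference operator in each direction
    ev = lambda N, d: (2 * np.cos(2 * np.pi * np.arange(N) / N) - 2) / d**2
    lam = (ev(Nx, dx)[:, None, None]
         + ev(Ny, dy)[None, :, None]
         + ev(Nz, dz)[None, None, :])
    lam[0, 0, 0] = 1.0      # avoid 0/0 for the mean mode
    phi_hat = rhs_hat / lam
    phi_hat[0, 0, 0] = 0.0  # pin the solution mean to zero
    return np.real(np.fft.ifftn(phi_hat))

# The FFT diagonalizes the periodic difference operator exactly, so the
# solve inverts the discrete Laplacian up to the arbitrary constant:
rng = np.random.default_rng(42)
phi_true = rng.standard_normal((8, 16, 4))
phi_true -= phi_true.mean()
phi = poisson_solve_fft(laplacian(phi_true, 0.1, 0.2, 0.3), 0.1, 0.2, 0.3)
assert np.allclose(phi, phi_true)
```

In the distributed version, the single `fftn`/`ifftn` calls become 1D transforms along the currently free direction, interleaved with the transposes listed above.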
The whole procedure needs 4 transposes for pencil decomposition and 2 for slab decomposition. I am still working on implementing the Fourier tridiagonal solver, which is very easily done naively by increasing the number of transposes:
z -> y
y_transform!
y -> x
x_transform!
x -> y
y -> z
tridiagonal_solve!
z -> y
y_transform!
y -> x
x_transform!
x -> y
y -> z
This requires 8 transposes though, which would kill any possibility of scaling even if we completely fill the GPU memory. I would like to ensure that we always do a maximum of 4 transposes (2 forward and 2 backward). This means playing with memory in the background; I still have to do a little bit of thinking to ensure it. Maybe I have to define other transposes like
x -> z
z -> x
which would remove two of them, but it is slightly more complicated. This is because, in the previous transposes, we switch partitioning between the incoming and outgoing free directions while the third dimension remains untouched. On the other hand, the z-configuration and x-configuration also have a differently partitioned y-direction, which makes the transpose process between x and z non-trivial.
Should we put some preliminary benchmarking results here?
Difference between the PencilFFTs implementation (first profile) and the results of this branch (second profile):
@simone-silvestri We can just use this instead and improve it. PencilFFTs is more trouble than its worth IMO.
@simone-silvestri which files contain the FFT implementation?
If you want to give it a try it's cool, I just tried reviving this branch which was a bit stale so probably there are going to be some errors to fix. There are mainly four files that deal with the distributed Poisson solver.
The distributed FFT implementation is in DistributedComputations/distributed_fft_based_poisson_solver.jl, while the plan setup is in DistributedComputations/plan_distributed_transforms.jl. The transpose is defined in DistributedComputations/transpose_parallel_fields.jl and its setup is in DistributedComputations/parallel_fields.jl.
At the moment the implementation is only for uniform meshes; we need to include an implementation for a stretched direction.
The implementation in DistributedComputations/distributed_fft_based_poisson_solver.jl is a full Poisson solve though. If you are interested only in a three-dimensional distributed FFT, that would be these lines
https://github.com/CliMA/Oceananigans.jl/blob/44e354c0fda628db0715a31d17c5031b9e5e7ec8/src/DistributedComputations/distributed_fft_based_poisson_solver.jl#L141-L147
where the backward transform is done here https://github.com/CliMA/Oceananigans.jl/blob/44e354c0fda628db0715a31d17c5031b9e5e7ec8/src/DistributedComputations/distributed_fft_based_poisson_solver.jl#L161-L166
The actual bottlenecks (as seen from the profile) are the transpose steps; the FFT steps are basically irrelevant. This file shows how to set up and perform the transposes: https://github.com/CliMA/Oceananigans.jl/blob/ss/distributed-fft/test/test_distributed_transpose.jl
Sorry for jumping in, but with this PR, will we be able to, for example, easily save outputs in spectrum form?
Do you mean saving outputs in Fourier space?
The objective of this PR is only to enable multi-GPU support for the nonhydrostatic model through MPI. Saving in Fourier space is independent of the single/multi-GPU framework and could be worked on in an independent PR. I am not sure you would want such a feature though; maybe a feature to automatically build power spectra and cospectra would be nice.
Thanks for clarifying. Yes, I was thinking about being able to perform FFTs in any dimension you choose and save outputs like this, putting the heavy load on the GPU.
Ready to be reviewed. I could not make the ocean LES regression test work for the moment, but we still have to build a bit on this PR, so this is a good stopping point that avoids blowing up this PR too much.
Next steps should be
- include an example (to be run on the distributed pipeline)
- include a tridiagonal solve
- relax the constraints on the divisibility of the grid size by the ranks; this will require a bit of experimenting, but it can easily be done by modifying the buffers in TransposableField to be of variable size
@glwagner @navidcy
I can review this but it's going to take a little while to go through the 34 files so I need a few days.
This PR removes the PencilFFTs library from Oceananigans and builds a distributed FFT solver using Oceananigans' in-house transforms. This allows us to run both periodic and bounded domains on GPUs. No stretched mesh is supported at the moment (that will come in a later PR).
The transposition is performed through a custom transpose routine built for Oceananigans' fields that assumes:
- the starting configuration is always a z-free configuration.
- the transpose directions are z-free -> y-free -> x-free -> y-free -> z-free
- the y-direction is evenly divisible by the number of ranks that divide the x-direction
- the z-direction is evenly divisible by the number of ranks that divide the y-direction
An additional assumption is that:
- if TY is Bounded, also TZ needs to be Bounded
- if TX is Bounded, also TY needs to be Bounded
All these assumptions can be relaxed in subsequent PRs.
Are these statements still accurate?
We need at least a few comments about the specifics of the algorithm here... for example there are some index calculations like
xybuff.send[i + N[1] * (k-1 + N[3] * (j-1))] = yfield[i, j, k]
after staring at this I realize there is an index that is something like
idx = i + Nx * (k-1 + Nz * (j-1))
but I can't figure out where it comes from...
The inverse of this (perhaps) is also computed in unpacking.
Hopefully clarifying these will not be too difficult; it's just a matter of adding a little documentation.
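For what it's worth, that expression is the 1-based column-major linear index of the local array with its second and third axes swapped: i runs fastest, then k, then j, so for each j a contiguous (i, k) plane lands in the buffer. A small Python check of this reading (sizes are arbitrary):

```python
import numpy as np

# idx = i + Nx * (k - 1 + Nz * (j - 1)) with 1-based i, j, k, as in the
# Julia snippet above. Running i fastest, then k, then j shows it is a
# bijection onto 1..Nx*Nz*Ny, equal to the column-major ("F"-order)
# linear index of an array with axis order (i, k, j).

Nx, Ny, Nz = 3, 4, 5
seen = []
for j in range(1, Ny + 1):
    for k in range(1, Nz + 1):
        for i in range(1, Nx + 1):
            idx = i + Nx * (k - 1 + Nz * (j - 1))
            assert idx - 1 == np.ravel_multi_index(
                (i - 1, k - 1, j - 1), (Nx, Nz, Ny), order="F")
            seen.append(idx)
# Every buffer slot from 1 to Nx*Nz*Ny is hit exactly once:
assert sorted(seen) == list(range(1, Nx * Nz * Ny + 1))
```

If this reading is right, the unpacking step would apply the inverse of the same map.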
Where does this PR leave #2538? As I understand it we do not treat vertically stretched grids here (or a mixed FFT/tridiagonal solver). So this moves laterally compared to #2538 and perhaps the algorithm implemented there can be adapted once this is merged?
I was planning to implement the tridiagonal solve similarly to what we do for single GPU, like this:
transpose_z_to_y!(storage) # copy data from storage.zfield to storage.yfield
solver.plan.forward.y!(parent(storage.yfield), buffer.y)
transpose_y_to_x!(storage) # copy data from storage.yfield to storage.xfield
solver.plan.forward.x!(parent(storage.xfield), buffer.x)
transpose_x_to_y!(storage) # copy data from storage.xfield to storage.yfield
transpose_y_to_z!(storage) # copy data from storage.yfield to storage.zfield
# Perform the implicit vertical solve here on storage.zfield...
transpose_z_to_y!(storage)
solver.plan.backward.y!(parent(storage.yfield), buffer.y)
transpose_y_to_x!(storage) # copy data from storage.yfield to storage.xfield
solver.plan.backward.x!(parent(storage.xfield), buffer.x)
transpose_x_to_y!(storage) # copy data from storage.xfield to storage.yfield
transpose_y_to_z!(storage) # copy data from storage.yfield to storage.zfield
This is not super great because it requires eight transposes, but I think it's the same thing that was happening in #2538
Can you just say, you are planning to implement the algorithm in #2538? The way you describe it, I can't figure out if you are coming up with something new or not.
I have added a bit of documentation for the transpose functions. Now everything should be a bit clearer.
suggest @navidcy takes a look
OK, I will. Sorry for slack here!
what do you think @navidcy, we merge?
No stretched mesh is supported at the moment (that will come in a later PR)
Will an info/error be thrown in this case?