
Distributed mixed FFT / vertical tridiagonal solver

glwagner opened this pull request 3 years ago • 11 comments

This PR builds off #2536 and implements a distributed Poisson solver that uses horizontal FFTs and a vertical tridiagonal solve, with more help from @jipolanco.

When distributed in (x, y), this kind of solver is more expensive than a pure FFT-based solver, because it requires 4 additional transpositions + communication.

For problems that are distributed only in x or y (e.g., a slab decomposition), we can avoid the additional transpositions. ~~Implementing that optimization is TODO for this PR.~~

Some of the details are discussed on https://github.com/jipolanco/PencilFFTs.jl/issues/44.

Future work could in principle support a more efficient version of this solver with a pencil decomposition in (y, z) or (x, z); that would require abstracting the implementation of hydrostatic pressure in NonhydrostaticModel (and, for friendliness, forbidding the use of VerticallyImplicitTimeDiscretization). This memory layout would increase performance for very large problems that require a 2D domain decomposition, since decomposing in (y, z) or (x, z) requires 4 fewer transposes than (x, y). This feature is easy to code, but might take some time to test. We've already noticed on #1910 that lumping hydrostatic and nonhydrostatic pressure produces different (perhaps lower quality) solutions.
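For readers following along, the mixed algorithm can be pictured in serial form: horizontal FFTs diagonalize the horizontal part of the discrete Laplacian, leaving one small tridiagonal system in z per horizontal wavenumber, which a Thomas solve handles in O(nz). The following is a hedged NumPy sketch assuming second-order differences, periodicity in x and y, and homogeneous Neumann conditions in z; it is not the distributed Julia implementation in this PR.

```python
import numpy as np

def thomas(lower, diag, upper, b):
    """Thomas algorithm for a tridiagonal system (serial, O(n))."""
    n = len(diag)
    c = np.array(upper, dtype=complex)
    d = np.array(diag, dtype=complex)
    y = np.array(b, dtype=complex)
    for k in range(1, n):                      # forward elimination
        w = lower[k - 1] / d[k - 1]
        d[k] -= w * c[k - 1]
        y[k] -= w * y[k - 1]
    x = np.empty(n, dtype=complex)
    x[-1] = y[-1] / d[-1]
    for k in range(n - 2, -1, -1):             # back substitution
        x[k] = (y[k] - c[k] * x[k + 1]) / d[k]
    return x

def mixed_poisson_solve(r, dx, dy, dz):
    """Solve the second-order discrete Poisson equation lap(p) = r,
    periodic in x and y, homogeneous Neumann in z: FFTs in the
    horizontal, one tridiagonal solve in z per horizontal wavenumber."""
    nx, ny, nz = r.shape
    rhat = np.fft.fftn(r, axes=(0, 1))         # horizontal transforms
    kx = 2 * np.pi * np.fft.fftfreq(nx)
    ky = 2 * np.pi * np.fft.fftfreq(ny)
    # eigenvalues of the discrete horizontal Laplacian
    lam = (-(2 * np.sin(kx / 2) / dx) ** 2)[:, None] + \
          (-(2 * np.sin(ky / 2) / dy) ** 2)[None, :]
    phat = np.empty_like(rhat)
    for i in range(nx):
        for j in range(ny):
            lower = np.full(nz - 1, 1 / dz**2)
            upper = np.full(nz - 1, 1 / dz**2)
            diag = np.full(nz, -2 / dz**2) + lam[i, j]
            diag[0] += 1 / dz**2               # Neumann at the bottom
            diag[-1] += 1 / dz**2              # Neumann at the top
            b = rhat[i, j, :].copy()
            if i == 0 and j == 0:              # singular mean mode:
                diag[0], upper[0], b[0] = 1.0, 0.0, 0.0  # pin phat = 0
            phat[i, j, :] = thomas(lower, diag, upper, b)
    return np.fft.ifftn(phat, axes=(0, 1)).real
```

Pinning the (0, 0) wavenumber handles the null space of the Neumann problem (the pressure is only defined up to a constant), and requires a zero-mean right-hand side for solvability.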

TODO:

  • [x] Implement a more efficient algorithm for 1D "slab" decompositions
  • [x] Add tests

glwagner avatar May 08 '22 15:05 glwagner

@jipolanco I just sent up another big commit --- I realized that we could use extra_dims when we use a 1D process grid + tridiagonal to save a lot of communication:

https://github.com/CliMA/Oceananigans.jl/blob/e1cac85ff8fdd9032549b1a3c32569bc71a92c1e/src/Distributed/distributed_fft_based_poisson_solver.jl#L307-L314

and

https://github.com/CliMA/Oceananigans.jl/blob/e1cac85ff8fdd9032549b1a3c32569bc71a92c1e/src/Distributed/distributed_fft_based_poisson_solver.jl#L162-L183
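The saving can be pictured serially: with a 1D ("slab") process grid, each rank's local slab holds the full extent of the non-distributed dimensions, so the untransformed vertical dimension can simply be batched through the local horizontal transform with no extra communication (which is, as I understand it, the role extra_dims plays in PencilFFTs). A minimal NumPy illustration of the batching, not the PencilFFTs API:

```python
import numpy as np

# Hypothetical local slab on one rank of a 1D process grid in x:
# full y and z extents are local, only x is distributed.
rng = np.random.default_rng(42)
local = rng.standard_normal((8, 16, 12))   # (x_local, y, z)

# one batched call transforms every z level at once...
batched = np.fft.fft(local, axis=1)

# ...equivalent to transforming each z level separately
looped = np.stack([np.fft.fft(local[:, :, k], axis=1)
                   for k in range(local.shape[2])], axis=2)

print(np.allclose(batched, looped))        # the two agree: True
```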

Let me know what you think.

glwagner avatar May 08 '22 20:05 glwagner

> @jipolanco I just sent up another big commit --- I realized that we could use extra_dims when we use a 1D process grid + tridiagonal to save a lot of communication:

I can't look at the details right now, but that sounds like a good option. To be honest, I haven't really used extra_dims and I was thinking about actually removing it :smile: But if you find it useful then we'll keep it there. Let me know if you find any issues.

jipolanco avatar May 09 '22 08:05 jipolanco

> > @jipolanco I just sent up another big commit --- I realized that we could use extra_dims when we use a 1D process grid + tridiagonal to save a lot of communication:
>
> I can't look at the details right now, but that sounds like a good option. To be honest, I haven't really used extra_dims and I was thinking about actually removing it 😄 But if you find it useful then we'll keep it there. Let me know if you find any issues.

Ok! It seems convenient for 1D process grids / slab decompositions that use a tridiagonal solve along the third dimension rather than a transform. But we'll see...

glwagner avatar May 09 '22 11:05 glwagner

@simone-silvestri this may pass

glwagner avatar May 09 '22 20:05 glwagner

@glwagner it seems the only reason the tests aren't passing is that mpiexecjl isn't properly linked:

```
/bin/bash: /storage5/buildkite-agent/.julia-7523/bin/mpiexecjl: No such file or directory
```

Maybe fix that and merge since (apparently) this PR is otherwise ready to go?
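For reference, mpiexecjl is the MPI launcher wrapper shipped with MPI.jl. If an agent's copy is missing, one way to reinstall and invoke it looks like the following (paths and the test script name are illustrative, not the actual CI configuration):

```shell
# (Re)install the mpiexecjl wrapper; by default it lands in ~/.julia/bin
julia --project -e 'using MPI; MPI.install_mpiexecjl(force=true)'

# Put the wrapper on PATH, then launch e.g. a 4-rank run
export PATH="$HOME/.julia/bin:$PATH"
mpiexecjl --project=. -n 4 julia some_distributed_test.jl
```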

tomchor avatar Jul 18 '22 19:07 tomchor

There's a less trivial error here: https://buildkite.com/clima/oceananigans/builds/7523#311535ff-f56d-410e-8571-c3b1d9757daf

I'll try to restart the whole build and see what happens.

glwagner avatar Jul 19 '22 15:07 glwagner

Just realized the distributed tests have been running for 6 days. I guess it's fair to say there's still something to fix lol

Just killed it to save resources

tomchor avatar Jul 25 '22 14:07 tomchor

@tomchor are you able to test locally? I believe these passed locally for me, so the problem might be relatively easy to solve.

glwagner avatar Jul 25 '22 15:07 glwagner

> @tomchor are you able to test locally? I believe these passed locally for me, so the problem might be relatively easy to solve.

I've never tested anything in parallel locally, but I can definitely try

tomchor avatar Jul 25 '22 15:07 tomchor

@glwagner I ran the tests locally and they got stuck in the same place the CI build got stuck. So it appears that there's something to be fixed here...

tomchor avatar Jul 25 '22 22:07 tomchor

@glwagner how do you run the tests locally? Do you use mpiexecjl?

navidcy avatar Sep 19 '22 00:09 navidcy