
Initial support for MPI parallelization


This issue is to keep track of the necessary steps towards an initial parallel octree implementation in Trixi. It should thus be amended as we progress and gather more experience about which ideas work and which don't. The general goal of the initial MPI implementation is to support nearly all of Trixi's current features in parallel and in the simplest possible way (at the cost of scalability).

  • [x] WP1: Parallel 2D simulations on static, uniform mesh (#167) (achievements: basic infrastructure for parallelization in place, parallelized DG solver)
    • [x] Create redundant octree with concept of cells belonging to different MPI subdomains
    • [x] Simple partitioning of the octree (i.e., assign cells to different domains, support for 2^n MPI ranks)
    • [x] Add mesh type as new parameter to DG solver to be able to specialize methods for parallel execution
    • [x] Add required MPI data structures to DG solver
    • [x] Calculate L2/Linf errors in parallel
    • [x] Exchange MPI surface data in DG solver
    • [x] Validate existing uniform setups in parallel
    • [x] Parallel I/O (poor man's version)
    • [x] Testing in parallel
  • [x] WP2: Non-uniform meshes (#898) (achievements: domain partitioning/space filling curve)
    • [x] Add required MPI data structures for MPI mortars to DG solver
    • [x] Allow arbitrary numbers of domains
  • [x] WP3: AMR (#361) (achievements: infrastructure for parallel mesh and solver adaptation)
    • [x] Distribute AMR indicators in parallel and adapt mesh and solver
    • [x] Re-build MPI data structures in mesh and solver
  • [x] WP4: Load balancing (#390) (achievements: ability to redistribute solver data and to do "online restarts")
    • [x] Implement basic load balancing based on cell count
  • [x] WP5: Extension to 3D (#1062, #1091) (achievements: everything works in 3D)
  • [ ] WP6: Support for multi-physics simulations (goal: all Euler-gravity tests work in parallel)
    • [ ] Static mesh
    • [ ] AMR mesh
    • [ ] Load balancing

Unassigned:

  • [ ] Support for MHD (missing: non-conservative surface flux support)
  • [ ] Support for alpha smoothing in calc_blending_factors
  • [ ] TreeMesh: Implement a partitioning scheme that preserves spatial locality (probably a Morton or Hilbert space-filling curve; see the sketch after this list)
  • [ ] Weight-based load balancing
  • [ ] Parallel I/O with external parallel HDF5 library
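
To illustrate the space-filling-curve idea from the list above, here is a minimal, self-contained sketch (not Trixi.jl code; `morton_index` and `partition_by_morton` are made-up names, and cells are assumed to be given by integer coordinates on the finest level) of ordering cells along a 2D Morton (Z-order) curve and assigning contiguous chunks of the curve to ranks:

```julia
# Interleave the bits of x and y to obtain the 2D Morton (Z-order) index of a cell.
function morton_index(x::UInt32, y::UInt32)
    z = UInt64(0)
    for i in 0:31
        z |= (UInt64((x >> i) & 1) << (2i)) | (UInt64((y >> i) & 1) << (2i + 1))
    end
    return z
end

# Assign cells (given by their integer coordinates) to `nranks` ranks in
# contiguous chunks along the Morton curve; returns a 0-based rank per cell.
function partition_by_morton(coords::Vector{NTuple{2,UInt32}}, nranks::Integer)
    order = sortperm([morton_index(x, y) for (x, y) in coords])
    chunk = cld(length(coords), nranks)
    ranks = similar(order)
    for (pos, cell) in enumerate(order)
        ranks[cell] = (pos - 1) ÷ chunk
    end
    return ranks
end
```

A Hilbert curve preserves locality somewhat better, but Morton indices are simpler since they only require bit interleaving.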

sloede (Sep 04 '20)

Hi, I've been curiously tracking #167, and it seems that MPI will inevitably become a core part of Trixi.jl, i.e. you cannot run Trixi.jl without MPI anymore.

I'm not familiar with the Trixi.jl numerics, but is it possible to add distributed parallelism as a layer on top of the existing package? Can a distributed Trixi.jl model be built by stitching together a bunch of Trixi.jl models (1 per rank) that communicate with each other as needed?

I ask since I've been working on supporting distributed parallelism in Oceananigans.jl (https://github.com/CliMA/Oceananigans.jl/pull/590, but haven't done much since January...) and I'm trying this distributed layer approach so the existing package behavior is completely unmodified, but I'm not sure if it's going to work. Maybe it'll work since Oceananigans.jl is finite volume, but I'm curious to know whether you guys have considered a similar approach (or if it's even possible).

ali-ramadhan (Sep 26 '20)

Hi @ali-ramadhan! Thank you for your comments - I'll try to respond to them one by one.

[...] it seems that MPI will inevitably become a core part of Trixi.jl, i.e. you cannot run Trixi.jl without MPI anymore.

This is true in the sense that we made MPI.jl an explicit dependency instead of using Requires.jl - for the time being, that is. However, we decided to try a route that maintains a non-MPI implementation in parallel to the MPI code path, such that there is no hard algorithmic dependency on MPI. This is mainly motivated by the fact that we want Trixi.jl to be accessible to new and inexperienced users (such as students), and they should be able to work with Trixi without having to deal with any MPI details explicitly.

One example can be found here:
https://github.com/trixi-framework/Trixi.jl/blob/d635cbd2c279e069a332e422ba65fa31c0f77512/src/auxiliary/auxiliary.jl#L16-L33
Those not interested in parallel computation only need to read and understand the implementation of the parse_parameters_file methods starting in lines 16 and 17, but not the one starting in line 21. While this undoubtedly introduces some code redundancies, it gives us the freedom not to implement a particular feature for parallel simulations, and it keeps the serial code free of MPI specifics (except for the added Val parameter used for dispatch).
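
For readers unfamiliar with this pattern, here is a minimal, self-contained sketch of the idea (this is not the actual Trixi.jl code; `read_parameters` and `mpi_parallel` are illustrative names): the serial method is plain Julia, and the MPI-aware method is only reached via dispatch when running on more than one rank.

```julia
using MPI

# Trait that decides at runtime whether we are in an MPI-parallel run.
mpi_parallel() = Val(MPI.Initialized() && MPI.Comm_size(MPI.COMM_WORLD) > 1)

# Single entry point for all users; dispatch selects the implementation.
read_parameters(filename) = read_parameters(filename, mpi_parallel())

# Serial method: no MPI specifics at all (apart from the Val argument).
read_parameters(filename, ::Val{false}) = read(filename, String)

# Parallel method: rank 0 reads the file and broadcasts it to all other ranks.
function read_parameters(filename, ::Val{true})
    comm = MPI.COMM_WORLD
    content = MPI.Comm_rank(comm) == 0 ? read(filename) : UInt8[]
    len = Ref(length(content))
    MPI.Bcast!(len, 0, comm)                   # share the buffer size first
    if MPI.Comm_rank(comm) != 0
        content = Vector{UInt8}(undef, len[])  # allocate receive buffer
    end
    MPI.Bcast!(content, 0, comm)               # then broadcast the file contents
    return String(content)
end
```

The point is that a user who never runs in parallel only ever needs to read the first three definitions.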

However, this is not to say that we won't change this dependency to an optional one in the future... the jury is still out on this, since we want to gather more practical experience first.

I'm not familiar with the Trixi.jl numerics, but is it possible to add distributed parallelism as a layer on top of the existing package? Can a distributed Trixi.jl model be built by stitching together a bunch of Trixi.jl models (1 per rank) that communicate with each other as needed?

In general - and with some restrictions - such an approach would be possible with the DGSEM scheme as well, yes, by leveraging the fact that in the DG method elements are only coupled via the solutions on their faces. As far as I can tell, in CliMA/Oceananigans.jl#590 you try to keep MPI out of the core implementation by using a special boundary condition that exchanges all required data in a way that is transparent to the user. However, I believe there are certain functional limitations you will not get around with such an approach, e.g., if you use explicit time stepping with a global time step or for everything related to I/O. Also, what happens if you need other quantities than the state variables in your halo layer? Thus, depending on the complexity of the numerical methods you want to support, a completely orthogonal separation of MPI code and solver code might make many implementations much more complicated than they necessarily have to be. We therefore opted for an approach where it is OK to add MPI-specific code to the solver, but to use dynamic dispatch to keep it away as much as possible from a user who does not care about distributed parallelism.

[...] I'm trying this distributed layer approach so the existing package behavior is completely unmodified, but I'm not sure if it's going to work. Maybe it'll work since Oceananigans.jl is finite volume, but I'm curious to know whether you guys have considered a similar approach (or if it's even possible)

As stated above, I think this is possible, yes, but only if you accept some restrictions in terms of supported features and performance. For example, with an approach that hides all communication inside the boundary conditions, it becomes hard to overlap communication with computation (probably even impossible, unless you have multiple sweeps over your boundary conditions in each time step).
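
To make the overlap point concrete, here is a hedged sketch (assumed names, not the Trixi.jl implementation) of the usual pattern with MPI.jl: post nonblocking sends/receives for the face data, do the work that needs no neighbor data while the messages are in flight, and only then wait and compute the face terms.

```julia
using MPI

# `send_bufs[i]`/`recv_bufs[i]` hold the face data exchanged with `neighbor_ranks[i]`;
# `compute_interior!` and `compute_faces!` are placeholders for the solver kernels.
function exchange_then_compute!(recv_bufs, send_bufs, neighbor_ranks, comm,
                                compute_interior!, compute_faces!)
    reqs = MPI.Request[]
    for (i, rank) in enumerate(neighbor_ranks)
        push!(reqs, MPI.Irecv!(recv_bufs[i], rank, 0, comm))  # post receives first
        push!(reqs, MPI.Isend(send_bufs[i], rank, 0, comm))
    end

    compute_interior!()   # volume terms etc.: need no neighbor data, overlap with communication

    MPI.Waitall(reqs)     # all face data has now arrived
    compute_faces!()      # surface terms that require the neighbors' face values
end
```

If all communication is hidden inside a boundary condition, there is no natural place to split these two phases, which is essentially the limitation described above.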

I guess in the end it boils down to what your priorities are for Oceananigans.jl: Do you want a simple, accessible code that is easy to understand also for the casual user, or do you want an HPC-optimized code that scales well to thousands (or even hundreds of thousands?) of cores? I doubt that it is possible to achieve both goals at the same time. With Trixi.jl we somewhat aim for the middle ground. But since efficient distributed parallelism is essentially a Really Hard Problem™, I am not sure whether this would suit your needs as well.

OK, this turned out to be a somewhat longer post than expected, and I am still not sure whether I fully answered your questions :thinking: Let me know if not; I'd be happy to further discuss possible MPI strategies in Julia :wink:

sloede (Sep 28 '20)

Thank you for the detailed reply @sloede!

However, we decided to try a route that maintains a non-MPI implementation in parallel to the MPI code path, such that there is no hard algorithmic dependency on MPI. This is mainly motivated by the fact that we want Trixi.jl to be accessible to new and inexperienced users (such as students), and they should be able to work with Trixi without having to deal with any MPI details explicitly.

That would be awesome! Certainly I feel like there's a lack of high-performance models accessible to beginners, and Julia/Trixi.jl could help a lot here.

However, this is not to say that we won't change this dependency to an optional one in the future... the jury is still out on this, since we want to gather more practical experience first.

Yeah, I get the feeling that new solutions are possible with Julia, but there aren't many examples out there (if any?) of large HPC simulation packages that are both beginner-friendly and super performant, so I guess we have to try out different approaches.

As far as I can tell, in CliMA/Oceananigans.jl#590 you try to keep MPI out of the core implementation by using a special boundary condition that exchanges all required data in a way that is transparent to the user. However, I believe there are certain functional limitations you will not get around with such an approach, e.g., if you use explicit time stepping with a global time step or for everything related to I/O.

Yes, those are definitely limitations. I guess for Oceananigans.jl it only makes sense to take global time steps, but the I/O limitations are definitely a concern. We would have to add extra code to handle I/O across ranks. My current idea is for each rank to output to its own file, since this is possible with existing output writers, and then have some post-processor combine all the files at the end of a simulation or something.
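
A hedged sketch of that per-rank output idea (illustrative names, not the Oceananigans.jl or Trixi.jl output writers): each rank writes its own file tagged with its rank, and a serial post-processing step stitches the pieces back together afterwards.

```julia
using MPI

# During the simulation: every rank writes only its own piece of the data.
function write_rank_local_output(data::Vector{Float64}, step::Integer)
    rank = MPI.Comm_rank(MPI.COMM_WORLD)
    filename = "output_step$(step)_rank$(rank).bin"
    write(filename, data)   # plain per-rank file, no parallel I/O library needed
    return filename
end

# After the simulation, in a serial post-processing script: concatenate the pieces.
function combine_rank_files(step::Integer, nranks::Integer)
    open("output_step$(step).bin", "w") do out
        for rank in 0:(nranks - 1)
            write(out, read("output_step$(step)_rank$(rank).bin"))
        end
    end
end
```

For a real output format one would also have to record which part of the global grid each rank owns, so the post-processor knows where each chunk belongs.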

Also, what happens if you need other quantities than the state variables in your halo layer?

Hmmm, that's an interesting point. I guess we only store the primitive state variables in memory, and all intermediate variables are computed on the fly in CPU/GPU kernels. This was an optimization we made to maximize available memory on the GPU so we can fit larger models in memory, perhaps (slightly) to the detriment of CPU models. But maybe it helps us with MPI, if I understood your point correctly?

Thus, depending on the complexity of the numerical methods you want to support, a completely orthogonal separation of MPI code and solver code might make many implementations much more complicated than they necessarily have to be. We therefore opted for an approach where it is OK to add MPI-specific code to the solver, but to use dynamic dispatch to keep it away as much as possible from a user who does not care about distributed parallelism.

That makes a lot of sense. Perhaps I've convinced myself that a completely orthogonal approach should be fine for Oceananigans.jl, but I might encounter unpleasant surprises haha. For sure, some existing features will have to be re-implemented or modified for distributed models via dispatch, and maybe some features won't make it into distributed models.

I guess in the end it boils down to what your priorities are for Oceananigans.jl: Do you want a simple, accessible code that is easy to understand also for the casual user, or do you want an HPC-optimized code that scales well to thousands (or even hundreds of thousands?) of cores? I doubt that it is possible to achieve both goals at the same time. With Trixi.jl we somewhat aim for the middle ground. But since efficient distributed parallelism is essentially a Really Hard Problem™, I am not sure whether this would suit your needs as well.

Julia makes me feel like we can have everything haha, although we have favored GPU optimizations over CPU optimizations so far, so maybe we will find out that our design choices hurt us when we scale up distributed models and face the Really Hard Problem™. I believe we can have scalable, HPC-optimized code that is beginner-friendly with Julia; it might just take a lot of work and refactoring to get there?

I guess I'll have to finish up https://github.com/CliMA/Oceananigans.jl/pull/590 and see what the benchmarks look like.

ali-ramadhan (Oct 01 '20)

@sloede I guess we can update some points?

ranocha (Apr 05 '22)

Thanks, yes indeed. I updated the description to reflect the work that has been done. There are a few points that I moved to the "unassigned" area at the bottom, plus WP6 (multi-physics parallelization). I am leaning towards moving the multi-physics stuff into its own issue, but I am not sure whether the other points should become separate issues or be collected in a meta issue until they are actually being worked on. What's your take on this?

Ultimately, I would like to close this issue since I think we have reached the point of "initial support for MPI parallelization".

sloede (Apr 05 '22)

As long as there aren't any plans for somebody to work on this, it's fine to keep the status quo. If we start working on it, we can track discussions/progress either in PRs or in separate issues. I don't have a strong opinion on this either way.

ranocha (Apr 05 '22)

WP5 sounds like the ParallelTreeMesh works in 3D, but it doesn't, right, @sloede, @ranocha? Maybe adjust this point accordingly?

JoshuaLampert (May 23 '23)

Correct

ranocha (May 24 '23)

The last point in the list can be ticked (see #1332).

JoshuaLampert (May 26 '23)