ClimaCoupler.jl
O5.2.1 (coupler) Distributed unstructured sparse CPU & GPU matrix-vector multiply, supporting mixed precision
Purpose
We want to implement distributed sparse matrix-vector multiplication to enable parallel online regridding in the coupler.
Cost/Benefits/Risks
- Costs:
  - Developer time
- Risks:
  - Incorrect regridding results if implementation is incorrect
    - Will be minimized by first producing a simple MWE, then more complicated implementations, and testing as we go
- Benefits:
  - Faster regridding
People and Personnel
- Lead: @juliasloan25
- Collaborators: @LenkaNovak
- Reviewers: @simonbyrne @sriharshakandala
Components
There are two major steps involved in regridding:
- Initialization of the sparse weight matrix
  a. This will be done on one process (without MPI) using TempestRemap
- Construction of the target data via multiplication of the source data and weight matrix (see the sketch below)
  a. This will be done using MPI, and is what we must implement here
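At its core, the second step is a sparse matrix-vector product. Below is a minimal serial sketch; the sizes and weights are made up for illustration and stand in for TempestRemap output, so this is not the ClimaCoreTempestRemap code itself.

```julia
using SparseArrays

n_source, n_target = 4, 6

# Step 1 (toy stand-in for TempestRemap): each target point is a weighted
# average of one or two source points, so each row of W sums to 1.
rows = [1, 1, 2, 3, 3, 4, 5, 5, 6]
cols = [1, 2, 2, 2, 3, 3, 3, 4, 4]
wts  = [0.75, 0.25, 1.0, 0.5, 0.5, 1.0, 0.25, 0.75, 1.0]
W = sparse(rows, cols, wts, n_target, n_source)

# Step 2: construct the target data by applying the weights to the source data.
source = rand(n_source)
target = W * source   # this matrix-vector product is what must be distributed
```

In the distributed setting, the rows of `W` (target points) and the entries of `source` live on different processes, which is what the granular steps below address.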
We can divide the process up into even more granular steps:
- Generate weight matrix
  a. This will be done on one process using serial spaces. This is the only location where serial spaces will be used in this implementation.
- Distribute weights to responsible processes (see the MPI sketch after this list)
  a. Use `MPI.scatterv` to send info from the root process to all processes (note that processes may be responsible for varying numbers of weights, so we can't use `scatter`).
  b. We can't exchange a sparse matrix object type, so instead exchange 3 arrays: nonzero values, row indices, and column offsets.
  c. We may also need the root process to send the number of nonzero weights each process is responsible for.
- Exchange source data
  a. At first, send all source data to all processes.
  b. For the next iteration, determine which process needs to receive each index of source data, and send the data only to the correct process.
  c. Eventually, maximize the number of cases where corresponding source and target data are stored on the same process. In this case, no information exchange is needed, so this will minimize exchange operations.
- Perform remapping
  a. Multiplication on the send side: calculate local row dot products and remote row dot products.
  b. Use the ClimaComms graph context to exchange remote dot products, and add these to the previously computed local sums.
- Project remapped data back onto ClimaCore space?
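To make the weight-distribution step concrete, here is a hedged sketch (not the ClimaCoreTempestRemap code) of scattering the nonzero weights and their indices with MPI.jl, assuming MPI.jl v0.20-style keyword APIs. The block-row ownership rule used to build `counts` is a placeholder for the real target-space element distribution, and for simplicity explicit column indices are exchanged rather than column offsets.

```julia
using MPI, SparseArrays

MPI.Init()
comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)
nprocs = MPI.Comm_size(comm)
root = 0

n_target, n_source = 8, 6

if rank == root
    # Toy weight matrix; in practice this comes from TempestRemap.
    W = sprand(n_target, n_source, 0.4)
    rows, cols, vals = findnz(W)

    # Placeholder ownership rule: contiguous blocks of target rows per process.
    owner(r) = min(div(r - 1, cld(n_target, nprocs)), nprocs - 1)

    # Group the nonzeros by owning process so Scatterv! can send contiguous chunks.
    order = sortperm(owner.(rows))
    rows, cols, vals = rows[order], cols[order], vals[order]
    counts = [count(r -> owner(r) == p, rows) for p in 0:(nprocs - 1)]
else
    counts = zeros(Int, nprocs)
end

# Every process needs to know how many nonzero weights it will receive.
MPI.Bcast!(counts, comm; root = root)
nlocal = counts[rank + 1]

# Receive buffers for this process's share of the weights (3 flat arrays,
# since a sparse matrix object can't be exchanged directly).
local_rows = Vector{Int}(undef, nlocal)
local_cols = Vector{Int}(undef, nlocal)
local_vals = Vector{Float64}(undef, nlocal)

# Scatterv (not Scatter) because processes own different numbers of weights.
MPI.Scatterv!(rank == root ? MPI.VBuffer(rows, counts) : nothing, local_rows, comm; root = root)
MPI.Scatterv!(rank == root ? MPI.VBuffer(cols, counts) : nothing, local_cols, comm; root = root)
MPI.Scatterv!(rank == root ? MPI.VBuffer(vals, counts) : nothing, local_vals, comm; root = root)

MPI.Finalize()
```

Run with multiple MPI ranks (e.g. via `mpiexec`), each process ends up holding only the weights for the target rows it owns under this toy ownership rule.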
Implementation phases
1. Initial implementation of distributed regridding - DONE (a minimal sketch of this approach follows the list of phases below)
- Generate the source data field on the non-distributed source space
- Distribute this source data to all processes using built-in MPI broadcast
- Construct the target data field by multiplying the source data with the weight matrix at certain indices
  - This will be done in a distributed fashion, as a set of local multiplications
  - Global source data is available to each process (this will be optimized later on - see below)
  - Target data is local to each process, and we skip multiplication when global target indices don't correspond to the given process
- Use unweighted DSS to broadcast redundant node values of the target data
2. Second implementation - store only local information in each process's `LinearMap`
- Construct a map from global unique indices to local indices on each process.
- Simplify the implementation of `remap!` by converting unique indices to local indices in `generate_map` and storing them in `LinearMap`.
- Using this map, construct a `LinearMap` on each process containing only the information relevant for that process.
3. Optimized implementation using only the necessary source data
- Use "super-halo" information exchange to send source data between processes.
- On each process, use the weight matrix R to determine source indices with non-zero weights. Fill the send buffer with the source data at these indices.
- For the first iteration, send all source data needed by any other process to all other processes.
- For the next iteration, determine which process needs to receive each index of source data, and send the data only to the correct process.
- Note that when source data, target data, and the corresponding weights are all stored on the same process, no information exchange is needed. At a later time, we will optimize the solution by maximizing the amount of data in this case.
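To make phases 1-2 concrete, here is a hedged sketch with hypothetical names (not the ClimaCore implementation): the full source data is broadcast to every process, a global-to-local index map selects the target rows owned by this process, and the multiplication is performed locally. The ClimaCore space/Field machinery is omitted, and the map is built once (as in `generate_map`) rather than inside `remap!`. MPI.jl v0.20-style keyword APIs are assumed.

```julia
using MPI, SparseArrays, Random

MPI.Init()
comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)
nprocs = MPI.Comm_size(comm)
root = 0

n_target, n_source = 9, 6

# Phase 1: every process receives the full source data via an MPI broadcast.
source = rank == root ? rand(n_source) : zeros(n_source)
MPI.Bcast!(source, comm; root = root)

# Toy weights, seeded so every process builds the same matrix.
W = sprand(MersenneTwister(42), n_target, n_source, 0.4)
rows, cols, vals = findnz(W)

# Phase 2: a map from global target indices to local indices on this process.
# A simple round-robin ownership rule stands in for the real element distribution;
# the map is constructed once, not recomputed on every remap call.
owned = [g for g in 1:n_target if mod(g - 1, nprocs) == rank]
global_to_local = Dict(g => l for (l, g) in enumerate(owned))

# Local multiplication: skip nonzeros whose global target index is not owned here.
target_local = zeros(length(owned))
for (g, c, w) in zip(rows, cols, vals)
    haskey(global_to_local, g) || continue
    target_local[global_to_local[g]] += w * source[c]
end

MPI.Finalize()
```

In the v1 implementation an unweighted DSS is then applied to broadcast redundant node values of the target data; that ClimaCore-specific step has no analogue in this plain-array sketch.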
Inputs
- Source data
- Source and target spaces
- Weight matrix produced by TempestRemap, including indices and values of non-zero weights
Results and deliverables
Functions for distributed matrix multiplication in ClimaCoreTempestRemap, and tests for these functions.
Tests include:
- Tests comparing results of distributed (new) and serial (existing) regridding
- Tests comparing results of distributed regridding to expected analytical output
Current status
As of ClimaCore PR #1107 (distributed regridding v1), we are able to regrid from serial spaces to distributed spaces. Note that this regridding only works when the source and target meshes are collocated.
ClimaCore PR #1192 cleans up this implementation a bit by storing local indices in the `LinearMap` object itself (which is constructed only once), rather than computing them in the `remap!` function (which gets called multiple times). Also see https://github.com/CliMA/ClimaCore.jl/issues/1195 for more information.
A concrete example of the distributed regridding is partially implemented in ClimaCore PR #1259. This has been tested on 2 processes when remapping from 2 to 3 elements, and appears correct when compared to serial regridding results. Future work could test this implementation with more than 2 processes, and with more elements than just 2 -> 3.
The next steps are to rework our implementation so that each process uses only its local information and communicated information to perform the remapping. This differs from distributed regridding v1, which does most of the work on the root process and then broadcasts it. Some of the logic for the distributed approach using MPI can be found in the concrete example, such as performing the weight/source data multiplication on the source side, exchanging these products, and then recombining them on the receive side. This next implementation should allow us to perform regridding from a distributed source space to a distributed target space.
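The send-side multiplication described above can be sketched as follows, using hypothetical helper names (`partial_products`, `combine_partials!`): each process forms partial dot products for the global target rows its local source data contributes to, those partial sums are exchanged, and the owner of each target row accumulates them. In practice the exchange would go through the ClimaComms graph context rather than being passed directly between these functions.

```julia
# Send side: for each locally stored weight, multiply it by the local source value
# and accumulate into a partial sum keyed by the global target row it contributes to.
function partial_products(weights, src_lidxs, tgt_gidxs, source_local)
    partial = Dict{Int,Float64}()            # global target row => partial sum
    for (w, s, g) in zip(weights, src_lidxs, tgt_gidxs)
        partial[g] = get(partial, g, 0.0) + w * source_local[s]
    end
    return partial
end

# Receive side: the owner of each target row adds the partial sums received from
# other processes to the ones it computed locally.
function combine_partials!(target_local::Dict{Int,Float64}, received::Dict{Int,Float64})
    for (g, v) in received
        target_local[g] = get(target_local, g, 0.0) + v
    end
    return target_local
end
```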
Task Breakdown And Tentative Due Date
- [x] Minimum working example of the initial implementation (no information exchange) with tests for correctness [7 Apr] - ClimaCore PR #1107
- [x] In `generate_map`, convert unique indices (`target_idxs`) to local indices using `target_global_elem_lidx`. Instead of using `target_global_elem_lidx` to convert global to local indices in `remap!`, do this in `generate_map` and store the local indices in `LinearMap`. [21 Apr] - ClimaCore issue #1191, PR #1192
- [x] Write out a simple concrete example (e.g. remapping from 2 to 3 elements with 4 nodes each) using distributed spaces and MPI to exchange information. This will help us understand how to use MPI functions (i.e. `scatter`, `scatterv`) for our case and then generalize this approach. [18 Aug] - ClimaCore PR #1259
  - 22 Aug: Initial prototype works (as compared to serial remapping results), could use more testing and generalization (see issue for details)
- [ ] Adapt the concrete example to use a weight matrix generated by TempestRemap. This will likely primarily involve index conversions. [22 Sept]
- [ ] Super-halo exchange: Create a mapping from sparse indices to the neighbor pid and the local index of the source data on that process. Use this with our buffer struct to generate source data on the distributed space and send only the relevant data to each process, as opposed to sending all source data. [6 Oct]
- [ ] ~~On each process, store only the local components of `LinearMap` data (`source_idxs`, `target_idxs`, `weights`, `row_indices`). This will allow us to iterate over only the truncated (local) weights in `remap!`. To do this, we need to create send and receive buffers on each process and use MPI's `scatterv` function. [29 Sept]~~
  - 22 August update: This is included in the prototype mentioned above and doesn't need to be a separate component of the implementation.
- [ ] ~~Super-halo exchange - implement a buffer struct containing send and receive data to generate source data on the distributed space and send all source data to all processes. Previously we have been generating the source data on a serial space. [13 Oct]~~
  - 22 August update: The prototype already generates source data on the distributed space. The super-halo exchange is already described in a previous point.
- [ ] Implement an example with a Buildkite driver [20 Oct]
- [ ] Add thorough documentation - perhaps a tutorial or complete API docs [20 Oct]
- Timeline delayed due to Julia OOO March 20-Apr 3
- Timeline delayed due to break as of May 18
Proposed Delivery Date
20 Oct 2023
SDI Revision Log
- After an attempt at the initial implementation, we realized we would need to make modifications to our approach, including an extended "super-halo" exchange instead of using the existing halo exchange setup.
- @LenkaNovak approved on 17 Feb
- The implementation plan was revised into more granular steps now that we've specified and agreed on the next explicit steps (also with @simonbyrne).
- @LenkaNovak approved on 16 Mar
- After attempting to implement `generate_map` using only distributed spaces, we realized this would be quite complicated since we need to interface with TempestRemap and keep track of the global unique indices of data and weights. To get around this, we're constructing mappings between indices and will use those to index into the distributed data locally.
- @LenkaNovak approved on 14 Apr
- As of 18 May, we have decided to put this project on hold for ~3 weeks. This functionality is not very urgent, and we felt that taking some time away from it to focus on more pressing issues would be worthwhile. We hope that after taking a break from this work and returning to it with fresh eyes, we'll be able to more efficiently progress on it. We will tentatively plan to return to this on 12 June (end of academic term). The timelines have been adjusted accordingly.
- @LenkaNovak approved on 22 May
- @juliasloan25 added more details under Components and revised timelines to more accurately reflect our plan after returning to this project.