ClimaCoupler.jl
O5.2.1 (coupler) Distributed unstructured sparse CPU & GPU matrix-vector multiply, supporting mixed precision
Purpose
We want to implement distributed sparse matrix-vector multiplication to enable parallel online regridding in the coupler.
Cost/Benefits/Risks
- Costs:
  - Developer time
- Risks:
  - Incorrect regridding results if implementation is incorrect
    - Will be minimized by first producing a simple MWE, then more complicated implementations, and testing as we go
- Benefits:
  - Faster regridding
People and Personnel
- Lead: @juliasloan25
- Collaborators: @LenkaNovak
- Reviewers: @simonbyrne @sriharshakandala
Components
There are two major steps involved in regridding:
- Initialization of the sparse weight matrix
  a. This will be done on one process (without MPI) using TempestRemap
- Construction of the target data via multiplication of the source data and weight matrix (see the sketch below)
  a. This will be done using MPI, and is what we must implement here
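At its core, the second step is a sparse matrix-vector product. Below is a minimal serial sketch; the sizes and weights are made up for illustration and stand in for TempestRemap output, so this is not the ClimaCoreTempestRemap code itself.

```julia
using SparseArrays

n_source, n_target = 4, 6

# Step 1 (toy stand-in for TempestRemap): each target point is a weighted
# average of one or two source points, so each row of W sums to 1.
rows = [1, 1, 2, 3, 3, 4, 5, 5, 6]
cols = [1, 2, 2, 2, 3, 3, 3, 4, 4]
wts  = [0.75, 0.25, 1.0, 0.5, 0.5, 1.0, 0.25, 0.75, 1.0]
W = sparse(rows, cols, wts, n_target, n_source)

# Step 2: construct the target data by applying the weights to the source data.
source = rand(n_source)
target = W * source   # this matrix-vector product is what must be distributed
```

In the distributed setting, the rows of `W` (target points) and the entries of `source` live on different processes, which is what the granular steps below address.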
We can divide the process up into even more granular steps:
- Generate weight matrix
  a. This will be done on one process using serial spaces. This is the only location where serial spaces will be used in this implementation.
- Distribute weights to responsible processes (see the MPI sketch after this list)
  a. Use `MPI.scatterv` to send info from the root process to all processes (note that processes may be responsible for varying numbers of weights, so we can't use `scatter`).
  b. We can't exchange a sparse matrix object type, so instead exchange 3 arrays: nonzero values, row indices, and column offsets.
  c. We may also need the root process to send the number of nonzero weights each process is responsible for.
- Exchange source data
  a. At first, send all source data to all processes.
  b. For the next iteration, determine which process needs to receive each index of source data, and send the data only to the correct process.
  c. Eventually, maximize the number of cases where corresponding source and target data are stored on the same process. In this case, no information exchange is needed, so this will minimize exchange operations.
- Perform remapping
  a. Multiplication on the send side: calculate local row dot products and remote row dot products.
  b. Use the ClimaComms graph context to exchange remote dot products, and add these to the previously computed local sums.
- Project remapped data back onto ClimaCore space?
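To make the weight-distribution step concrete, here is a hedged sketch (not the ClimaCoreTempestRemap code) of scattering the nonzero weights and their indices with MPI.jl, assuming MPI.jl v0.20-style keyword APIs. The block-row ownership rule used to build `counts` is a placeholder for the real target-space element distribution, and for simplicity explicit column indices are exchanged rather than column offsets.

```julia
using MPI, SparseArrays

MPI.Init()
comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)
nprocs = MPI.Comm_size(comm)
root = 0

n_target, n_source = 8, 6

if rank == root
    # Toy weight matrix; in practice this comes from TempestRemap.
    W = sprand(n_target, n_source, 0.4)
    rows, cols, vals = findnz(W)

    # Placeholder ownership rule: contiguous blocks of target rows per process.
    owner(r) = min(div(r - 1, cld(n_target, nprocs)), nprocs - 1)

    # Group the nonzeros by owning process so Scatterv! can send contiguous chunks.
    order = sortperm(owner.(rows))
    rows, cols, vals = rows[order], cols[order], vals[order]
    counts = [count(r -> owner(r) == p, rows) for p in 0:(nprocs - 1)]
else
    counts = zeros(Int, nprocs)
end

# Every process needs to know how many nonzero weights it will receive.
MPI.Bcast!(counts, comm; root = root)
nlocal = counts[rank + 1]

# Receive buffers for this process's share of the weights (3 flat arrays,
# since a sparse matrix object can't be exchanged directly).
local_rows = Vector{Int}(undef, nlocal)
local_cols = Vector{Int}(undef, nlocal)
local_vals = Vector{Float64}(undef, nlocal)

# Scatterv (not Scatter) because processes own different numbers of weights.
MPI.Scatterv!(rank == root ? MPI.VBuffer(rows, counts) : nothing, local_rows, comm; root = root)
MPI.Scatterv!(rank == root ? MPI.VBuffer(cols, counts) : nothing, local_cols, comm; root = root)
MPI.Scatterv!(rank == root ? MPI.VBuffer(vals, counts) : nothing, local_vals, comm; root = root)

MPI.Finalize()
```

Run with multiple MPI ranks (e.g. via `mpiexec`), each process ends up holding only the weights for the target rows it owns under this toy ownership rule.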
Implementation phases
1. Initial implementation of distributed regridding - DONE (a minimal sketch of this approach follows the list of phases below)
- Generate the source data field on the non-distributed source space
- Distribute this source data to all processes using built-in MPI broadcast
- Construct the target data field by multiplying the source data with the weight matrix at certain indices
  - This will be done in a distributed fashion, as a set of local multiplications
  - Global source data is available to each process (this will be optimized later on - see below)
  - Target data is local to each process, and we skip multiplication when global target indices don't correspond to the given process
- Use unweighted DSS to broadcast redundant node values of the target data
2. Second implementation - store only local information in each process's `LinearMap`
- Construct a map from global unique indices to local indices on each process.
- Simplify the implementation of `remap!` by converting unique indices to local indices in `generate_map` and storing them in `LinearMap`.
- Using this map, construct a `LinearMap` on each process containing only the information relevant for that process.
3. Optimized implementation using only the necessary source data
- Use "super-halo" information exchange to send source data between processes.
- On each process, use the weight matrix R to determine source indices with non-zero weights. Fill the send buffer with the source data at these indices.
- For the first iteration, send all source data needed by any other process to all other processes.
- For the next iteration, determine which process needs to receive each index of source data, and send the data only to the correct process.
- Note that when source data, target data, and the corresponding weights are all stored on the same process, no information exchange is needed. At a later time, we will optimize the solution by maximizing the amount of data in this case.
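To make phases 1-2 concrete, here is a hedged sketch with hypothetical names (not the ClimaCore implementation): the full source data is broadcast to every process, a global-to-local index map selects the target rows owned by this process, and the multiplication is performed locally. The ClimaCore space/Field machinery is omitted, and the map is built once (as in `generate_map`) rather than inside `remap!`. MPI.jl v0.20-style keyword APIs are assumed.

```julia
using MPI, SparseArrays, Random

MPI.Init()
comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)
nprocs = MPI.Comm_size(comm)
root = 0

n_target, n_source = 9, 6

# Phase 1: every process receives the full source data via an MPI broadcast.
source = rank == root ? rand(n_source) : zeros(n_source)
MPI.Bcast!(source, comm; root = root)

# Toy weights, seeded so every process builds the same matrix.
W = sprand(MersenneTwister(42), n_target, n_source, 0.4)
rows, cols, vals = findnz(W)

# Phase 2: a map from global target indices to local indices on this process.
# A simple round-robin ownership rule stands in for the real element distribution;
# the map is constructed once, not recomputed on every remap call.
owned = [g for g in 1:n_target if mod(g - 1, nprocs) == rank]
global_to_local = Dict(g => l for (l, g) in enumerate(owned))

# Local multiplication: skip nonzeros whose global target index is not owned here.
target_local = zeros(length(owned))
for (g, c, w) in zip(rows, cols, vals)
    haskey(global_to_local, g) || continue
    target_local[global_to_local[g]] += w * source[c]
end

MPI.Finalize()
```

In the v1 implementation an unweighted DSS is then applied to broadcast redundant node values of the target data; that ClimaCore-specific step has no analogue in this plain-array sketch.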
Inputs
- Source data
- Source and target spaces
- Weight matrix produced by TempestRemap, including indices and values of non-zero weights
Results and deliverables
Functions for distributed matrix multiplication in ClimaCoreTempestRemap, and tests for these functions.
Tests include:
- Tests comparing results of distributed (new) and serial (existing) regridding
- Tests comparing results of distributed regridding to expected analytical output
Current status
As of ClimaCore PR #1107 (distributed regridding v1), we are able to regrid from serial spaces to distributed spaces. Note that this regridding only works when the source and target meshes are collocated.
ClimaCore PR #1192 cleans up this implementation a bit by storing local indices in the `LinearMap` object itself (which is constructed only once), rather than computing them in the `remap!` function (which gets called multiple times). Also see https://github.com/CliMA/ClimaCore.jl/issues/1195 for more information.
A concrete example of the distributed regridding is partially implemented in ClimaCore PR #1259. This has been tested on 2 processes when remapping from 2 to 3 elements, and appears correct when compared to serial regridding results. Future work could test this implementation with more than 2 processes, and with more elements than just 2 -> 3.
The next steps are to rework our implementation so that each process uses only its local information and communicated information to perform the remapping. This differs from distributed regridding v1, which does most of the work on the root process and then broadcasts it. Some of the logic for the distributed approach using MPI can be found in the concrete example, such as performing the weight/source data multiplication on the source side, exchanging these products, and then recombining them on the receive side. This next implementation should allow us to perform regridding from a distributed source space to a distributed target space.
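The send-side multiplication described above can be sketched as follows, using hypothetical helper names (`partial_products`, `combine_partials!`): each process forms partial dot products for the global target rows its local source data contributes to, those partial sums are exchanged, and the owner of each target row accumulates them. In practice the exchange would go through the ClimaComms graph context rather than being passed directly between these functions.

```julia
# Send side: for each locally stored weight, multiply it by the local source value
# and accumulate into a partial sum keyed by the global target row it contributes to.
function partial_products(weights, src_lidxs, tgt_gidxs, source_local)
    partial = Dict{Int,Float64}()            # global target row => partial sum
    for (w, s, g) in zip(weights, src_lidxs, tgt_gidxs)
        partial[g] = get(partial, g, 0.0) + w * source_local[s]
    end
    return partial
end

# Receive side: the owner of each target row adds the partial sums received from
# other processes to the ones it computed locally.
function combine_partials!(target_local::Dict{Int,Float64}, received::Dict{Int,Float64})
    for (g, v) in received
        target_local[g] = get(target_local, g, 0.0) + v
    end
    return target_local
end
```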
Task Breakdown And Tentative Due Date
- [x] Minimum working example of the initial implementation (no information exchange) with tests for correctness [7 Apr] - ClimaCore PR #1107
- [x] In `generate_map`, convert unique indices (`target_idxs`) to local indices using `target_global_elem_lidx`. Instead of using `target_global_elem_lidx` to convert global to local indices in `remap!`, do this in `generate_map` and store the local indices in `LinearMap`. [21 Apr] - ClimaCore issue #1191, PR #1192
- [x] Write out a simple concrete example (e.g. remapping from 2 to 3 elements with 4 nodes each) using distributed spaces and MPI to exchange information. This will help us understand how to use MPI functions (i.e. `scatter`, `scatterv`) for our case and then generalize this approach. [18 Aug] - ClimaCore PR #1259
  - 22 Aug: Initial prototype works (as compared to serial remapping results), could use more testing and generalization (see issue for details)
- [ ] Adapt the concrete example to use a weight matrix generated by TempestRemap. This will likely primarily involve index conversions. [22 Sept]
- [ ] Super-halo exchange: Create a mapping from sparse indices to the neighbor pid and the local index of the source data on that process. Use this with our buffer struct to generate source data on the distributed space and send only the relevant data to each process, as opposed to sending all source data. [6 Oct]
- [ ] ~~On each process, store only the local components of `LinearMap` data (`source_idxs`, `target_idxs`, `weights`, `row_indices`). This will allow us to iterate over only the truncated (local) weights in `remap!`. To do this, we need to create send and receive buffers on each process and use MPI's `scatterv` function. [29 Sept]~~
  - 22 August update: This is included in the prototype mentioned above and doesn't need to be a separate component of the implementation.
- [ ] ~~Super-halo exchange - implement a buffer struct containing send and receive data to generate source data on the distributed space and send all source data to all processes. Previously we have been generating the source data on a serial space. [13 Oct]~~
  - 22 August update: The prototype already generates source data on the distributed space. The super-halo exchange is already described in a previous point.
- [ ] Implement an example with a Buildkite driver [20 Oct]
- [ ] Add thorough documentation - perhaps a tutorial or complete API docs [20 Oct]
- Timeline delayed due to Julia OOO March 20-Apr 3
- Timeline delayed due to break as of May 18
Proposed Delivery Date
20 Oct 2023
SDI Revision Log
- After an attempt at the initial implementation, we realized we would need to make modifications to our approach, including an extended "super-halo" exchange instead of using the existing halo exchange setup.
- @LenkaNovak approved on 17 Feb
- The implementation plan was revised into more granular steps now that we've specified and agreed on the next explicit steps (also with @simonbyrne).
- @LenkaNovak approved on 16 Mar
- After attempting to implement `generate_map` using only distributed spaces, we realized this would be quite complicated since we need to interface with TempestRemap and keep track of the global unique indices of data and weights. To get around this, we're constructing mappings between indices and will use those to index into the distributed data locally.
- @LenkaNovak approved on 14 Apr
- As of 18 May, we have decided to put this project on hold for ~3 weeks. This functionality is not very urgent, and we felt that taking some time away from it to focus on more pressing issues would be worthwhile. We hope that after taking a break from this work and returning to it with fresh eyes, we'll be able to more efficiently progress on it. We will tentatively plan to return to this on 12 June (end of academic term). The timelines have been adjusted accordingly.
- @LenkaNovak approved on 22 May
- @juliasloan25 added more details under Components and revised timelines to more accurately reflect our plan after returning to this project.