atlas icon indicating copy to clipboard operation
atlas copied to clipboard

Offload matrix based interpolation

Open l90lpa opened this issue 1 year ago • 9 comments

I've opened this as a draft PR to get early feedback and to recognise that we might want to iterate on the design.

This PR:

  • adds a SparseMatrix class into Atlas that wraps eckit's SparseMatrix class to combine it with GPU memory management
  • adds a hicsparse backend to Atlas's sparse linear algebra interface
  • adds "multiply-add" support into Atlas's sparse linear algebra interface
  • updates interpolation to support use of the hicparse backend of Atlas's sparse linear algebra interface

l90lpa avatar Nov 16 '24 00:11 l90lpa

Thanks @l90lpa, so far I had a first look at the atlas::linalg::SparseMatrix class only.

I am making a note that in principle it could also just inherit from eckit::linalg::SparseMatrix, and add the device capability on top. On the other hand this way it's possible to implement it differently, and construct an eckit::linalg::SparseMatrix only when needed.

wdeconinck avatar Nov 18 '24 21:11 wdeconinck

It should be mentioned explicitly that this PR depends on first merging #237 and should then be rebased on develop.

wdeconinck avatar Nov 18 '24 21:11 wdeconinck

Perhaps this PR is adding too many ingredients at once. Maybe we could just focus on the gpu-offloading first and add the 'multiply-add' later?

In that respect of multiply-add (for a new PR), should matrix_multiply_add not immediately also implement the following formula including $$\alpha$$ and $$\beta$$?

y = \alpha A x + \beta y

I guess here you have $$\alpha=1$$ and $$\beta=1$$, so that:

y = A x + y

That could still be a simpler overloaded API for sure.

wdeconinck avatar Nov 19 '24 11:11 wdeconinck

Perhaps this PR is adding too many ingredients at once. Maybe we could just focus on the gpu-offloading first and add the 'multiply-add' later?

In that respect of multiply-add (for a new PR), should matrix_multiply_add not immediately also implement the following formula including α and β ? y = α A x + β y

I guess here you have α = 1 and β = 1 , so that: y = A x + y

That could still be a simpler overloaded API for sure.

Regarding PR content: I'm happy to break-up the PR however you would like. I only included the complete work so that you would be able to see an overview in advance. That said, I just want to mention that the reason multiply-add is part of this work is to facilitate the GPU offload of interpolation's execute_adjoint.

Regarding multiply-add: whether matrix_multiply_add should immediately implement the more broad interface is up to you (I'm happy either way). I chose not to, because I wasn't aware of a need for the additional behaviour, and so I didn't want to introduce an interface that wouldn't get used.

l90lpa avatar Nov 19 '24 12:11 l90lpa

I appreciate very much this "draft" PR to show the complete work! Thanks!  🙏 I had a deeper look now and I definitely like your approach taken so far.

Ideally with the overall design in mind now this can be split several different PRs in this order:

  1. Implementing multiply-add, possibly with the extended capability using $$\alpha$$ and $$\beta$$... ...and update the adjoint interpolation methods to use this. Because this routine did not exist before, @MarekWlasak implemented the adjoint as first a matrix-multiply followed by a += . Perhaps this approach should still be used for the eckit-backend codepaths, and could possibly live inside the matrix_multiply_add implementation for the eckit-backend (within atlas) itself.
  2. Implement GPU-offloading capability for the new sparse matrix
  3. Create hicsparse backend
  4. Adapt interpolation methods to enable the GPU offloading.

wdeconinck avatar Nov 19 '24 14:11 wdeconinck

That sounds like a great plan, and thanks for taking a look so far! I'll work on submitting those PR's.

l90lpa avatar Nov 19 '24 14:11 l90lpa

I've submitted PR's for the first 2 items (https://github.com/ecmwf/atlas/pull/240 and https://github.com/ecmwf/atlas/pull/241). And, I'll submitted the remaining 2 PRs as their dependencies pass review and get merged in.

l90lpa avatar Nov 21 '24 21:11 l90lpa

Dear @l90lpa , steps 1, 2, 3 above have been incorporated now:

  • 1: #240
  • 2: #247
  • 3: #246

so that all that needs to be done now is step 4

  • 4: Adapt interpolation methods to enable the GPU offloading.

wdeconinck avatar Mar 14 '25 11:03 wdeconinck

Hi @wdeconinck, yes that's right. I have a branch (on a fork) that does that contains an implementation of that adaptation. I just haven't opened a PR because priorities changed but I could revise it and open the PR next week if that works for you?

l90lpa avatar Mar 14 '25 11:03 l90lpa