
decomp: Testing for MPI_SCAN for the decomposition

Open ekluzek opened this issue 3 months ago • 14 comments

Description of changes

Experimental branch for working with using MPI_SCAN for the decomposition to get the gridcell offsets.

This is starting off as very experimental with calls to asserts to make sure it gives the same results as before.
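As background, MPI_SCAN computes an inclusive prefix reduction across ranks, so scanning each rank's local gridcell count and subtracting the rank's own contribution yields its global offset without gathering all counts everywhere. A minimal single-process Python simulation of that arithmetic (the counts are made-up illustrative values, and this is only a sketch of the pattern, not the PR's actual implementation, which may differ in detail):

```python
# Simulate how an MPI_Scan over per-rank gridcell counts yields offsets.
# In MPI, each rank would call MPI_Scan(count, MPI_SUM) and receive the
# inclusive prefix sum; its offset is then (inclusive_sum - own_count).
# Here one loop iteration plays the role of one MPI rank.

def gridcell_offsets(counts):
    offsets = []
    inclusive = 0
    for c in counts:                   # "rank" receives running sum so far
        inclusive += c                 # what MPI_Scan would deliver to it
        offsets.append(inclusive - c)  # exclusive prefix = this rank's offset
    return offsets

counts = [5, 3, 7, 2]                  # hypothetical gridcells owned per rank
print(gridcell_offsets(counts))        # → [0, 5, 8, 15]
```

The same result could be obtained with MPI_Exscan directly; the key point is that each rank learns its offset from a single O(log P) collective instead of a global gather.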

Specific notes

Contributors other than yourself, if any: @johnmauff

CTSM Issues Fixed (include github issue #): Works toward #2995. Fixes #3370

Are answers expected to change (and if so in what way)? no

Any User Interface Changes (namelist or namelist defaults changes)? No

Does this create a need to change or add documentation? Did you do so? No. No.

Testing performed, if any: Running the decomp_init testlist

ekluzek avatar Sep 03 '25 19:09 ekluzek

@johnmauff and I worked on the code to get the mpi_scan testing working. I need to fill out a few more things and make sure it all works, but this was an important step to progress here.

ekluzek avatar Sep 04 '25 18:09 ekluzek

OK, I have the new code functional now, and tried it out with mpas3p75.

Here are the timing results, which do show an improvement; clm_init2 was previously 42 sec.

          clm_init2                                                                  10240  10240  1        26.3516     26.3311     9407    26.3628     4084
            clm_init2_part1                                                          10240  10240  1        16.9451     16.9385     9628    16.9533     4972
            clm_init2_part3                                                          10240  10240  1        8.3216      8.3195      9601    8.3246      260
              clm_instInit_part1                                                     10240  10240  1        5.9943      5.9924      9674    5.9975      260
              clm_instInit_part2                                                     10240  10240  1        2.0007      1.9890      9756    2.0082      3258
              clm_instInit_part3                                                     10240  10240  1        0.3227      0.3100      7045    0.3369      4105
            clm_decompInit_clumps                                                    10240  10240  1        0.5077      0.5067      6460    0.5094      1896
            clm_decompInit_glcp                                                      10240  10240  1        0.2091      0.2044      9420    0.2149      6575
            clm_init2_snow_soil_init                                                 10240  10240  1        0.1811      0.1800      3313    0.1812      1
            clm_init2_part2                                                          10240  10240  1        0.1556      0.1517      471     0.1598      9673
            clm_init2_part5                                                          10240  10240  1        0.0190      0.0057      7368    0.0237      1932
            clm_init2_subgrid                                                        10240  10240  1        0.0064      0.0010      7186    0.0094      4823
            clm_init2_part4                                                          10240  10240  1        0.0059      0.0011      8039    0.0102      1077
          clm_init1                                                                  10240  10240  1        0.4621      0.4507      4391    0.4660      6529

And memory:

    VmPeak    VmSize     VmLck     VmPin     VmHWM     VmRSS    VmData     VmStk     VmExe     VmLib     VmPTE     vmPMD    VmSwap
   2611484   2611356         0         0   1263408   1057192   1509212    163980     20436    202756      4308        -1         0 [VmStatus] CTSM(Memory check): decompInit_lnd: after allocate

From the spreadsheet memory was

RSS 8516 Data 229384 Size 229384 Peak 49156

ekluzek avatar Sep 14 '25 23:09 ekluzek

Erik,

I am confused here. I thought that the huge initialization time we were attempting to eliminate was in decomp_and_domain_from_readmesh. What impact did the MPI_scan have on that section of the code?

John


johnmauff avatar Sep 15 '25 11:09 johnmauff

Erik, I am confused here. I thought that the huge initialization time we were attempting to eliminate was in decomp_and_domain_from_readmesh. What impact did the MPI_scan have on that section of the code? John

It is one of the subroutine calls in there, so this change does help.

Here's some of the timers around that subroutine and some of the sections for it with the update:

          lc_lnd_set_decomp_and_domain_from_readmesh                                 10240  10240  1        17.3806     17.3727     4972    17.3854     9639
            lnd_set_decomp_and_domain_from_readmesh: ESMF mesh                       10240  10240  1        13.3492     13.3273     9617    13.6642     10233

              lnd_set_lndmask_from_maskmesh                                          10240  10240  1        9.4614      9.4388      9617    9.7779      10233
            lnd_set_decomp_and_domain_from_readmesh: decomp_init                     10240  10240  1        0.7336      0.5535      6040    1.0441      68
              decompInit_lnd                                                         10240  10240  1        0.6134      0.4340      3978    0.8958      80

So decompInit_lnd is now essentially nothing, which did help.

ESMF mesh part of lnd_set_decomp_and_domain_from_readmesh:

But, the other thing is that the biggest timer for the whole subroutine is the "ESMF mesh" part (shown in the code below), so there isn't much (if anything?) we can do about it. The decompInit_lnd code is something we can improve, and where we can also address non-scalable memory.

          ! Create R8 fields on the input land mesh and the CTSM mesh
          field_lnd = ESMF_FieldCreate(mesh_lndinput, ESMF_TYPEKIND_R8, meshloc=ESMF_MESHLOC_ELEMENT, rc=rc)
          if (ChkErr(rc,__LINE__,u_FILE_u)) return
          field_ctsm = ESMF_FieldCreate(mesh_ctsm, ESMF_TYPEKIND_R8, meshloc=ESMF_MESHLOC_ELEMENT, rc=rc)
          if (ChkErr(rc,__LINE__,u_FILE_u)) return
          ! Precompute the redistribution route handle from the land mesh to the CTSM mesh
          call ESMF_FieldRedistStore(field_lnd, field_ctsm, routehandle=rhandle_lnd2ctsm, &
               ignoreUnmatchedIndices=.true., rc=rc)
          if (chkerr(rc,__LINE__,u_FILE_u)) return
          ! Fill the source field with the local input land fractions
          call ESMF_FieldGet(field_lnd, farrayptr=dataptr1d, rc=rc)
          if (chkerr(rc,__LINE__,u_FILE_u)) return
          do n = 1,size(dataptr1d)
             dataptr1d(n) = lndfrac_loc_input(n)
          end do
          ! Redistribute the land fractions onto the CTSM decomposition
          call ESMF_FieldRedist(field_lnd, field_ctsm, routehandle=rhandle_lnd2ctsm, rc=rc)
          if (chkerr(rc,__LINE__,u_FILE_u)) return
          ! Copy the redistributed values into the local gridcell range
          call ESMF_FieldGet(field_ctsm, farrayptr=dataptr1d, rc=rc)
          if (chkerr(rc,__LINE__,u_FILE_u)) return
          do g = begg, endg
             n = 1 + (g - begg)
             ldomain%frac(g) = dataptr1d(n)
          end do

The other big part is lnd_set_lndmask_from_maskmesh .

This part does have global memory to get the global land mask onto every processor. Every processor needs it so that it can figure out the decomposition. This is one of the parts I didn't see a way to eliminate in the decomposition code, so I don't think we can get rid of the global memory here (unless we do something with the temporary global memory for this step).

This again is ESMF code, so I don't see much we can do with it.
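To see why a mask replicated on every processor is a scaling concern, a rough back-of-envelope estimate can help. The cell count, rank count, and 4-byte mask element below are illustrative assumptions for a sketch, not actual CTSM or mpasa3p75 values:

```python
# Back-of-envelope estimate of replicated global-mask memory.
# ncells, nranks, and bytes_per_cell are illustrative assumptions only.

def replicated_mask_bytes(ncells, nranks, bytes_per_cell=4):
    per_rank = ncells * bytes_per_cell  # every rank holds the full mask
    total = per_rank * nranks           # aggregate across the whole job
    return per_rank, total

per_rank, total = replicated_mask_bytes(ncells=40_000_000, nranks=10_240)
print(per_rank / 2**20, "MiB per rank")   # constant per rank...
print(total / 2**30, "GiB aggregate")     # ...but aggregate grows with ranks
```

The per-rank footprint is fixed by the grid size, but the aggregate grows linearly with the number of ranks, which is the non-scalable behavior referred to above.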

ekluzek avatar Sep 15 '25 21:09 ekluzek

Erik, I am confused here. I thought that the huge initialization time we were attempting to eliminate was in decomp_and_domain_from_readmesh. What impact did the MPI_scan have on that section of the code?

I'll add that I don't think I have original data on 10K cores, but per the original issue, on 40K cores it was taking ~800 seconds on average in decompInit_land, per here:

https://github.com/ESCOMP/CTSM/issues/2995#issue-2905709231

I think that's separate from the read mesh call, and if the whole of clm_init2 at ~26 seconds is inclusive of this bit, that's fantastic. Presumably this addresses the inverse scaling with core count we saw from that inner loop, so even if it doesn't improve further, that's 800s -> 26s, which is great.

Am I understanding that correctly? I understand there are still memory things to address too, but this is very encouraging if so!

briandobbins avatar Sep 15 '25 21:09 briandobbins

It looks like I spent something on the order of a total week of my software development time over the last two weeks.

ekluzek avatar Sep 16 '25 19:09 ekluzek

OK, we are able to show that the new code reduces the initialization time spent in decompInit_lnd for the mpasa3p75 grid with 40k processors from 1972 seconds to 13.5 seconds.

ekluzek avatar Sep 23 '25 21:09 ekluzek

OK, we are able to show that the new code reduces the initialization time spent in decompInit_lnd for the mpasa3p75 grid with 40k processors from 1972 seconds to 13.5 seconds.

This is a big savings! Good job Erik. Let's merge your code changes and move on :) @briandobbins seems like this accomplishes the goal for the SIF?

wwieder avatar Sep 23 '25 22:09 wwieder

Whoa! That's awesome!


dlawrenncar avatar Sep 23 '25 22:09 dlawrenncar

Most of the credit really goes to @johnmauff, as he was the one who knew what to do, and that was the big key to this work. But I also learned a bunch of things from him that I'll be able to use in the future. And it's so nice to have something you think should work pan out -- and actually work in practice!

ekluzek avatar Sep 23 '25 22:09 ekluzek

Hey Erik. I appreciate all of your work on the SIF project. How close are you to being able to wrap it up?

wwieder avatar Oct 06 '25 00:10 wwieder

Hey Erik. I appreciate all of your work on the SIF project. How close are you to being able to wrap it up?

I'm working towards getting the branches to a point where the work is complete and it's just turning the testing crank and responding to reviews that's left to do. I want to get to that (or close to that) by the Tuesday morning meeting where we can talk about what's left to do.

ekluzek avatar Oct 06 '25 07:10 ekluzek

0.2 weeks in sprint 23.

ekluzek avatar Oct 08 '25 16:10 ekluzek

In a meeting with @johnmauff, @briandobbins, and @wwieder, we decided I will cherry-pick just the updates to the decomp files (so just three source files) and bring that in soon, as a secondary priority to other work. We will still want to bring in the other testing PRs, but they can also be secondary to other work and will likely need to wait until after the CESM3 release.

ekluzek avatar Oct 21 '25 18:10 ekluzek