decomp: Testing MPI_SCAN for the decomposition
Description of changes
Experimental branch that uses MPI_SCAN in the decomposition to get the gridcell offsets.
This is starting off as very experimental, with assert calls to make sure it gives the same results as before.
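The idea, roughly, is that each task knows its local gridcell count and MPI_SCAN gives the prefix sum across ranks, so each task's starting offset falls out without gathering all the counts everywhere. Below is a minimal standalone sketch of that pattern (illustrative variable names, not the actual decompInit_lnd code), including the kind of consistency check against the old gather-based bookkeeping that the asserts are meant to provide:

program scan_offset_sketch
  use mpi
  implicit none
  integer :: ierr, rank, nranks
  integer :: numg_local      ! gridcells owned by this rank (stand-in value)
  integer :: numg_upper      ! inclusive prefix sum of counts across ranks
  integer :: offset          ! global index offset of this rank's first gridcell
  integer :: check_offset
  integer, allocatable :: counts(:)

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nranks, ierr)

  numg_local = 10 + rank     ! stand-in for the real per-rank gridcell count

  ! MPI_Scan gives the inclusive prefix sum; subtracting this rank's own
  ! count turns it into the exclusive starting offset.
  call MPI_Scan(numg_local, numg_upper, 1, MPI_INTEGER, MPI_SUM, MPI_COMM_WORLD, ierr)
  offset = numg_upper - numg_local

  ! Assert-style check against the old approach of gathering every rank's
  ! count and summing the lower-ranked contributions.
  allocate(counts(nranks))
  call MPI_Allgather(numg_local, 1, MPI_INTEGER, counts, 1, MPI_INTEGER, MPI_COMM_WORLD, ierr)
  check_offset = sum(counts(1:rank))
  if (offset /= check_offset) then
     write(*,*) 'rank ', rank, ': offset mismatch ', offset, check_offset
     call MPI_Abort(MPI_COMM_WORLD, 1, ierr)
  end if

  deallocate(counts)
  call MPI_Finalize(ierr)
end program scan_offset_sketch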
Specific notes
Contributors other than yourself, if any: @johnmauff
CTSM Issues Fixed (include github issue #): Work for #2995. Fixes #3370.
Are answers expected to change (and if so in what way)? no
Any User Interface Changes (namelist or namelist defaults changes)? No
Does this create a need to change or add documentation? Did you do so? No, and no.
Testing performed, if any: Running the decomp_init testlist
@johnmauff and I worked on the code to get the mpi_scan testing working. I need to fill out a few more things and make sure it all works, but this was an important step to progress here.
OK, I have the new code functional now, and tried it out with mpasa3p75.
Here are the timing results, which do show an improvement; clm_init2 was previously 42 sec.
clm_init2 10240 10240 1 26.3516 26.3311 9407 26.3628 4084
clm_init2_part1 10240 10240 1 16.9451 16.9385 9628 16.9533 4972
clm_init2_part3 10240 10240 1 8.3216 8.3195 9601 8.3246 260
clm_instInit_part1 10240 10240 1 5.9943 5.9924 9674 5.9975 260
clm_instInit_part2 10240 10240 1 2.0007 1.9890 9756 2.0082 3258
clm_instInit_part3 10240 10240 1 0.3227 0.3100 7045 0.3369 4105
clm_decompInit_clumps 10240 10240 1 0.5077 0.5067 6460 0.5094 1896
clm_decompInit_glcp 10240 10240 1 0.2091 0.2044 9420 0.2149 6575
clm_init2_snow_soil_init 10240 10240 1 0.1811 0.1800 3313 0.1812 1
clm_init2_part2 10240 10240 1 0.1556 0.1517 471 0.1598 9673
clm_init2_part5 10240 10240 1 0.0190 0.0057 7368 0.0237 1932
clm_init2_subgrid 10240 10240 1 0.0064 0.0010 7186 0.0094 4823
clm_init2_part4 10240 10240 1 0.0059 0.0011 8039 0.0102 1077
clm_init1 10240 10240 1 0.4621 0.4507 4391 0.4660 6529
And memory:
VmPeak VmSize VmLck VmPin VmHWM VmRSS VmData VmStk VmExe VmLib VmPTE vmPMD VmSwap
2611484 2611356 0 0 1263408 1057192 1509212 163980 20436 202756 4308 -1 0 [VmStatus] CTSM(Memory check): decompInit_lnd: after allocate
From the spreadsheet, memory was:
RSS 8516 Data 229384 Size 229384 Peak 49156
Erik,
I am confused here. I thought that the huge initialization time we were attempting to eliminate was in decomp_and_domain_from_readmesh. What impact did the MPI_scan have on that section of the code?
John
Erik, I am confused here. I thought that the huge initialization time we were attempting to eliminate was in decomp_and_domain_from_readmesh. What impact did the MPI_scan have on that section of the code? John
It is one of the subroutine calls in there. So it is helping.
Here are some of the timers around that subroutine, and some of its sections, with the update:
lc_lnd_set_decomp_and_domain_from_readmesh 10240 10240 1 17.3806 17.3727 4972 17.3854 9639
lnd_set_decomp_and_domain_from_readmesh: ESMF mesh 10240 10240 1 13.3492 13.3273 9617 13.6642 10233
lnd_set_lndmask_from_maskmesh 10240 10240 1 9.4614 9.4388 9617 9.7779 10233
lnd_set_decomp_and_domain_from_readmesh: decomp_init 10240 10240 1 0.7336 0.5535 6040 1.0441 68
decompInit_lnd 10240 10240 1 0.6134 0.4340 3978 0.8958 80
So decompInit_lnd is now essentially nothing, which did help.
But the other thing is that the biggest timer for the whole subroutine is the "ESMF mesh" part (shown below), so there isn't much, if anything, we can do about it. The decompInit_lnd code is something we can improve, and where we can also address non-scalable memory.
ESMF mesh part of lnd_set_decomp_and_domain_from_readmesh:
! Create fields on the input land mesh and on the CTSM mesh
field_lnd = ESMF_FieldCreate(mesh_lndinput, ESMF_TYPEKIND_R8, meshloc=ESMF_MESHLOC_ELEMENT, rc=rc)
if (ChkErr(rc,__LINE__,u_FILE_u)) return
field_ctsm = ESMF_FieldCreate(mesh_ctsm, ESMF_TYPEKIND_R8, meshloc=ESMF_MESHLOC_ELEMENT, rc=rc)
if (ChkErr(rc,__LINE__,u_FILE_u)) return

! Build the route handle for redistributing from the input mesh to the CTSM decomposition
call ESMF_FieldRedistStore(field_lnd, field_ctsm, routehandle=rhandle_lnd2ctsm, &
     ignoreUnmatchedIndices=.true., rc=rc)
if (chkerr(rc,__LINE__,u_FILE_u)) return

! Fill the input-mesh field with the local land fraction
call ESMF_FieldGet(field_lnd, farrayptr=dataptr1d, rc=rc)
if (chkerr(rc,__LINE__,u_FILE_u)) return
do n = 1,size(dataptr1d)
   dataptr1d(n) = lndfrac_loc_input(n)
end do

! Redistribute and copy the result into ldomain%frac on the CTSM decomposition
call ESMF_FieldRedist(field_lnd, field_ctsm, routehandle=rhandle_lnd2ctsm, rc=rc)
if (chkerr(rc,__LINE__,u_FILE_u)) return
call ESMF_FieldGet(field_ctsm, farrayptr=dataptr1d, rc=rc)
if (chkerr(rc,__LINE__,u_FILE_u)) return
do g = begg, endg
   n = 1 + (g - begg)
   ldomain%frac(g) = dataptr1d(n)
end do
The other big part is lnd_set_lndmask_from_maskmesh.
This part does use global memory to hold the global land mask on every processor. Every processor needs it so that it can figure out the decomposition. This is one of the parts I didn't see a way to get rid of in the decomposition code, so I don't think we can get rid of the global memory here (unless we do something to make it temporary for just this step).
This again is ESMF code, so I don't see much we can do with it.
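To make the memory concern concrete, here is a minimal standalone sketch of what holding the global land mask on every task amounts to. It is illustrative only: the names (mask_loc, mask_glob, etc.) are made up, the real code goes through ESMF rather than raw MPI, and deallocating right after the decomposition step is just the "temporary global memory" idea mentioned above.

program global_mask_sketch
  use mpi
  implicit none
  integer :: ierr, rank, nranks, i
  integer :: lsize, gsize
  integer, allocatable :: mask_loc(:), mask_glob(:), counts(:), displs(:)

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nranks, ierr)

  lsize = 5                        ! stand-in for this task's share of the mask mesh
  allocate(mask_loc(lsize))
  mask_loc(:) = mod(rank, 2)       ! stand-in land/ocean flags

  ! Gather everyone's piece sizes so the global mask can be assembled
  allocate(counts(nranks), displs(nranks))
  call MPI_Allgather(lsize, 1, MPI_INTEGER, counts, 1, MPI_INTEGER, MPI_COMM_WORLD, ierr)
  displs(1) = 0
  do i = 2, nranks
     displs(i) = displs(i-1) + counts(i-1)
  end do
  gsize = sum(counts)

  ! Non-scalable step: every task allocates and receives the full global mask
  allocate(mask_glob(gsize))
  call MPI_Allgatherv(mask_loc, lsize, MPI_INTEGER, &
       mask_glob, counts, displs, MPI_INTEGER, MPI_COMM_WORLD, ierr)

  ! ... the decomposition would be computed from mask_glob here ...

  ! Freeing it right after this step is what would keep the global memory temporary
  deallocate(mask_glob)
  deallocate(mask_loc, counts, displs)
  call MPI_Finalize(ierr)
end program global_mask_sketch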
Erik, I am confused here. I thought that the huge initialization time we were attempting to eliminate was in decomp_and_domain_from_readmesh. What impact did the MPI_scan have on that section of the code?
I'll add that I don't think I have original data on 10K cores, but per the original issue, on 40K cores it was taking ~800 seconds on average in decompInit_land, per here:
https://github.com/ESCOMP/CTSM/issues/2995#issue-2905709231
I think that's separate from the read mesh call, and if the whole of clm_init2 at ~26 seconds is inclusive of this bit, that's fantastic. Presumably this addresses the inverse scaling with core count that we saw from that inner loop, so even if it doesn't improve further, that's 800s -> 26s, which is great.
Am I understanding that correctly? I understand there are still memory things to address too, but this is very encouraging if so!
It looks like I spent something on the order of a total week of my software development time over the last two weeks.
OK, we are able to show that the new code reduces the initialization time spent in decompInit_lnd for the mpasa3p75 grid with 40k processors from 1972 seconds to 13.5 seconds.
OK, we are able to show that the new code reduces the initialization time spent in decompInit_lnd for the mpasa3p75 grid with 40k processors from 1972 seconds to 13.5 seconds.
This is a big savings! Good job Erik. Let's merge your code changes and move on :) @briandobbins seems like this accomplishes the goal for the SIF?
Whoa! That's awesome!
Most of the credit really goes to @johnmauff as he was the one that knew what to do. And that was the big key in doing this work. But, I also learned a bunch of stuff from him and will be able to use that in the future. And it's so nice to see something you think will work pan out -- and have it actually work in practice!
Hey Erik. I appreciate all of your work on the SIF project. How close are you to being able to wrap it up?
Hey Erik. I appreciate all of your work on the SIF project. How close are you to being able to wrap it up?
I'm working towards getting the branches to a point where the work is complete and all that's left is turning the testing crank and responding to reviews. I want to get to that (or close to that) by the Tuesday morning meeting, where we can talk about what's left to do.
0.2 weeks in sprint 23.
From a meeting with @johnmauff, @briandobbins, @wwieder, and me, we decided I will cherry-pick just the updates to the decomp files (so just three source files) and bring that in soon, as a secondary priority to other work. We will want to bring in the other testing PRs too, but they can be secondary to other work and will likely need to wait until after the CESM3 release.