add two compsets for all feature land BGC simulations

Open jinyuntang opened this issue 9 months ago • 25 comments

Two compsets I1850GSWCNPPHSWFMCROP and I1850WCCNPPHSWFMCROP are added.

[BFB]

jinyuntang avatar Mar 17 '25 16:03 jinyuntang

Please also add a test for each of these to one of the land suites in tests.py.

rljacob avatar Mar 17 '25 17:03 rljacob

"Please also add a test for each of these to one of the land suites in tests.py."

@rljacob can you point me to some instructions on adding a new test? Thanks.

jinyuntang avatar Mar 21 '25 04:03 jinyuntang

@rljacob @evasinha I have added the tests and checked that they are working on chrysalis.

jinyuntang avatar Apr 07 '25 20:04 jinyuntang

@jinyuntang Thanks for addressing the remaining comment. I approve the PR for merge to master.

evasinha-pnnl avatar Apr 08 '25 16:04 evasinha-pnnl

@jinyuntang I'm working to add the input files to the data server. I also think the ELM history file frequency should be changed so that the tests compare ELM history files (I can do this and push to your branch).

Question: Why do the 1850 compsets set DATM_CLMNCEP_YR_* to 2004 and the 20TR compsets to 1901? Naively, I would think it should be the other way around.

peterdschwartz avatar Apr 08 '25 17:04 peterdschwartz

Having some issues with the tests:

ERS.ne30pg2_r05_EC30to60E2r2.I20TRWCCNPPHSWFMCROP.pm-cpu_gnu.elm-elm_wc_I20TRWCCNPPHSWFMCROP
Model datm missing file file1980 = '/global/cfs/cdirs/e3sm/inputdata/atm/datm7/v2.LR.historical_land/3hrly_drivers/v2.LR.historical_0101_land.cpl.ha2x3h.2014-12.nc'

This test thinks it needs 1,980 datm files covering the years 1850-2014. Maybe this is due to the mismatch I asked about above?

And runtime errors:

ERS.ne30pg2_r05_EC30to60E2r2.I20TRGSWCNPPHSWFMCROP.pm-cpu_gnu.elm-elm_gsw_I20TRGSWCNPPHSWFMCROP
 76:  dynpft_check_consistency mismatch between PCT_NAT_PATCH at initial time and that obtained from surface dataset
 76:  On landuse_timeseries file:   0.79000054621734250        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000       0.17999953317609937        0.0000000000000000        0.0000000000000000        1.0000006914143574E-002   1.9999913692414696E-002
 76:  On surface dataset:   0.44006685176538030        0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000        1.4731156714660792E-003   0.0000000000000000        0.0000000000000000        0.0000000000000000        0.0000000000000000       0.36681831562905520        0.0000000000000000        0.0000000000000000        0.0000000000000000       0.19164171693409848
 76:
 76:  Confirm that the year of your surface dataset
 76:  corresponds to the first year of your landuse_timeseries file
 76:  (e.g., for a landuse_timeseries file starting at year 1850, which is typical,
 76:  you should be using an 1850 surface dataset),
 76:  and that your landuse_timeseries file is compatible with the surface dataset.
 76:
 76:  If you are confident that you are using the correct landuse_timeseries file
 76:  and the correct surface dataset, then you can bypass this check by setting:
 76:    check_dynpft_consistency = .false.
 76:  in user_nl_elm
 76:
 76:  calling getglobalwrite with decomp_index=        56593  and elmlevel= gridcell
 76:  local  gridcell index =        56593
 76:  global gridcell index =        31300
 76:  gridcell longitude    =    117.75000000000000
 76:  gridcell latitude     =   -25.250000000000000
 76:  ENDRUN:ERROR in /global/u2/p/pschwar3/integration/E3SM/components/elm/src/dyn_subgrid/dynpftFileMod.F90 at line 174
 76:  ERROR: Unknown error submitted to shr_abort_abort.

peterdschwartz avatar Apr 30 '25 15:04 peterdschwartz
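
To make the size of the mismatch in the log above concrete, here is a minimal sketch that compares the two PCT_NAT_PATCH vectors slot by slot. The values are copied from the log; the tolerance and the helper name `largest_mismatch` are assumptions made for illustration and are not taken from ELM's dynpftFileMod.F90.

```python
# PCT_NAT_PATCH values copied from the log above (15 slots each).
landuse_timeseries = [
    0.79000054621734250, 0.0, 0.0, 0.0, 0.0,
    0.0, 0.0, 0.0, 0.0, 0.0,
    0.17999953317609937, 0.0, 0.0, 1.0000006914143574e-02, 1.9999913692414696e-02,
]
surface_dataset = [
    0.44006685176538030, 0.0, 0.0, 0.0, 0.0,
    1.4731156714660792e-03, 0.0, 0.0, 0.0, 0.0,
    0.36681831562905520, 0.0, 0.0, 0.0, 0.19164171693409848,
]

TOLERANCE = 1.0e-9  # assumed threshold for this sketch; ELM's own tolerance may differ


def largest_mismatch(a, b):
    """Return the (0-based) slot with the largest absolute difference and that difference."""
    diffs = [abs(x - y) for x, y in zip(a, b)]
    worst = max(range(len(diffs)), key=diffs.__getitem__)
    return worst, diffs[worst]


index, diff = largest_mismatch(landuse_timeseries, surface_dataset)
consistent = diff <= TOLERANCE
print(f"largest difference {diff:.6f} at slot {index}; consistent: {consistent}")
```

Both vectors sum to one, so the two files describe the same gridcell with clearly different vegetation fractions. That points to a surface dataset and landuse_timeseries file from different years, as the log message suggests, rather than to round-off.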

@peterdschwartz, this is weird; I did not get this error when I ran the test a few weeks ago. Let me double-check. It is possible some files were removed accidentally.

jinyuntang avatar Apr 30 '25 16:04 jinyuntang

@peterdschwartz I double-checked on chrysalis, and the following command ran the test without problems:

/home/ac.jtang/E3SMv3/code/20241003/cime/scripts/create_test ERS.ne30pg2_r05_EC30to60E2r2.I20TRWCCNPPHSWFMCROP.chrysalis_intel.elm-elm_wc_I20TRWCCNPPHSWFMCROP

Therefore, for this case, it is more likely that Perlmutter does not have the data, which can be resolved by moving the data from chrysalis. Do you want me to do that?

jinyuntang avatar Apr 30 '25 22:04 jinyuntang

I can move the data, but if you are talking about the datm input files, the test needs to be limited to using only the files it needs. We can't allow tests that will want to download 1,980 files.

edit: also, to be clear, the runtime errors were for a different test.

peterdschwartz avatar May 01 '25 00:05 peterdschwartz
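
The figure of 1,980 files follows from one forcing file per month over 1850-2014. A small sketch of that arithmetic is below; the file-name pattern is taken from the error message quoted earlier, while the two-year window is a hypothetical example of a restricted range, not the range the PR ended up using.

```python
def monthly_files(first_year, last_year):
    """Return one file name per month, following the pattern in the error message."""
    pattern = "v2.LR.historical_0101_land.cpl.ha2x3h.{year:04d}-{month:02d}.nc"
    return [
        pattern.format(year=year, month=month)
        for year in range(first_year, last_year + 1)
        for month in range(1, 13)
    ]


full = monthly_files(1850, 2014)    # 165 years * 12 months
short = monthly_files(1850, 1851)   # hypothetical two-year window for a short test

print(len(full), full[-1])
print(len(short))
```

The last entry of the full list is the 2014-12 file that the error message labels file1980, which is consistent with the test requesting the entire historical period instead of only the years it actually runs.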

@peterdschwartz I see what you meant. I will re-align the climate data, and test them on Perlmutter instead.

jinyuntang avatar May 01 '25 02:05 jinyuntang

@peterdschwartz I have now addressed the issue of too many forcing files. However, the tests on Perlmutter run out of memory. The message looks like " 0: slurmstepd: error: Detected 1 oom_kill event in StepId=38292148.0. Some of the step tasks have been OOM Killed. srun: error: nid006613: task 0: Out Of Memory". @ndkeen, have you encountered such an error? For reference, I hit a similar error when running ilamb, a benchmarking package for ELM, on Perlmutter. Both E3SM and ilamb are OK on chrysalis.

jinyuntang avatar May 02 '25 22:05 jinyuntang

Apologies for the late reply. I'm re-running the tests with the update on pm-cpu. My guess is a machine issue, but I'll let you know what I find.

peterdschwartz avatar May 08 '25 15:05 peterdschwartz

@peterdschwartz any update?

rljacob avatar May 15 '25 17:05 rljacob

@jinyuntang My runs failed for different reasons: the debug runs hit a NaN or some other invalid floating-point operation. I have been focused on a paper submission deadline that's tonight, so I'll be able to focus on this tomorrow. Apologies for losing track.

peterdschwartz avatar May 15 '25 17:05 peterdschwartz

@peterdschwartz Totally understood. It is odd that it hits a NaN error. Also, for reference, while running ilamb on Perlmutter I kept hitting memory errors, while it is totally fine on chrysalis, and others said it was fine on compy as well. So Perlmutter may have something odd going on.

jinyuntang avatar May 15 '25 17:05 jinyuntang

@jinyuntang I rebased your branch to make sure I was using the latest machine file for pm-cpu. I got the out-of-memory errors. I manually adjusted one of the tests to use 4 nodes and it was able to complete successfully. I'll submit the rest as well. Queue times have been pretty long lately, but hopefully I'll know soon.

peterdschwartz avatar May 19 '25 21:05 peterdschwartz

@peterdschwartz This looks promising! Thanks for the update.

jinyun1tang avatar May 19 '25 21:05 jinyun1tang

The 1850 tests complete successfully, but the I20TR tests fail the restart comparison, so some variable is not being written to or read from the restart file. From ERS.ne30pg2_r05_EC30to60E2r2.I20TRGSWCNPPHSWFMCROP.pm-cpu_intel.elm-elm_gsw_I20TRGSWCNPPHSWFMCROP:

 r2x_Flrr_volr   (domr_nx,domr_ny,time)  t_index =      1     1
          3   259200  (   499,   314,     1) (     1,     1,     1) (   221,   270,     1) (   221,   270,     1)
              259200   1.839830958727620E+01   0.000000000000000E+00 1.9E-02  6.613988599134110E-02 2.3E-06  6.613988599134110E-02
              259200   1.839830958727620E+01   0.000000000000000E+00          8.515657790647645E-02          8.515657790647645E-02
              259200  (   499,   314,     1) (     1,     1,     1)
          avg abs field values:    1.467033410114126E-02    rms diff: 4.6E-05   avg rel diff(npos):  2.3E-06
                                   1.467048195952900E-02                        avg decimal digits(ndif):  0.7 worst:  0.7
 RMS r2x_Flrr_volr                    4.6261E-05            NORMALIZED  3.1534E-03

 r2x_Flrr_volrmch   (domr_nx,domr_ny,time)  t_index =      1     1
          3   259200  (   499,   314,     1) (     1,     1,     1) (   221,   270,     1) (   221,   270,     1)
              259200   1.839743422883545E+01   0.000000000000000E+00 1.9E-02  6.323954114546146E-02 2.3E-06  6.323954114546146E-02
              259200   1.839743422883545E+01   0.000000000000000E+00          8.225623306059679E-02          8.225623306059679E-02
              259200  (   499,   314,     1) (     1,     1,     1)
          avg abs field values:    1.436366056336904E-02    rms diff: 4.6E-05   avg rel diff(npos):  2.3E-06
                                   1.436380842175677E-02                        avg decimal digits(ndif):  0.7 worst:  0.6
 RMS r2x_Flrr_volrmch                 4.6261E-05            NORMALIZED  3.2207E-03

It looks like it may be a variable from MOSART. The other I20TR test shows similar differences.

peterdschwartz avatar May 21 '25 14:05 peterdschwartz
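
As a reading aid for the comparison output above: the NORMALIZED values are numerically consistent with the RMS difference divided by the mean of the two "avg abs field values". This relationship is inferred from the numbers in the output, not taken from the comparison tool's source, and the function name `normalized_rms` is hypothetical.

```python
def normalized_rms(rms_diff, avg_abs_1, avg_abs_2):
    """Scale an RMS difference by the mean of the two average absolute field values."""
    return rms_diff / (0.5 * (avg_abs_1 + avg_abs_2))


# Values copied from the comparison output above.
volr = normalized_rms(4.6261e-05, 1.467033410114126e-02, 1.467048195952900e-02)
volrmch = normalized_rms(4.6261e-05, 1.436366056336904e-02, 1.436380842175677e-02)

# Reported values: 3.1534E-03 for r2x_Flrr_volr and 3.2207E-03 for r2x_Flrr_volrmch.
print(f"{volr:.4E} {volrmch:.4E}")
```

A normalized difference of about 0.3% is far larger than round-off, which supports the reading that the restarted run starts from a genuinely different river state.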

@peterdschwartz Thanks for the update. I will look into the MOSART issue.

jinyun1tang avatar May 21 '25 15:05 jinyun1tang

@peterdschwartz, I added the missing MOSART restart file, which should fix the problem you found. Thanks.

jinyun1tang avatar May 21 '25 16:05 jinyun1tang

@jinyuntang, why would the ERS test fail when the MOSART initial condition file isn't provided? I agree with @peterdschwartz's comment that some variables must not be getting written or read on restart.

bishtgautam avatar May 23 '25 06:05 bishtgautam
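
The point about restart can be illustrated with a toy model. The sketch below is not MOSART code: the two storages, the update rule, and the cold-start defaults are all hypothetical. It only shows why a restart comparison fails when part of the model state is re-initialized instead of being read back.

```python
def step(state):
    """Advance a toy two-storage model by one step (not MOSART physics)."""
    wr, wt = state["wr"], state["wt"]
    flow = 0.1 * wt  # hypothetical transfer from the second storage to the first
    return {"wr": 0.9 * wr + flow, "wt": wt - flow + 1.0}


def run(state, nsteps):
    """Apply `step` nsteps times and return the final state."""
    for _ in range(nsteps):
        state = step(state)
    return state


def restart_run(initial, nsteps, saved_fields):
    """Run half the steps, keep only saved_fields, reset the rest, then finish."""
    half = run(dict(initial), nsteps // 2)
    restarted = {"wr": 0.0, "wt": 0.0}  # cold-start defaults for unsaved fields
    restarted.update({name: half[name] for name in saved_fields})
    return run(restarted, nsteps - nsteps // 2)


initial = {"wr": 5.0, "wt": 2.0}
reference = run(dict(initial), 10)
complete = restart_run(initial, 10, saved_fields=("wr", "wt"))
incomplete = restart_run(initial, 10, saved_fields=("wr",))

print("all state saved, matches reference:", reference == complete)
print("wt not saved, matches reference:", reference == incomplete)
```

When every field is carried across the restart, the two runs agree exactly; dropping one field is enough to break the comparison. Whether the failure in this PR comes from a missing field or from a different initialization path is exactly what the thread is trying to establish.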

@bishtgautam, my guess is that it is related to MOSART's runoff calculation: running with or without an initial condition file takes a different initialization path, which then produces different results. Perhaps we should consult Hongyi or Tian on this issue?

jinyun1tang avatar May 23 '25 15:05 jinyun1tang

It looks like there are runoff variables in the restart file:

        double RTM_VOLR_LIQ(rtmlat, rtmlon) ;
                RTM_VOLR_LIQ:long_name = "water volume in cell (volr)" ;
                RTM_VOLR_LIQ:units = "m3" ;
        double RTM_VOLR_ICE(rtmlat, rtmlon) ;
                RTM_VOLR_ICE:long_name = "water volume in cell (volr)" ;
                RTM_VOLR_ICE:units = "m3" ;
        double RTM_VOLR_MUD(rtmlat, rtmlon) ;
                RTM_VOLR_MUD:long_name = "water volume in cell (volr)" ;
                RTM_VOLR_MUD:units = "m3" ;
        double RTM_VOLR_SAN(rtmlat, rtmlon) ;
                RTM_VOLR_SAN:long_name = "water volume in cell (volr)" ;
                RTM_VOLR_SAN:units = "m3" ;
        double RTM_RUNOFF_LIQ(rtmlat, rtmlon) ;
                RTM_RUNOFF_LIQ:long_name = "runoff (runoff)" ;
                RTM_RUNOFF_LIQ:units = "m3/s" ;
        double RTM_RUNOFF_ICE(rtmlat, rtmlon) ;
                RTM_RUNOFF_ICE:long_name = "runoff (runoff)" ;
                RTM_RUNOFF_ICE:units = "m3/s" ;
        double RTM_RUNOFF_MUD(rtmlat, rtmlon) ;
                RTM_RUNOFF_MUD:long_name = "runoff (runoff)" ;
                RTM_RUNOFF_MUD:units = "m3/s" ;
        double RTM_RUNOFF_SAN(rtmlat, rtmlon) ;
                RTM_RUNOFF_SAN:long_name = "runoff (runoff)" ;
                RTM_RUNOFF_SAN:units = "m3/s" ;

But I didn't see where they get used to initialize the Trunoff fields used here:

       r2x_r%rattr(index_r2x_Flrr_volr,ni)    = (Trunoff%wr(n,nliq) + Trunoff%wt(n,nliq)) / rtmCTL%area(n)
       r2x_r%rattr(index_r2x_Flrr_volrmch,ni) = Trunoff%wr(n,nliq) / rtmCTL%area(n)

@hydrotian does this seem like the issue to you?

peterdschwartz avatar May 23 '25 17:05 peterdschwartz

The issue must be tied to whether the MOSART initial condition file is provided for these new compsets; otherwise we would have seen these differences in other tests. Something must be getting initialized differently on restart when the MOSART initial condition file isn't specified.

cc: @hydrotian

bishtgautam avatar May 23 '25 17:05 bishtgautam

Moving this to draft until restart is fixed.

rljacob avatar Jun 03 '25 15:06 rljacob

@hydrotian, please see above.

jinyuntang avatar Sep 25 '25 00:09 jinyuntang

@jinyuntang I am sharing this message since @hydrotian has not yet replied. On Tuesday, September 23rd, in the Land group meeting, @hydrotian mentioned that he does not check the GitHub notifications as they are sent to his personal email. He noted that he was going to update his email address for GitHub, but since he has not yet replied, I wonder if he is still not getting the notifications. Out of caution, I would suggest that @jinyuntang please send him an email reminder as well.

evasinha-pnnl avatar Sep 25 '25 15:09 evasinha-pnnl

@evasinha I did send a copy by email, in a thread @bishtgautam started after the land group meeting. So hopefully @hydrotian is looking into this.

jinyuntang avatar Sep 25 '25 15:09 jinyuntang

@jinyuntang and @evasinha Thanks for the reminder. I'm looking into this issue.

hydrotian avatar Sep 25 '25 15:09 hydrotian

@jinyuntang Do all the tests pass when the MOSART restart file is provided? If not, which test should I look into?

To respond to @peterdschwartz's earlier comment about the MOSART restart file: the Trunoff%wt and Trunoff%wr terms are saved as RTM_WT_LIQ and RTM_WR_LIQ in the restart file. On restart, the two terms are read into rtmCTL%wt and rtmCTL%wr and then passed to Trunoff.

hydrotian avatar Sep 25 '25 22:09 hydrotian
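
The restart path described above, combined with the two coupler assignments quoted earlier in the thread, can be sketched as follows. The numbers are hypothetical, plain dictionaries stand in for the restart file and the Fortran derived types, and the liquid-tracer index (nliq) is dropped for brevity.

```python
# Hypothetical per-cell values standing in for a MOSART restart file.
restart = {"RTM_WR_LIQ": [120.0, 0.0, 45.5], "RTM_WT_LIQ": [30.0, 10.0, 4.5]}
area = [10.0, 20.0, 5.0]  # stand-in for rtmCTL%area

# Restart read: restart variables -> rtmCTL -> Trunoff, as described above.
rtmCTL = {"wr": list(restart["RTM_WR_LIQ"]), "wt": list(restart["RTM_WT_LIQ"])}
Trunoff = {"wr": list(rtmCTL["wr"]), "wt": list(rtmCTL["wt"])}

# Coupler fields, following the two Fortran assignments quoted earlier:
#   volr    = (wr + wt) / area
#   volrmch = wr / area
volr = [(wr + wt) / a for wr, wt, a in zip(Trunoff["wr"], Trunoff["wt"], area)]
volrmch = [wr / a for wr, a in zip(Trunoff["wr"], area)]

print(volr)
print(volrmch)
```

If this path is followed on restart, both coupler fields are fully determined by RTM_WR_LIQ and RTM_WT_LIQ, so a mismatch in r2x_Flrr_volr would suggest that one of those steps is skipped or overwritten in the failing configuration.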