add two compsets for all feature land BGC simulations
Two compsets I1850GSWCNPPHSWFMCROP and I1850WCCNPPHSWFMCROP are added.
[BFB]
Please also add a test for each of these to one of the land suites in tests.py.
@rljacob can you point me to some instructions on adding a new test? Thanks
@rljacob @evasinha I have added the tests and checked that they are working on chrysalis.
@jinyuntang Thanks for addressing the remaining comment. I approve the PR for merge to master.
@jinyuntang I'm working to add the input files to the data server. I also think the ELM history file frequency should be changed so that the tests compare ELM history files (I can do this and push to your branch).
Question: Why are the 1850 compsets setting DATM_CLMNCEP_YR_* to 2004 and the 20TR compsets to 1901? Naively, I would think it should be the other way around.
Having some issues with the tests:
ERS.ne30pg2_r05_EC30to60E2r2.I20TRWCCNPPHSWFMCROP.pm-cpu_gnu.elm-elm_wc_I20TRWCCNPPHSWFMCROP
Model datm missing file file1980 = '/global/cfs/cdirs/e3sm/inputdata/atm/datm7/v2.LR.historical_land/3hrly_drivers/v2.LR.historical_0101_land.cpl.ha2x3h.2014-12.nc'
This test thinks it needs 1,980 datm files covering the years 1850-2014. Maybe this is due to the mismatch I asked about above?
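For reference, the count of 1,980 is consistent with one forcing file per calendar month over 1850-2014. A minimal check, assuming monthly files (which the `2014-12` suffix on `file1980` suggests):

```python
# Assumption: the datm stream lists one forcing file per calendar month.
first_year, last_year = 1850, 2014

n_years = last_year - first_year + 1  # inclusive year count
n_files = n_years * 12                # one file per month

# Year-month labels in stream order; the last one should match file1980.
labels = [
    f"{year:04d}-{month:02d}"
    for year in range(first_year, last_year + 1)
    for month in range(1, 13)
]

print(n_years, n_files, labels[0], labels[-1])
```

Under that assumption, restricting the forcing years used by the test directly reduces the number of files it requests.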
And runtime errors:
ERS.ne30pg2_r05_EC30to60E2r2.I20TRGSWCNPPHSWFMCROP.pm-cpu_gnu.elm-elm_gsw_I20TRGSWCNPPHSWFMCROP
76: dynpft_check_consistency mismatch between PCT_NAT_PATCH at initial time and that obtained from surface dataset
76: On landuse_timeseries file: 0.79000054621734250 0.0000000000000000 0.0000000000000000 0.0000000000000000 0.0000000000000000 0.0000000000000000 0.0000000000000000 0.0000000000000000 0.0000000000000000 0.0000000000000000 0.17999953317609937 0.0000000000000000 0.0000000000000000 1.0000006914143574E-002 1.9999913692414696E-002
76: On surface dataset: 0.44006685176538030 0.0000000000000000 0.0000000000000000 0.0000000000000000 0.0000000000000000 1.4731156714660792E-003 0.0000000000000000 0.0000000000000000 0.0000000000000000 0.0000000000000000 0.36681831562905520 0.0000000000000000 0.0000000000000000 0.0000000000000000 0.19164171693409848
76:
76: Confirm that the year of your surface dataset
76: corresponds to the first year of your landuse_timeseries file
76: (e.g., for a landuse_timeseries file starting at year 1850, which is typical,
76: you should be using an 1850 surface dataset),
76: and that your landuse_timeseries file is compatible with the surface dataset.
76:
76: If you are confident that you are using the correct landuse_timeseries file
76: and the correct surface dataset, then you can bypass this check by setting:
76: check_dynpft_consistency = .false.
76: in user_nl_elm
76:
76: calling getglobalwrite with decomp_index= 56593 and elmlevel= gridcell
76: local gridcell index = 56593
76: global gridcell index = 31300
76: gridcell longitude = 117.75000000000000
76: gridcell latitude = -25.250000000000000
76: ENDRUN:ERROR in /global/u2/p/pschwar3/integration/E3SM/components/elm/src/dyn_subgrid/dynpftFileMod.F90 at line 174
76: ERROR: Unknown error submitted to shr_abort_abort.
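To show the size of the mismatch reported in the log, the two PCT_NAT_PATCH vectors can be compared directly. This is only a sketch: the tolerance `tol` is a hypothetical value chosen for illustration, not the one used in `dynpftFileMod.F90`.

```python
# Values copied from the log above (15 entries printed for the gridcell).
landuse_ts = [
    0.79000054621734250, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
    0.17999953317609937, 0.0, 0.0, 1.0000006914143574e-02, 1.9999913692414696e-02,
]
surface = [
    0.44006685176538030, 0.0, 0.0, 0.0, 0.0, 1.4731156714660792e-03, 0.0, 0.0, 0.0, 0.0,
    0.36681831562905520, 0.0, 0.0, 0.0, 0.19164171693409848,
]

tol = 1.0e-5  # hypothetical tolerance, for illustration only

# Element-wise absolute differences between the two datasets.
diffs = [abs(a - b) for a, b in zip(landuse_ts, surface)]
max_diff = max(diffs)
worst_index = diffs.index(max_diff)  # 0-based position of the largest mismatch
consistent = max_diff <= tol

print(f"sum(landuse_ts)={sum(landuse_ts):.6f} sum(surface)={sum(surface):.6f}")
print(f"max diff={max_diff:.6f} at index {worst_index}, consistent={consistent}")
```

Both vectors sum to 1, so each is a complete set of weights; they simply describe different PFT distributions, which is consistent with the log's hint that the surface dataset and the landuse_timeseries file do not correspond to the same year.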
@peterdschwartz, this is weird. I did not get this error when I ran the test a few weeks ago. Let me double-check. It is possible some files were removed accidentally.
@peterdschwartz I double-checked on chrysalis, and the following command ran the test without problems: /home/ac.jtang/E3SMv3/code/20241003/cime/scripts/create_test ERS.ne30pg2_r05_EC30to60E2r2.I20TRWCCNPPHSWFMCROP.chrysalis_intel.elm-elm_wc_I20TRWCCNPPHSWFMCROP
Therefore, for this case, it is more likely that Perlmutter does not have the data, which can be resolved by moving the data from chrysalis. Do you want me to do that?
I can move the data, but if you are talking about the datm input files, the test needs to be limited to using only the files it needs. We can't allow tests that will want to download 1,980 files.
edit: also, to be clear, the runtime errors were for a different test.
@peterdschwartz I see what you meant. I will re-align the climate data and test on Perlmutter instead.
@peterdschwartz I have now addressed the issue of too many forcing files. However, the tests on Perlmutter ran out of memory. The message looks like " 0: slurmstepd: error: Detected 1 oom_kill event in StepId=38292148.0. Some of the step tasks have been OOM Killed. srun: error: nid006613: task 0: Out Of Memory". @ndkeen, have you encountered such an error? For reference, I hit a similar error when running ILAMB, a benchmarking package for ELM, on Perlmutter. Both E3SM and ILAMB are OK on chrysalis.
Apologies for the late reply. I'm re-running the tests with the update on pm-cpu. My guess is a machine issue, but I'll let you know what I find.
@peterdschwartz any update?
@jinyuntang My runs failed for different reasons: the debug runs hit a NaN or some other invalid floating-point operation. I have been focused on a paper submission deadline that's tonight, so I'll be able to focus on this tomorrow. Apologies for losing track.
@peterdschwartz totally understood. It is odd that it hits a NaN error. Also, for reference, while running ILAMB on Perlmutter I kept hitting memory errors, while it is totally fine on chrysalis, and others said it was fine on compy as well. So Perlmutter may have something odd going on.
@jinyuntang I rebased your branch to make sure I was using the latest machine file for pm-cpu. I got the out-of-memory errors. I manually adjusted one of the tests to use 4 nodes and it completed successfully. I'll submit the rest as well; queue times have been pretty long lately, but hopefully I'll know soon.
@peterdschwartz This looks promising! Thanks for the update.
The 1850 tests complete successfully, but the I20TR tests fail the restart comparison. So some variable is not being written to or read from the restart file.
From ERS.ne30pg2_r05_EC30to60E2r2.I20TRGSWCNPPHSWFMCROP.pm-cpu_intel.elm-elm_gsw_I20TRGSWCNPPHSWFMCROP:
r2x_Flrr_volr (domr_nx,domr_ny,time) t_index = 1 1
3 259200 ( 499, 314, 1) ( 1, 1, 1) ( 221, 270, 1) ( 221, 270, 1)
259200 1.839830958727620E+01 0.000000000000000E+00 1.9E-02 6.613988599134110E-02 2.3E-06 6.613988599134110E-02
259200 1.839830958727620E+01 0.000000000000000E+00 8.515657790647645E-02 8.515657790647645E-02
259200 ( 499, 314, 1) ( 1, 1, 1)
avg abs field values: 1.467033410114126E-02 rms diff: 4.6E-05 avg rel diff(npos): 2.3E-06
1.467048195952900E-02 avg decimal digits(ndif): 0.7 worst: 0.7
RMS r2x_Flrr_volr 4.6261E-05 NORMALIZED 3.1534E-03
r2x_Flrr_volrmch (domr_nx,domr_ny,time) t_index = 1 1
3 259200 ( 499, 314, 1) ( 1, 1, 1) ( 221, 270, 1) ( 221, 270, 1)
259200 1.839743422883545E+01 0.000000000000000E+00 1.9E-02 6.323954114546146E-02 2.3E-06 6.323954114546146E-02
259200 1.839743422883545E+01 0.000000000000000E+00 8.225623306059679E-02 8.225623306059679E-02
259200 ( 499, 314, 1) ( 1, 1, 1)
avg abs field values: 1.436366056336904E-02 rms diff: 4.6E-05 avg rel diff(npos): 2.3E-06
1.436380842175677E-02 avg decimal digits(ndif): 0.7 worst: 0.6
RMS r2x_Flrr_volrmch 4.6261E-05 NORMALIZED 3.2207E-03
It looks like it may be a variable from MOSART. The other I20TR test shows similar differences.
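As a reading aid for the cprnc output above: the NORMALIZED value appears to be the RMS difference divided by the average absolute field value. This is inferred from the printed numbers rather than from the cprnc source, so treat it as an assumption.

```python
# Numbers copied from the cprnc output above.
fields = {
    "r2x_Flrr_volr": {
        "rms": 4.6261e-05,
        "avg_abs": 1.467033410114126e-02,
        "normalized": 3.1534e-03,
    },
    "r2x_Flrr_volrmch": {
        "rms": 4.6261e-05,
        "avg_abs": 1.436366056336904e-02,
        "normalized": 3.2207e-03,
    },
}

# Assumed relation: NORMALIZED ~= RMS / (average absolute field value).
ratios = {name: v["rms"] / v["avg_abs"] for name, v in fields.items()}

for name, v in fields.items():
    print(f"{name}: rms/avg_abs = {ratios[name]:.4e}, reported = {v['normalized']:.4e}")
```

Read this way, the restart run differs from the base run by roughly 0.3% of the typical field magnitude, which is well beyond roundoff.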
@peterdschwartz Thanks for the update. I will look into the MOSART issue.
@peterdschwartz, I added the missing MOSART restart file, which should fix the problem you found. Thanks.
@jinyuntang, why would the ERS test fail when the MOSART initial condition file isn't provided? I agree with @peterdschwartz's comment that some variables must not be getting written/read on restart.
@bishtgautam, my guess is that it is related to MOSART's runoff calculation: running with versus without an initial condition file takes a different initialization path. Perhaps we should consult Hongyi or Tian on this issue?
It looks like there are runoff variables in the restart file:
double RTM_VOLR_LIQ(rtmlat, rtmlon) ;
RTM_VOLR_LIQ:long_name = "water volume in cell (volr)" ;
RTM_VOLR_LIQ:units = "m3" ;
double RTM_VOLR_ICE(rtmlat, rtmlon) ;
RTM_VOLR_ICE:long_name = "water volume in cell (volr)" ;
RTM_VOLR_ICE:units = "m3" ;
double RTM_VOLR_MUD(rtmlat, rtmlon) ;
RTM_VOLR_MUD:long_name = "water volume in cell (volr)" ;
RTM_VOLR_MUD:units = "m3" ;
double RTM_VOLR_SAN(rtmlat, rtmlon) ;
RTM_VOLR_SAN:long_name = "water volume in cell (volr)" ;
RTM_VOLR_SAN:units = "m3" ;
double RTM_RUNOFF_LIQ(rtmlat, rtmlon) ;
RTM_RUNOFF_LIQ:long_name = "runoff (runoff)" ;
RTM_RUNOFF_LIQ:units = "m3/s" ;
double RTM_RUNOFF_ICE(rtmlat, rtmlon) ;
RTM_RUNOFF_ICE:long_name = "runoff (runoff)" ;
RTM_RUNOFF_ICE:units = "m3/s" ;
double RTM_RUNOFF_MUD(rtmlat, rtmlon) ;
RTM_RUNOFF_MUD:long_name = "runoff (runoff)" ;
RTM_RUNOFF_MUD:units = "m3/s" ;
double RTM_RUNOFF_SAN(rtmlat, rtmlon) ;
RTM_RUNOFF_SAN:long_name = "runoff (runoff)" ;
RTM_RUNOFF_SAN:units = "m3/s" ;
But I didn't see where they get used to initialize the Trunoff fields used here:
r2x_r%rattr(index_r2x_Flrr_volr,ni) = (Trunoff%wr(n,nliq) + Trunoff%wt(n,nliq)) / rtmCTL%area(n)
r2x_r%rattr(index_r2x_Flrr_volrmch,ni) = Trunoff%wr(n,nliq) / rtmCTL%area(n)
@hydrotian does this seem like the issue to you?
The issue must depend on whether the MOSART initial condition file is provided for these new compsets; otherwise we would have seen these differences in other tests. Something must be getting initialized differently on restart when the MOSART initial condition file isn't specified.
cc: @hydrotian
Moving this to draft until restart is fixed.
@hydrotian, please see above.
@jinyuntang I am sharing this message since @hydrotian has not yet replied. On Tuesday, September 23rd, in the Land group meeting, @hydrotian mentioned that he does not check the GitHub notifications as they are sent to his personal email. He noted that he was going to update his email address for GitHub, but since he has not yet replied, I wonder if he is still not getting the notifications. Out of caution, I would suggest that @jinyuntang please send him an email reminder as well.
@evasinha I did send a copy by email, in a thread @bishtgautam started after the land group meeting. So hopefully @hydrotian is looking into this.
@jinyuntang and @evasinha Thanks for the reminder. I'm looking into this issue.
@jinyuntang Do all the tests pass when the MOSART restart file is provided? If not, which test should I look into?
To respond to @peterdschwartz's earlier comment about the MOSART restart file: the Trunoff%wt and Trunoff%wr terms are saved as RTM_WT_LIQ and RTM_WR_LIQ in the restart file. On restart, the two terms are read into rtmCTL%wt and rtmCTL%wr and then passed to Trunoff.
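The flow described in this comment, together with the two export lines quoted earlier in the thread, can be sketched as follows. This is an illustrative Python mock-up, not MOSART code: the dictionary standing in for the restart file and the helper names `write_restart`, `read_restart`, and `export_volr` are hypothetical, and only the liquid tracer is shown.

```python
def write_restart(wr, wt):
    """Save the Trunoff%wr and Trunoff%wt terms under their restart names."""
    return {"RTM_WR_LIQ": list(wr), "RTM_WT_LIQ": list(wt)}


def read_restart(restart):
    """Stand-in for reading into rtmCTL%wr / rtmCTL%wt and passing to Trunoff."""
    return list(restart["RTM_WR_LIQ"]), list(restart["RTM_WT_LIQ"])


def export_volr(wr, wt, area):
    """Mirror the two r2x export lines quoted earlier in the thread."""
    volr = [(r + t) / a for r, t, a in zip(wr, wt, area)]  # r2x_Flrr_volr
    volrmch = [r / a for r, a in zip(wr, area)]            # r2x_Flrr_volrmch
    return volr, volrmch


# Made-up values for two cells, for illustration only.
wr = [2.0, 0.5]
wt = [1.0, 0.25]
area = [4.0, 2.0]

before = export_volr(wr, wt, area)
wr_restarted, wt_restarted = read_restart(write_restart(wr, wt))
after = export_volr(wr_restarted, wt_restarted, area)

restart_is_bfb = before == after
print(before, after, restart_is_bfb)
```

If both terms round-trip exactly, the exported fields match bit for bit, which is what the ERS test expects. A difference on restart would then point to some other state feeding these terms, or to an initialization path that differs when no initial condition file is given.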