CTSM FATES land use v2 API update (CTSM-side)

Description of changes

Update CTSM to work with FATES land use v2 (NGEET/fates#1116).

Based on @glemieux's analogous update for E3SM (E3SM-Project/E3SM#6353).

Specific notes

This PR enables the host land model to read in a new landuse x pft static mapping dataset from the fates landuse data tool. A default output at a 4x5 resolution is provided.

This update also includes a new ctsm-fates specific system test, using the PVT prefix, which runs a 5-year spin-up in the new fates "potential vegetation" mode. The output of that spin-up is then used to start a fates landuse transient run using the landuse timeseries data (which was added back with ctsm5.1.dev160).

The fates harvest and logging options have been refactored and simplified into a new option, fates_harvest_mode, to aid the user in selecting harvest modes compatible with other fates run modes. This includes two new modes that use the area or mass harvesting data from the fates LUH2 landuse timeseries data. A new convenience namelist option, use_fates_lupft, has also been provided for turning on fates landuse with no competition and fixed biogeography.

Contributors other than yourself, if any: @glemieux, @ckoven

CTSM Issues Fixed (include github issue #): None

Are answers expected to change (and if so in what way)? Yes, but for fates testmods only

Any User Interface Changes (namelist or namelist defaults changes)? Yes.

New use_fates_lupft convenience option (use_fates_luh + use_fates_nocomp + use_fates_fixedbiogeog)
use_fates_logging refactored into fates_harvest_mode with the addition of two new harvest modes
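
As a rough illustration of the convenience option listed above, the sketch below expands use_fates_lupft into the three flags it stands for. It is a Python stand-in, not the CTSM build-namelist logic, which is not shown in this description. The function name `expand_lupft`, the dict representation of the namelist, and the choice to raise an error on a conflicting explicit setting are all assumptions made for the example.

```python
def expand_lupft(namelist):
    """Return a copy of `namelist` with use_fates_lupft expanded.

    Hypothetical helper: the three implied flags are taken from the PR
    description; everything else here is an assumption for illustration.
    """
    implied = ("use_fates_luh", "use_fates_nocomp", "use_fates_fixedbiogeog")
    expanded = dict(namelist)
    if expanded.get("use_fates_lupft", False):
        for flag in implied:
            # Assumed behaviour: refuse to silently override an explicit,
            # conflicting user setting.
            if expanded.get(flag) is False:
                raise ValueError(f"{flag} = .false. conflicts with use_fates_lupft")
            expanded[flag] = True
    return expanded


if __name__ == "__main__":
    # Turning on the convenience option switches on all three implied flags.
    print(expand_lupft({"use_fates_lupft": True}))
```

A namelist that does not set use_fates_lupft passes through unchanged, so the helper only acts when the convenience option is requested.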

Testing performed, if any: In progress with development.

May 01 '24 21:05 samsrabin

The system test I added for the use_fates_potentialveg spin-up to transient mode run case looks to be building correctly. That said, I need to re-evaluate the potential vegetation mode checks: we want to make sure that the fluh_timeseries is not set for potential veg mode to avoid confusing the user. Currently the potential veg case run is failing because it can't find that file.

May 25 '24 00:05 glemieux

Status update: The spin-up + transient system test, PVT, is working, but I'm currently tracking down an issue with the fates side of the code failing during fates landuse patch reallocation in the transient run phase. In the course of trying to fix the issue, I've discovered that the FatesColdLUH2 test will fail with a similar error depending on which landuse timeseries file is being used (i.e. the file starting at year 1850 passes, but the one starting at year 1650 does not). I'm going to investigate the potential differences between the timeseries data.

That said, I think we can get review on this started since the issues appear to be either due to inputdata or on the fates-side.

May 29 '24 23:05 glemieux

fates test suite is nominal with this exception, a RUN fail in: ERS_D_Ld30.f45_f45_mg37.I2000Clm50FatesCruRsGs.derecho_intel.clm-FatesColdPRT2

Here is the error trace:

dec0645.hsn.de.hpc.ucar.edu 128: Abort with message NetCDF: String match to name in use in file /glade/derecho/scratch/jedwards/tmp/spack-st\
age/spack-stage-parallelio-2.6.2-q7fyefeg5lg44337zrklqh6rduj62g2m/spack-src/src/clib/pio_nc.c at line 2298
dec0645.hsn.de.hpc.ucar.edu 128: Obtained 10 stack frames.
dec0645.hsn.de.hpc.ucar.edu 128: /glade/u/apps/cseg/derecho/23.09/spack/opt/spack/linux-sles15-x86_64_v3/oneapi-2023.2.1/parallelio-2.6.2-q7\
fyefeg5lg44337zrklqh6rduj62g2m/lib/libpioc.so(print_trace+0x36) [0x14659c445856]
dec0645.hsn.de.hpc.ucar.edu 128: /glade/u/apps/cseg/derecho/23.09/spack/opt/spack/linux-sles15-x86_64_v3/oneapi-2023.2.1/parallelio-2.6.2-q7\
fyefeg5lg44337zrklqh6rduj62g2m/lib/libpioc.so(piodie+0xa6) [0x14659c445986]
dec0645.hsn.de.hpc.ucar.edu 128: /glade/u/apps/cseg/derecho/23.09/spack/opt/spack/linux-sles15-x86_64_v3/oneapi-2023.2.1/parallelio-2.6.2-q7\
fyefeg5lg44337zrklqh6rduj62g2m/lib/libpioc.so(check_netcdf2+0x1ec) [0x14659c445bcc]
dec0645.hsn.de.hpc.ucar.edu 128: /glade/u/apps/cseg/derecho/23.09/spack/opt/spack/linux-sles15-x86_64_v3/oneapi-2023.2.1/parallelio-2.6.2-q7\
fyefeg5lg44337zrklqh6rduj62g2m/lib/libpioc.so(check_netcdf+0x2e) [0x14659c4459ce]
dec0645.hsn.de.hpc.ucar.edu 128: /glade/u/apps/cseg/derecho/23.09/spack/opt/spack/linux-sles15-x86_64_v3/oneapi-2023.2.1/parallelio-2.6.2-q7\
fyefeg5lg44337zrklqh6rduj62g2m/lib/libpioc.so(PIOc_def_var+0x862) [0x14659c46b142]
dec0645.hsn.de.hpc.ucar.edu 128: /glade/u/apps/cseg/derecho/23.09/spack/opt/spack/linux-sles15-x86_64_v3/oneapi-2023.2.1/parallelio-2.6.2-q7\
fyefeg5lg44337zrklqh6rduj62g2m/lib/libpiof.so(pio_nf_mp_def_var_md_id_+0x30c) [0x14659c6ae071]
dec0645.hsn.de.hpc.ucar.edu 128: /glade/u/apps/cseg/derecho/23.09/spack/opt/spack/linux-sles15-x86_64_v3/oneapi-2023.2.1/parallelio-2.6.2-q7\
fyefeg5lg44337zrklqh6rduj62g2m/lib/libpiof.so(pio_nf_mp_def_var_md_desc_+0x12f) [0x14659c6add47]
dec0645.hsn.de.hpc.ucar.edu 128: /glade/derecho/scratch/rgknox/tests_0530-081936de/ERS_D_Ld30.f45_f45_mg37.I2000Clm50FatesCruRsGs.derecho_in\
tel.clm-FatesColdPRT2.GC.0530-081936de_int/bld/cesm.exe() [0x9ac247]
dec0645.hsn.de.hpc.ucar.edu 128: /glade/derecho/scratch/rgknox/tests_0530-081936de/ERS_D_Ld30.f45_f45_mg37.I2000Clm50FatesCruRsGs.derecho_in\
tel.clm-FatesColdPRT2.GC.0530-081936de_int/bld/cesm.exe() [0x9af328]
dec0645.hsn.de.hpc.ucar.edu 128: /glade/derecho/scratch/rgknox/tests_0530-081936de/ERS_D_Ld30.f45_f45_mg37.I2000Clm50FatesCruRsGs.derecho_in\
tel.clm-FatesColdPRT2.GC.0530-081936de_int/bld/cesm.exe() [0xf82601]
dec0645.hsn.de.hpc.ucar.edu 128: MPICH ERROR [Rank 128] [job id 279c24b0-28e5-43ed-8d04-fda599af1962] [Thu May 30 08:44:25 2024] [dec0645] -\
 Abort(-1) (rank 128 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, -1) - process 128
dec0645.hsn.de.hpc.ucar.edu 128:
dec0645.hsn.de.hpc.ucar.edu 128: aborting job:
dec0645.hsn.de.hpc.ucar.edu 128: application called MPI_Abort(MPI_COMM_WORLD, -1) - process 128

May 30 '24 15:05 rgknox

@rgknox a guess from reading the error message is that possibly the same name for a field is being defined twice? So in CTSM speak there are two hist_addfld calls for the same variable name? I think that fails with something like this.

Another thing that better error messaging could help us with.

May 30 '24 17:05 ekluzek

@ckoven and I talked it through and he figured it out: the variable symbol names were too long and getting truncated, which led to it thinking a variable was defined twice.

To elaborate: we have a special mechanism that allows us to auto-generate variable names. In this case we want the same variable for carbon, nitrogen and phosphorus. So it creates the same variable name, then appends "_01" "_02" and "_03" to the end of the name string for each, respectively. Since the base name was too long, the appended string was getting truncated off, and thus the auto-generated variables were triplicating.
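
The failure mode described above can be sketched in a few lines of Python. This is not the FATES code; the function names, the example base names, and the length limits are hypothetical and chosen only to show how truncation turns distinct suffixed names into duplicates.

```python
def make_names(base, n_elements, max_len):
    """Append _01, _02, ... to `base`, then truncate to `max_len` characters."""
    names = []
    for i in range(1, n_elements + 1):
        full = f"{base}_{i:02d}"
        # Fixed-width storage keeps only the first max_len characters.
        names.append(full[:max_len])
    return names


def duplicates(names):
    """Return the sorted list of names that occur more than once."""
    seen, dups = set(), set()
    for name in names:
        (dups if name in seen else seen).add(name)
    return sorted(dups)


if __name__ == "__main__":
    # Short base name: the suffix survives, so all three names are unique.
    print(make_names("SHORT", 3, 8), duplicates(make_names("SHORT", 3, 8)))
    # Base name already fills the limit: the suffix is cut off entirely.
    long_base = "A_MUCH_LONGER_BASE"
    clipped = make_names(long_base, 3, len(long_base))
    print(clipped, duplicates(clipped))
```

Checking for duplicates right after name generation, before any variables are registered, would report the offending base name instead of a downstream NetCDF error.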

May 30 '24 18:05 rgknox

pushing the (fates side) fix now

May 30 '24 18:05 rgknox

At the SE meeting: Erik and Greg will review the code. The exact restart issue needs to be addressed. Ryan can make the FATES PR if Greg's on PTO. Adrianna can merge to CTSM master (or Sam R if she's away).

Jun 27 '24 15:06 wwieder

Regression testing against https://github.com/ESCOMP/CTSM/releases/tag/branch_tags%2Ftmp-240620.n02.ctsm5.2.007 is showing B4B results with expected DIFFs in fates testmods. That said, there is an ERP fates satellite phenology test that's failing the exact restart comparison, which I need to track down on the fates side.

Jun 27 '24 16:06 glemieux

This pull request is now updated with the fates-side tag for land use v2. Regression testing against ctsm5.2.008 is showing mostly B4B results with expected DIFFs for fates testmods. There are two unexpected ctsm failures that I'm seeing:

FAIL LILACSMOKE_D_Ld2.f10_f10_mg37.I2000Ctsm50NwpSpAsRs.derecho_intel.clm-lilac MODEL_BUILD time=329
FAIL RXCROPMATURITYSKIPGEN_Ld1097.f10_f10_mg37.IHistClm50BgcCrop.derecho_intel.clm-cropMonthOutput RUN time=24

The RXCROP run failure looks like it's just a variation on #2322, i.e. the string element for fsurdat is too long for the gddgen case. When I rerun the regression tests for the final integration testing, I'll make sure not to use a specific test-id, assuming the default system test-id is short enough.
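
To make the test-id effect concrete, here is a small Python sketch of why a longer test-id pushes the fsurdat string over a length limit. The path layout mirrors the failing path reported for this test, where the test-id appears in both the suite directory and the case directory. The helper names are hypothetical, and the actual length limit enforced by build-namelist is not stated in this thread, so the limit is left as a parameter.

```python
def fsurdat_path(test_root, test_id, case_name, filename):
    """Build a gddgen-style surface dataset path (layout assumed from the log)."""
    suite_dir = f"tests_{test_id}"
    # The test-id shows up a second time inside the case directory name.
    case_dir = f"{case_name}.GC.{test_id}_int.gddgen"
    return "/".join([test_root, suite_dir, case_dir, filename])


def namelist_string_fits(value, max_len):
    """True if `value` fits in a namelist string element of `max_len` characters."""
    return len(value) <= max_len


if __name__ == "__main__":
    for test_id in ("0717-151608de", "pr2507-aux_clm-final"):
        path = fsurdat_path("/scratch/tests", test_id, "CASE", "surfdata.nc")
        print(test_id, len(path))
```

Because the test-id is embedded twice, every extra character in a manually chosen test-id costs two characters of path length.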

The LILACSMOKE build failure is not clear to me yet.

Jul 09 '24 16:07 glemieux

@samsrabin @glemieux have been going over this extensively. Greg is going to do the testing for this, and then he'll pass it off for someone else to complete the tag. Since, @samsrabin, you are the author (and we wanted to cycle through FATES tags), we figure you should be the one to make the actual tag. These are just the final steps of the tagging process.

So start at step 16 from:

https://github.com/ESCOMP/CTSM/wiki/Protocols-on-updating-FATES-within-CTSM#fates-updates-that-include-api-changes

When I've done this myself I've typically done some double checking of the work. So I have done up to all of the following steps:

1. Review the final CTSM changes again myself to check for any glaring problems that should kick it back for more testing
2. Make sure .gitmodules is pointing to an NGEET FATES tag and not a personal branch/hash
3. Make sure the testing baselines were run, have standard names, and are on both izumi and Derecho for both aux_clm and fates tests
4. Double check test results are as expected (looking at their test cases, found either by them giving the directory names in the PR or by looking at the tail of CaseDocs/lnd_in in a test for the baseline directory name)
5. Double check that the fates version used for the baselines is the same as in .gitmodules for the PR
6. Review the ChangeLog (maybe do simple updates for clarity or have the author update it if really needed)
7. Update the date in the ChangeLog if a day or more has passed (commit and push it to the PR branch)

The double checking has the intent of doing quick checking to make sure everything is good and we won't have problems later. I trust everyone on the project, but I also appreciate having my own work double checked to help prevent problems that become more involved to track down. FATES tags are also more involved than regular tags and having one person do the FATES tag/testing and another finalize the CTSM side has been a good workflow for us.

Pinging @adrifoster as she'll be doing these final steps as well. I'm adding the above steps to a CTSM SE discussion so we can settle on what we all think should be required and what can be optional (I'm thinking require up to step 3, with the later ones optional).

Jul 12 '24 18:07 ekluzek

@samsrabin @ekluzek aux_clm testing against ctsm5.2.011 on derecho is underway.

I'm going to see if I can get things going on izumi.

Jul 12 '24 23:07 glemieux

@ekluzek @samsrabin Regression testing on derecho is complete and shows B4B results against ctsm5.2.011 for all non-fates tests with one exception. The RXCROPMATURITYSKIPGEN_Ld1097.f10_f10_mg37.IHistClm50BgcCrop.derecho_intel.clm-cropMonthOutput test failed RUN again with the error that I saw last time:

161 2024-07-12 18:47:39: ERROR: Command /glade/u/home/glemieux/ctsm/bld/build-namelist failed rc=255
162 out=
163 err=ERROR : CLM build-namelist::CLMBuildNamelist::process_namelist_commandline_infile() : Invalid namelist variable in '-infile' /glade/u/home/glemieux/scratch/ctsm-tests/tests_pr2507-aux_clm-final/RXCROPMATURITYSKIPGEN_Ld1097.f10_f10_mg37.IHistClm50BgcCrop.derecho_intel.clm-cropMonthOutput.GC.pr2507-aux_clm-final_int.gddgen/Buildconf/clmconf/namelist.
164  ERROR: in validate_variable_value (package Build::Namelist): Variable name fsurdat has a string element that is too long: '/glade/u/home/glemieux/scratch/ctsm-tests/tests_pr2507-aux_clm-final/RXCROPMATURITYSKIPGEN_Ld1097.f10_f10_mg37.IHistClm50BgcCrop.derecho_intel.clm-cropMonthOutput.GC.pr2507-aux_clm-final_int.gddgen/surfdata_10x15_hist_1850_78pfts_c240216.all_crops_everywhere.nc'

That said, I didn't see an issue for this. Is it a known issue?

Jul 13 '24 21:07 glemieux

@glemieux Looks like it's the same error as in #2322. Don't worry about it; I have this working in my current dev branch, and I strongly doubt this PR does anything to break it.

Jul 15 '24 15:07 samsrabin

@samsrabin and @glemieux maybe the only thing to do about that test is to mark it as an expected fail for this tag? Especially since it will likely be a bit before the fix can get in (based on the queue of tags).

Jul 15 '24 15:07 ekluzek

@glemieux No, it's not always an expected fail. I think Greg's use of a manual name for his test suite run is the culprit here, because it's usually okay.

Jul 15 '24 16:07 samsrabin

Testing on izumi is largely B4B with DIFFs only showing up for fates testmods as expected. That said, there was a floating invalid exception caught due to a fates-side issue that I've recorded in https://github.com/NGEET/fates/issues/1221. I've got a fix that I'll turn into a fates PR this morning. I'll rerun all tests with the update after the fates merge.

Jul 15 '24 16:07 glemieux

Status update: running aux_clm regression testing on izumi and derecho

Jul 17 '24 21:07 glemieux

Regression testing aux_clm on izumi against ctsm5.0.12 is nearly complete. I'm just waiting on the 20 year crop testmod. All other expected tests are B4B aside from the expected DIFFs from the fates testmods.

Location: /scratch/cluster/glemieux/ctsm-tests/tests_0717-144328iz

UPDATE: the last testmod, the long Ly20 cropmonthly test, came back b4b.

Jul 17 '24 22:07 glemieux

Regression testing of aux_clm on derecho finished up overnight. All non-fates tests are B4B. Fates test DIFFs are expected as the update incorporates a number of other fates-side answer changes. There were a few TPUTCOMP "fails", although these are all on the clm side of things.

Location: /glade/derecho/scratch/glemieux/ctsm-tests/tests_0717-151608de

Jul 18 '24 16:07 glemieux

Woot! Thanks @glemieux @samsrabin @ekluzek @adrifoster

Jul 18 '24 23:07 ckoven