New CN matrix fails for single-point sites with the new ctsm5.3 datasets.
From ctsm5.2.dev175 through ctsm5.3.0 we've been running a test of MIMICS with the above-ground CN matrix, and it has been passing. The test is SMS_D.1x1_brazil.I1850Clm60BgcCrop.derecho_gnu.clm-mimics_matrixcn. It has the soil CN matrix off (because MIMICS is non-linear) but the above-ground CN matrix on (use_soil_matrixcn = .false., use_matrixcn = .true.).
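For reference, a minimal user_nl_clm sketch of that configuration (just the two flags named above):

 use_matrixcn      = .true.    ! above-ground CN matrix on
 use_soil_matrixcn = .false.   ! soil CN matrix off, since MIMICS is non-linear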
There are two reasons for doing this test:
- Hopefully get MIMICS to spin up faster with the above-ground matrix on
- Exercise the matrix code more extensively with an edge case where it might fail more easily
The hope for "1" was especially there as we weren't finding methods to speed up the spinup of MIMICS. The test did pass for 30 tags, and just started failing in ctsm5.3.0 with the following type of error in the log files:
lnd.log:
hist_htapes_wrapup : Closing local history file ./SMS_D.1x1_brazil.I1850Clm60BgcCrop.derecho_gnu.clm-mimics_matrixcn.20240923_125029_ialh14.clm2.h1.0001-01-01-28800.nc at nstep = 16
(shr_strdata_readstrm) reading file ub: /glade/campaign/cesm/cesmdata/inputdata/atm/datm7/NASA_LIS/clmforc.Li_2016_climo1995-2013.360x720.lnfm_Total_c160825.nc 7
ERROR: ERROR in /glade/work/erik/ctsm_worktrees/answer_changes/src/utils/SparseMatrixMultiplyMod.F90 at line 1246
cesm.log:
dec0996.hsn.de.hpc.ucar.edu 0: ERROR: ERROR in /glade/work/erik/ctsm_worktrees/answer_changes/src/utils/SparseMatrixMultiplyMod.F90 at line 1246
dec0996.hsn.de.hpc.ucar.edu 0: #0 0x12c3b50 in __shr_abort_mod_MOD_shr_abort_backtrace
dec0996.hsn.de.hpc.ucar.edu 0: at /glade/work/erik/ctsm_worktrees/answer_changes/share/src/shr_abort_mod.F90:104
dec0996.hsn.de.hpc.ucar.edu 0: #1 0x12c3c13 in __shr_abort_mod_MOD_shr_abort_abort
dec0996.hsn.de.hpc.ucar.edu 0: at /glade/work/erik/ctsm_worktrees/answer_changes/share/src/shr_abort_mod.F90:61
dec0996.hsn.de.hpc.ucar.edu 0: #2 0x131f9c8 in __shr_assert_mod_MOD_shr_assert
dec0996.hsn.de.hpc.ucar.edu 0: at /glade/work/erik/ctsm_worktrees/answer_changes/share/src/shr_assert_mod.F90.in:95
dec0996.hsn.de.hpc.ucar.edu 0: #3 0xe38814 in __sparsematrixmultiplymod_MOD_spmp_abc
dec0996.hsn.de.hpc.ucar.edu 0: at /glade/work/erik/ctsm_worktrees/answer_changes/src/utils/SparseMatrixMultiplyMod.F90:1246
dec0996.hsn.de.hpc.ucar.edu 0: #4 0x8e97db in __cnvegmatrixmod_MOD_cnvegmatrix
dec0996.hsn.de.hpc.ucar.edu 0: at /glade/work/erik/ctsm_worktrees/answer_changes/src/biogeochem/CNVegMatrixMod.F90:1509
dec0996.hsn.de.hpc.ucar.edu 0: #5 0x10466ef in __cndrivermod_MOD_cndriverleaching
dec0996.hsn.de.hpc.ucar.edu 0: at /glade/work/erik/ctsm_worktrees/answer_changes/src/biogeochem/CNDriverMod.F90:1098
dec0996.hsn.de.hpc.ucar.edu 0: #6 0x92a6b2 in __cnvegetationfacade_MOD_ecosystemdynamicspostdrainage
dec0996.hsn.de.hpc.ucar.edu 0: at /glade/work/erik/ctsm_worktrees/answer_changes/src/biogeochem/CNVegetationFacade.F90:1125
dec0996.hsn.de.hpc.ucar.edu 0: #7 0x5d7ed6 in __clm_driver_MOD_clm_drv
dec0996.hsn.de.hpc.ucar.edu 0: at /glade/work/erik/ctsm_worktrees/answer_changes/src/main/clm_driver.F90:1119
The line it fails on in the backtrace above is the SHR_ASSERT_FL in this section of code in SparseMatrixMultiplyMod.F90:
if (present(num_actunit_C)) then
   if (num_actunit_C < 0) then
      write(iulog,*) "error: num_actunit_C cannot be less than 0"
      call endrun( subname//" ERROR: bad value for num_actunit_C" )
      return
   end if
   if (.not. present(filter_actunit_C)) then
      write(iulog,*) "error: num_actunit_C is presented but filter_actunit_C is missing"
      call endrun( subname//" ERROR: missing required optional arguments" )
      return
   end if
   SHR_ASSERT_FL((size(filter_actunit_C) > num_actunit_C), sourcefile, __LINE__)
end if
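For context, SHR_ASSERT_FL is a preprocessor macro from the CESM share code. As a hedged sketch (the exact macro text lives in shr_assert.h), it expands to roughly:

 call shr_assert((size(filter_actunit_C) > num_actunit_C), file=sourcefile, line=__LINE__)

so when the condition is false it goes through the shr_assert -> shr_abort chain visible in the backtrace above.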
The call in CNVegMatrixMod.F90 is here:
if (num_actfirep .eq. 0 .and. nthreads < 2) then
   call AKallvegc%SPMP_AB(num_soilp, filter_soilp, AKphvegc, AKgmvegc, list_ready_phgmc, &
        list_A=list_phc_phgm, list_B=list_gmc_phgm, &
        NE_AB=NE_AKallvegc, RI_AB=RI_AKallvegc, CI_AB=CI_AKallvegc)
else
   call AKallvegc%SPMP_ABC(num_soilp, filter_soilp, AKphvegc, AKgmvegc, AKfivegc, list_ready_phgmfic, &
        list_A=list_phc_phgmfi, list_B=list_gmc_phgmfi, list_C=list_fic_phgmfi, &
        NE_ABC=NE_AKallvegc, RI_ABC=RI_AKallvegc, CI_ABC=CI_AKallvegc, &
        use_actunit_list_C=.True., num_actunit_C=num_actfirep, filter_actunit_C=filter_actfirep)
end if
Definition of done:
- [x] FAIL: Test whether it works for a cold start
- [x] NO: Assess whether we should add a short f10 test and make sure it works
- [x] Change the code accordingly, based on what is found from the above
This is the only test we have for mimics_matrixcn. It's also possible that the tests that passed would fail if run out far enough.
Here's the note about this test when it was added.
https://github.com/ESCOMP/CTSM/pull/640#issuecomment-1074302305
I'm also running some longer and different tests in ctsm5.2.028 to see whether the test just happened to pass because it was too short, as well as making sure the same test works without MIMICS.
Longer tests and tests at f10 in ctsm5.2.028 seem to be fine:
SMS_D.1x1_brazil.I1850Clm60BgcCrop.derecho_intel.clm-mimics_matrixcn
SMS_D.f10_f10_mg37.I1850Clm60BgcCrop.derecho_intel.clm-mimics_matrixcn
SMS_D_Lm1.1x1_brazil.I1850Clm60BgcCrop.derecho_gnu.clm-mimics_matrixcn
SMS_Ly2.1x1_brazil.I1850Clm60BgcCrop.derecho_gnu.clm-mimics_matrixcn
So maybe there is something specific about this failure with the ctsm5.3.0 datasets.
We'll mark this as an expected fail for now though.
The other tests that fail in the same way are:
SMS_Ld10_D_Mmpi-serial.CLM_USRDAT.I1PtClm60Bgc.derecho_gnu.clm-default--clm-NEON-HARV--clm-matrixcnOn
...and SMS_Ld10_D_Mmpi-serial.CLM_USRDAT.I1PtClm60Bgc.izumi_nag.clm-default--clm-NEON-HARV--clm-matrixcnOn
My gut feeling is that these tests need new finidat files, based on past experience where CNmatrix has crashed with one finidat and not with another (#2592).
E.g., the nearest neighbor found in the finidat may not contain the right pft combinations needed for these single-point simulations.
In one of the failing tests, I changed finidat from
ctsm52026_f09_pSASU.clm2.r.0421-01-01-00000.nc
to
clmi.f19_interp_from.I1850Clm50BgcCrop-ciso.1366-01-01.0.9x1.25_gx1v7_simyr1850_c240223.nc
and the test failed at a different timestep.
Next I want to try setting finidat to the interpolated file saved in
.../tests_0923-141750de/SMS_Ld10_D_Mmpi-serial.CLM_USRDAT.I1PtClm60Bgc.derecho_gnu.clm-default--clm-NEON-HARV--clm-matrixcnOn.GC.0923-141750de_gnu/run/init_generated_files/
Hmm, but that may do nothing to help. I may need to generate a new finidat for this point starting from a cold start simulation.
Broader questions that we (@slevis-lmwg and I) would like the group to assess (discussed at the CTSM SE meeting on Oct 10, 2024):
- Should we provide IC files for single-point sites? No, except NEON.
- Just some (like NEON), just the ones we test, or all? All NEON would be good. The current process creates them outside of tags, though, which may be fine for now.
- When we run matrix and run into problems like this, do we fix it with updated IC files as a practice? Only for global grids. For single point, just change to a cold start.
Maybe matrix tests always need to start from a cold start? If you're running matrix, then by definition you're doing a spinup.
I updated the questions above based on this morning's discussion.
Troubleshooting suggests that my gut feeling was wrong.
SMS_D.1x1_brazil.I1850Clm60BgcCrop.derecho_gnu.clm-mimics_matrixcn started cold all along and it failed regardless, so I tried the following:
I turned off matrixcn and ran the case to generate a restart file. Then I turned on matrixcn and set finidat to this restart file. The simulation failed at the same line as before.
SMS_Ld10_D_Mmpi-serial.CLM_USRDAT.I1PtClm60Bgc.derecho_gnu.clm-default--clm-NEON-HARV--clm-matrixcnOn never started cold. I turned off matrix and generated a restart file. Then I turned on SASU and set finidat to this restart file. The simulation failed at the same line as before.
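For reference, the two-step namelist recipe used in both experiments above looks roughly like this (the restart filename is hypothetical):

 ! Step 1, user_nl_clm: matrix off, run far enough to write a restart file
 use_matrixcn = .false.

 ! Step 2, user_nl_clm: matrix (or SASU) back on, starting from the step-1 restart
 use_matrixcn = .true.
 finidat = '<casename>.clm2.r.0001-01-11-00000.nc'  ! hypothetical path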
1x1 matrix tests that pass:
ERS_Lm54_Mmpi-serial.1x1_numaIA.I2000Clm50BgcCrop.derecho_intel.clm-ciso_monthly_matrixcn_spinup
ERS_Ly5_Mmpi-serial.1x1_smallvilleIA.I1850Clm50BgcCrop.izumi_gnu.clm-ciso_monthly--clm-matrixcnOn
ERS_Ly6_Mmpi-serial.1x1_smallvilleIA.IHistClm50BgcCropQianRs.izumi_intel.clm-cropMonthOutput--clm-matrixcnOn_ignore_warnings
ERS_Ly20_Mmpi-serial.1x1_numaIA.I2000Clm50BgcCropQianRs.izumi_intel.clm-cropMonthlyNoinitial--clm-matrixcnOn.GC.1014-115134iz_int
- A common element among the tests that pass is Clm50, and among the tests that fail it is Clm60; BUT our global Clm60 tests pass, so this observation may be irrelevant.
- Another difference: the two failing tests use DEBUG while the passing tests do not. Again, though, our global matrix tests pass regardless.
Trying a Clm60 version and a Clm60 DEBUG version of the first test in the above list of already-passing tests:
PASS ERS_Ld5_Mmpi-serial.1x1_numaIA.I2000Clm60BgcCrop.derecho_intel.clm-ciso_monthly_matrixcn_spinup
PASS ERS_D_Ld5_Mmpi-serial.1x1_numaIA.I2000Clm60BgcCrop.derecho_intel.clm-ciso_monthly_matrixcn_spinup
and non-DEBUG versions of the failing tests:
PASS SMS_Ld10_Mmpi-serial.CLM_USRDAT.I1PtClm60Bgc.derecho_gnu.clm-default--clm-NEON-HARV--clm-matrixcnOn
PASS SMS.1x1_brazil.I1850Clm60BgcCrop.derecho_gnu.clm-mimics_matrixcn
So DEBUG must be uncovering a problem in these two. I will think about what I want to try next...
I added diagnostic write-statements just before the error gets triggered in SparseMatrixMultiplyMod.F90 line 1246:
SHR_ASSERT_FL((size(filter_actunit_C) > num_actunit_C), sourcefile, __LINE__)
and both failing tests fail when they encounter
size(filter_actunit_C) = num_actunit_C
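A standalone Fortran sketch (not CTSM code) of this boundary case: a filter sized exactly to the number of active units keeps every access in bounds, yet the strict ">" check rejects it:

 program filter_size_check
   implicit none
   integer, parameter :: num_actunit_C = 3
   integer :: filter_actunit_C(num_actunit_C)

   filter_actunit_C = [1, 2, 3]
   ! Every access the multiply performs stays within bounds:
   print *, filter_actunit_C(1:num_actunit_C)
   print *, 'strict  > : ', size(filter_actunit_C) >  num_actunit_C  ! F: assert aborts
   print *, 'relaxed >=: ', size(filter_actunit_C) >= num_actunit_C  ! T: assert passes
 end program filter_size_check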
This seems like a non-dealbreaker to me, so I changed the assertion to ">=".
Allowing equality gets the currently failing tests to pass without triggering other problems.
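The one-line change:

 SHR_ASSERT_FL((size(filter_actunit_C) >= num_actunit_C), sourcefile, __LINE__)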
@ekluzek I will run this by you before I open a PR with this code change.
My branch is in this directory: /glade/work/slevis/git/LMWG_dev8 and I will open the PR with git push -u slevis-lmwg fix_1x1_matrix_fails
@slevis-lmwg that's correct, the inequality should be >= rather than just >. The point there is just to make sure the array size isn't too small. The array must have been larger all along previously; I'd have to think about why that's the case...
I'm glad you were able to figure that out.
@slevis-lmwg is this still a live issue, or was it addressed by #2840?
#2840 resolved it. I merged that to https://github.com/ESCOMP/CTSM/tree/cesm3_0_beta04_changes
The issue probably remains open because https://github.com/ESCOMP/CTSM/tree/cesm3_0_beta04_changes has not been merged to master.
Gotcha, thanks for the clarification