HAS WORKAROUND: Restart fails in IHistClm60BgcCropCrujra case
Brief summary of bug
@samsrabin reported that the first lnd.log file generated after the restart ends with this information:
using alpha-ctsm5.4.CMIP7.09.ctsm5.3.068.test.clm2.rh0i.1958-01-23-00000.nc in current working directory
ERROR: reading variable: num2d
ERROR in /glade/work/slevis/git/ans_chging_tags/src/main/ncdio_pio.F90.in at line 1508
General bug information
CTSM version you are using: alpha-ctsm5.4.CMIP7.09.ctsm5.3.068
Does this bug cause significantly incorrect results in the model's science? No, but restart fails under the circumstances described below.
Configurations affected: Restarts, though aux_clm was OK in /glade/derecho/scratch/slevis/tests_0815-150711de, so I need to investigate.
Details of bug
Important details of your setup / configuration so we can reproduce the bug
@samsrabin reported the problem, and I reproduced it with these cases:
~samrabin/cases_ctsm_crop_reparam/alpha-ctsm5.4.CMIP7.09.ctsm5.3.068.test
~slevis/cases_dev/alpha-ctsm5.4.CMIP7.09.ctsm5.3.068.test
Details of my case:
./create_newcase --compset IHistClm60BgcCropCrujra --res f10_f10_mg37 --case ~slevis/cases_dev/alpha-ctsm5.4.CMIP7.09.ctsm5.3.068.test --run-unsupported
user_nl_clm (same in both cases):
hist_empty_htapes = .true.
! h0: Instantaneous annual crop variables
hist_fincl1 = 'GRAINC_TO_FOOD_PERHARV', 'GRAINC_TO_FOOD_ANN', 'SDATES', 'SDATES_PERHARV', 'SYEARS_PERHARV', 'HDATES', 'GDDHARV_PERHARV', 'GDDACCUM_PERHARV', 'HUI_PERHARV', 'SOWING_REASON_PERHARV', 'HARVEST_REASON_PERHARV'
hist_nhtfrq(1) = 17520 ! annual saves
hist_mfilt(1) = 1 ! new file every save (so annual files)
hist_type1d_pertape(1) = 'PFTS'
hist_dov2xy(1) = .false.
! (h1) Non-instantaneous (e.g. average or max) crop variables
hist_fincl2 = 'GDD0', 'GDD8', 'GDD10', 'GDD0X', 'GDD8X', 'GDD10X', 'GDD020', 'GDD820', 'GDD1020'
hist_nhtfrq(2) = 17520 ! annual saves
hist_mfilt(2) = 1 ! new file every save (so annual files)
hist_type1d_pertape(2) = 'PFTS'
hist_dov2xy(2) = .false.
env_run.xml (same in both cases):
<entry id="RUN_STARTDATE" value="1958-01-01">
<entry id="STOP_OPTION" value="ndays">
<entry id="STOP_N" value="22">
<entry id="CONTINUE_RUN" value="TRUE">
<entry id="RESUBMIT" value="1">
cesm.log traceback indicates histfilemod_mp_hi 4983 histFileMod.F90
Thanks, @slevis-lmwg. This is unfortunately blocking work on the CRU-JRA updates to the crop calendar input files (#3321), although that could be resolved if we wanted to bring those new input files in on master instead of the 5.4 alpha branch. (Although maybe we'd hit this same problem with master??)
In alpha-ctsm5.4.CMIP7.09.ctsm5.3.068.test.clm2.rh1a.1958-01-23-00000.nc:
num2d(max_nflds) = 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0 ;
In alpha-ctsm5.4.CMIP7.09.ctsm5.3.068.test.clm2.rh0i.1958-01-23-00000.nc:
num2d(max_nflds) = 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1 ;
where max_nflds = 11 in both files.
Troubleshooting ideas with initial goal to find a quick fix:
- I'm trying the reverse order in the h0, h1 specifications in user_nl_clm
- If it fails, I will try removing hist_empty_htapes = .true.
- If it fails, I will try restarting after a full year
[...] (Although maybe we'd hit this same problem with master??)
I suspect so: very little difference at this point between the alpha-ctsm5.4.CMIP7 branch and master.
@samsrabin possibly good news... I reversed the order in the h0, h1 specifications in user_nl_clm and the test passed: /glade/derecho/scratch/slevis/alpha-ctsm5.4.CMIP7.09.ctsm5.3.068.test/run
Could you confirm that you get the same behavior with your case? If so, you can proceed with your work, and I will dig deeper into the source of this problem.
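For anyone wanting to try the same workaround, here is a sketch of what the reversed user_nl_clm might look like. The file actually used in the passing test is not shown in this thread, so this assumes that "reversing the order" simply means swapping the two hist_fincl lists (all other per-tape settings were already identical between h0 and h1):

```fortran
hist_empty_htapes = .true.

! h0 (previously h1): Non-instantaneous (e.g. average or max) crop variables
hist_fincl1 = 'GDD0', 'GDD8', 'GDD10', 'GDD0X', 'GDD8X', 'GDD10X', 'GDD020', 'GDD820', 'GDD1020'
hist_nhtfrq(1) = 17520 ! annual saves
hist_mfilt(1) = 1 ! new file every save (so annual files)
hist_type1d_pertape(1) = 'PFTS'
hist_dov2xy(1) = .false.

! h1 (previously h0): Instantaneous annual crop variables
hist_fincl2 = 'GRAINC_TO_FOOD_PERHARV', 'GRAINC_TO_FOOD_ANN', 'SDATES', 'SDATES_PERHARV', 'SYEARS_PERHARV', 'HDATES', 'GDDHARV_PERHARV', 'GDDACCUM_PERHARV', 'HUI_PERHARV', 'SOWING_REASON_PERHARV', 'HARVEST_REASON_PERHARV'
hist_nhtfrq(2) = 17520 ! annual saves
hist_mfilt(2) = 1 ! new file every save (so annual files)
hist_type1d_pertape(2) = 'PFTS'
hist_dov2xy(2) = .false.
```

Note that any scripts that read the output by tape name would need to account for the variables having moved between h0 and h1.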
Random thought: Could this be related to https://github.com/ESCOMP/CTSM/issues/3404? Are those h0 fields all instantaneous? If so, then an h0a file will not be generated and maybe you'll encounter that error?
@olyson not so random, seems very likely! Thank you for pointing out #3404 to us. It's possible I had seen it before and forgotten. The good news (for me) is that the bug predates my h0a/h0i updates (sigh of relief).
@samsrabin I removed "blocker" and added "next" to bring this up at tomorrow's software meeting. Now that this has come up a second time, we may decide to raise its priority. Though I see that #3404 has a Jan 7th milestone, so maybe that's still ok.
See suggested fix to try in #3404
@slevis-lmwg Have you had a chance to try that fix? If not, we can try adding a single time-averaged variable to h0 to avoid the issue entirely.
I have not tried Keith's suggested fix.
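For reference, the fallback mentioned above (adding a single time-averaged variable to h0) might look like the sketch below. The choice of 'GDD0' is an assumption for illustration only: it is borrowed from the h1 list, which the original user_nl_clm describes as non-instantaneous. This has not been tested in this thread, and it has not been confirmed that it causes an h0a file to be written or that the restart then succeeds.

```fortran
! h0: Instantaneous annual crop variables, plus one non-instantaneous field.
! 'GDD0' (last entry) is an illustrative assumption borrowed from the h1 list.
hist_fincl1 = 'GRAINC_TO_FOOD_PERHARV', 'GRAINC_TO_FOOD_ANN', 'SDATES', 'SDATES_PERHARV', 'SYEARS_PERHARV', 'HDATES', 'GDDHARV_PERHARV', 'GDDACCUM_PERHARV', 'HUI_PERHARV', 'SOWING_REASON_PERHARV', 'HARVEST_REASON_PERHARV', 'GDD0'
```

The remaining h0 and h1 settings would stay as in the original user_nl_clm.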