CTSM HAS WORKAROUND: Restart fails in IHistClm60BgcCropCrujra case

Brief summary of bug

@samsrabin reported: First restart-generated lnd.log file ends with this information:

using alpha-ctsm5.4.CMIP7.09.ctsm5.3.068.test.clm2.rh0i.1958-01-23-00000.nc in current working directory
ERROR: reading variable: num2d 
ERROR in /glade/work/slevis/git/ans_chging_tags/src/main/ncdio_pio.F90.in at line 1508

General bug information

CTSM version you are using: alpha-ctsm5.4.CMIP7.09.ctsm5.3.068

Does this bug cause significantly incorrect results in the model's science? No, but restart fails under the circumstances described below.

Configurations affected: Restarts, though aux_clm was OK in /glade/derecho/scratch/slevis/tests_0815-150711de, so I need to investigate.

Details of bug

Important details of your setup / configuration so we can reproduce the bug

@samsrabin reported the problem, and I reproduced it with these cases:
~samrabin/cases_ctsm_crop_reparam/alpha-ctsm5.4.CMIP7.09.ctsm5.3.068.test
~slevis/cases_dev/alpha-ctsm5.4.CMIP7.09.ctsm5.3.068.test

Details of my case: ./create_newcase --compset IHistClm60BgcCropCrujra --res f10_f10_mg37 --case ~slevis/cases_dev/alpha-ctsm5.4.CMIP7.09.ctsm5.3.068.test --run-unsupported

user_nl_clm (same in both cases):

hist_empty_htapes = .true.

! h0: Instantaneous annual crop variables
hist_fincl1 = 'GRAINC_TO_FOOD_PERHARV', 'GRAINC_TO_FOOD_ANN', 'SDATES', 'SDATES_PERHARV', 'SYEARS_PERHARV', 'HDATES', 'GDDHARV_PERHARV', 'GDDACCUM_PERHARV', 'HUI_PERHARV', 'SOWING_REASON_PERHARV', 'HARVEST_REASON_PERHARV'
hist_nhtfrq(1) = 17520   ! annual saves
hist_mfilt(1) = 1        ! new file every save (so annual files)
hist_type1d_pertape(1) = 'PFTS'
hist_dov2xy(1) = .false.

! (h1) Non-instantaneous (e.g. average or max) crop variables
hist_fincl2 = 'GDD0', 'GDD8', 'GDD10', 'GDD0X', 'GDD8X', 'GDD10X', 'GDD020', 'GDD820', 'GDD1020'
hist_nhtfrq(2) = 17520   ! annual saves
hist_mfilt(2) = 1        ! new file every save (so annual files)
hist_type1d_pertape(2) = 'PFTS'
hist_dov2xy(2) = .false.

env_run.xml (same in both cases):

<entry id="RUN_STARTDATE" value="1958-01-01">
<entry id="STOP_OPTION" value="ndays">
<entry id="STOP_N" value="22">
<entry id="CONTINUE_RUN" value="TRUE">
<entry id="RESUBMIT" value="1">

Aug 27 '25 23:08 slevis-lmwg

The cesm.log traceback points to line 4983 of histFileMod.F90 (routine name truncated in the traceback to histfilemod_mp_hi).

Aug 27 '25 23:08 slevis-lmwg

Thanks, @slevis-lmwg. This is unfortunately blocking work on the CRU-JRA updates to the crop calendar input files (#3321), although that could be resolved if we wanted to bring those new input files in on master instead of the 5.4 alpha branch. (Although maybe we'd hit this same problem with master??)

Aug 27 '25 23:08 samsrabin

In alpha-ctsm5.4.CMIP7.09.ctsm5.3.068.test.clm2.rh1a.1958-01-23-00000.nc:
num2d(max_nflds) = 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0 ;

In alpha-ctsm5.4.CMIP7.09.ctsm5.3.068.test.clm2.rh0i.1958-01-23-00000.nc:
num2d(max_nflds) = 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1 ;

where max_nflds = 11 in both files.

Troubleshooting ideas, with the initial goal of finding a quick fix:

I'm trying the h0 and h1 specifications in reverse order in user_nl_clm.
If that fails, I will try removing hist_empty_htapes = .true.
If that also fails, I will try restarting after a full year.

Aug 28 '25 00:08 slevis-lmwg

[...] (Although maybe we'd hit this same problem with master??)

I suspect so: very little difference at this point between the alpha-ctsm5.4.CMIP7 branch and master.

Aug 28 '25 00:08 slevis-lmwg

@samsrabin possibly good news... I reversed the order in the h0, h1 specifications in user_nl_clm and the test passed: /glade/derecho/scratch/slevis/alpha-ctsm5.4.CMIP7.09.ctsm5.3.068.test/run

Could you confirm that you get the same behavior with your case? If so, you can proceed with your work, and I will dig deeper into the source of this problem.
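
For reference, a sketch of what the reversed user_nl_clm might look like. This assumes the reversal simply swaps the two tape blocks from the original report, so that the non-instantaneous fields land on h0 and the instantaneous ones on h1. The exact file used in the passing test is not shown in this thread.

```fortran
hist_empty_htapes = .true.

! h0: Non-instantaneous (e.g. average or max) crop variables
! (assumption: this is the block that was previously on h1)
hist_fincl1 = 'GDD0', 'GDD8', 'GDD10', 'GDD0X', 'GDD8X', 'GDD10X', 'GDD020', 'GDD820', 'GDD1020'
hist_nhtfrq(1) = 17520   ! annual saves
hist_mfilt(1) = 1        ! new file every save (so annual files)
hist_type1d_pertape(1) = 'PFTS'
hist_dov2xy(1) = .false.

! h1: Instantaneous annual crop variables
! (assumption: this is the block that was previously on h0)
hist_fincl2 = 'GRAINC_TO_FOOD_PERHARV', 'GRAINC_TO_FOOD_ANN', 'SDATES', 'SDATES_PERHARV', 'SYEARS_PERHARV', 'HDATES', 'GDDHARV_PERHARV', 'GDDACCUM_PERHARV', 'HUI_PERHARV', 'SOWING_REASON_PERHARV', 'HARVEST_REASON_PERHARV'
hist_nhtfrq(2) = 17520   ! annual saves
hist_mfilt(2) = 1        ! new file every save (so annual files)
hist_type1d_pertape(2) = 'PFTS'
hist_dov2xy(2) = .false.
```

Any post-processing that expects the crop variables on h0 would need to read them from h1 instead.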

Aug 28 '25 00:08 slevis-lmwg

Random thought: Could this be related to https://github.com/ESCOMP/CTSM/issues/3404? Are those h0 fields all instantaneous? If so, then an h0a file will not be generated and maybe you'll encounter that error?

Aug 28 '25 00:08 olyson

@olyson not so random, seems very likely! Thank you for pointing out #3404 to us. It's possible I had seen it before and forgotten. The good news (for me) is that the bug predates my h0a/h0i updates (sigh of relief).

Aug 28 '25 01:08 slevis-lmwg

@samsrabin I removed "blocker" and added "next" to bring this up at tomorrow's software meeting. Now that this has come up a second time, we may decide to raise its priority. Though I see that #3404 has a Jan 7th milestone, so maybe that's still ok.

Aug 28 '25 01:08 slevis-lmwg

See suggested fix to try in #3404

Aug 28 '25 16:08 slevis-lmwg

@slevis-lmwg Have you had a chance to try that fix? If not, we can try adding a single time-averaged variable to h0 to avoid the issue entirely.
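
A sketch of that alternative, keeping the original tape order and adding one non-instantaneous field to h0. The choice of 'GDD0' is an assumption: it is taken from the h1 list in the original report, which describes those fields as non-instantaneous. This has not been tested in this thread.

```fortran
! h0: Instantaneous annual crop variables, plus one non-instantaneous field
! (assumption: 'GDD0' is time-averaged or otherwise non-instantaneous by default,
!  so h0 no longer contains only instantaneous fields)
hist_fincl1 = 'GRAINC_TO_FOOD_PERHARV', 'GRAINC_TO_FOOD_ANN', 'SDATES', 'SDATES_PERHARV', 'SYEARS_PERHARV', 'HDATES', 'GDDHARV_PERHARV', 'GDDACCUM_PERHARV', 'HUI_PERHARV', 'SOWING_REASON_PERHARV', 'HARVEST_REASON_PERHARV', 'GDD0'
```

The rest of user_nl_clm would stay as in the original report.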

Sep 16 '25 17:09 samsrabin

I have not tried Keith's suggested fix.

Sep 16 '25 18:09 slevis-lmwg