
Methane conservation error in CH4Mod doing a hybrid restart with CISM%EVOLVE over Greenland


Describe the bug
I am doing a hybrid restart to turn on CISM over Greenland, branching from year 101 of the picontrol run n1850.ne30_tn14.nor3_b01-cplhist-noLU.20250716 (https://github.com/NorESMhub/noresm3_dev_simulations/issues/194). The model runs for one year, then crashes right after restarting due to a methane conservation error in ch4Mod.F90. I have tried to repeat the same procedure without turning on CISM (i.e., doing a hybrid restart with the exact same setup as in n1850.ne30_tn14.nor3_b01-cplhist-noLU.20250716), and in that case the model runs for 5 years with no issues.

  • NorESM version: noresm3_0_beta01
  • HPC platform: betzy
  • Compiler: intel
  • Compset: 1850_CAM70%LT%NORESM%CAMoslo_CLM60%FATES_CICE_BLOM%HYB%ECO_MOSART_CISM2%GRIS-EVOLVE_SWAV_SESP
  • Resolution: ne30pg3_tn14_gris4
  • Error message:
    From cesm.log: Gridcell-level CH4 Conservation Error in CH4Mod driver
    From ESMF log: Methane conservation error; ERROR in ch4Mod.F90 at line 2352

To Reproduce
Steps to reproduce the behavior:

  1. Get noresm3_0_beta01

  2. /cluster/projects/nn11022k/mpet/NorESM/Repository/noresm3_0_beta01/cime/scripts/create_newcase --case "${CASEDIR}" --compset 1850_CAM70%LT%NORESM%CAMoslo_CLM60%FATES_CICE_BLOM%HYB%ECO_MOSART_CISM2%GRIS-EVOLVE_SWAV_SESP --res ne30pg3_tn14_gris4 --machine betzy --project nn11022k --q normal --walltime 48:00:00 --pecount L --run-unsupported --compiler intel --user-mods-dir /cluster/projects/nn11022k/mpet/NorESM/Repository/noresm3_0_beta01/cime_config/usermods_dirs/reduced_out_devsim/

  3. ./xmlchange RUN_TYPE=hybrid
     ./xmlchange RUN_REFDIR=/cluster/projects/nn11022k/mpet/cmip7_testrestart_grisonly/restarts/n1850.ne30_tn14.nor3_b01-cplhist-noLU.20250716/0101-01-01-00000
     ./xmlchange RUN_REFCASE=n1850.ne30_tn14.nor3_b01-cplhist-noLU.20250716
     ./xmlchange RUN_REFDATE=0101-01-01
     ./xmlchange RUN_STARTDATE=0101-01-01
     ./xmlchange STOP_N=5
     ./xmlchange STOP_OPTION=nyears
     ./xmlchange REST_OPTION=nyears
     ./xmlchange REST_N=1
     ./xmlchange GLC_AVG_PERIOD=yearly

    ./case.setup
    ./case.build

  4. Set cpl, cam, and clm namelists as in https://github.com/NorESMhub/noresm3_dev_simulations/issues/194. Set the following cism namelist:

     #CISM-Greenland-only
     cisminputfile = '/cluster/projects/nn11022k/mpet/dataset/new_cism_grids/gris/inputfiles/Greenland_4km.init.c27022025.nc'
     nsn=721
     ewn=421
     adjust_input_thickness = .false.
     bmlt_float = 6
     bmlt_float_thermal_forcing_param = 0
     bmlt_float_ismip6_magnitude = 1
     isostasy = 0
     limit_marine_cliffs = .false.
     marine_margin = 1
     calving_minthck = 100.
     calving_timescale = 1
     ocean_data_domain = 2
     ocean_data_extrapolate = 1
     remove_icebergs = .true.
     remove_isthmuses = .false.
     flow_factor_float = 1.0
     gamma0 = 0
     block_inception = .true.
     force_retreat = 1
     restart = 0
     nzocn = 30
     dzocn = 60.
     esm_history_vars = "smb artm thk usurf topg uvel vvel temp bmlt bwat beta_internal floating_mask grounded_mask bpmp acab_applied bmlt_applied calving_rate iareaf iareag imass imass_above_flotation total_smb_flux total_bmb_flux total_calving_flux total_gl_flux ice_sheet_mask ice_cap_mask thermal_forcing thermal_forcing_lsrf"
     dt = 0.1
     dt_diag = 0.1

Case folder: /cluster/projects/nn11022k/mpet/cmip7_testrestart_grisonly/n1850.ne30_tn14_gl4_testrestart1
Output: /cluster/work/users/mpet/noresm/n1850.ne30_tn14_gl4_testrestart1/run

@hgoelzer @mvdebolskiy @mvertens @gold2718

mpetrini-norce (Aug 04 '25 13:08)

Hi @mpetrini-norce, the error is coming from the land methane conservation check. However, we've had this type of error before (with the extreme aerosol burst bug), and then it really had nothing to do with methane conservation: the methane conservation code is just the first to complain about nonsensical values (like negative forcing, etc.). That doesn't mean that I know what is causing this, but it may in fact be CISM even if the error messages don't point that way...

maritsandstad (Aug 04 '25 14:08)

@mpetrini-norce I do not have access to your case folder; however, judging by the logs, the model has run for 1 year, written a restart file, and restarted.

Can you dump the CaseStatus here?

mvdebolskiy (Aug 06 '25 12:08)

Thanks @maritsandstad and @mvdebolskiy for the replies. I've copied the case folder here: /cluster/projects/nn9560k/mpet/n1850.ne30_tn14_gl4_testrestart1. Below is the CaseStatus:

2025-07-31 13:57:19: xmlchange success ./xmlchange --force BLOM_OUTPUT_SIZE=spinup
2025-07-31 13:57:19: xmlchange success ./xmlchange --force HAMOCC_OUTPUT_SIZE=spinup
2025-07-31 13:57:20: xmlchange success ./xmlchange RUN_TYPE=hybrid
2025-07-31 13:57:20: xmlchange success ./xmlchange RUN_REFDIR=/cluster/projects/nn11022k/mpet/cmip7_testrestart_grisonly/restarts/n1850.ne30_tn14.nor3_b01-cplhist-noLU.20250716/0101-01-01-00000
2025-07-31 13:57:20: xmlchange success ./xmlchange RUN_REFCASE=n1850.ne30_tn14.nor3_b01-cplhist-noLU.20250716
2025-07-31 13:57:21: xmlchange success ./xmlchange RUN_REFDATE=0101-01-01
2025-07-31 13:57:21: xmlchange success ./xmlchange RUN_STARTDATE=0101-01-01
2025-07-31 13:57:21: xmlchange success ./xmlchange STOP_N=5
2025-07-31 13:57:22: xmlchange success ./xmlchange STOP_OPTION=nyears
2025-07-31 13:57:22: xmlchange success ./xmlchange REST_OPTION=nyears
2025-07-31 13:57:22: xmlchange success ./xmlchange REST_N=1
2025-07-31 13:57:23: xmlchange success ./xmlchange CONTINUE_RUN=FALSE
2025-07-31 13:57:23: xmlchange success ./xmlchange --subgroup case.st_archive JOB_WALLCLOCK_TIME=03:00:00
2025-07-31 13:57:23: xmlchange success ./xmlchange GLC_AVG_PERIOD=yearly
2025-07-31 13:57:24: xmlchange success ./xmlchange BUDGETS=TRUE
2025-07-31 13:57:24: case.setup starting
2025-07-31 13:57:35: case.setup success
2025-07-31 13:57:35: local git repository created
2025-07-31 13:59:23: case.build starting
2025-07-31 14:14:27: case.build success
2025-07-31 15:00:56: case.submit starting 1182517
2025-07-31 15:00:57: case.submit success 1182517
2025-07-31 15:01:03: case.run starting 1182515
2025-07-31 15:01:14: model execution starting 1182515
2025-07-31 19:36:58: model execution error ERROR: Command: 'srun --kill-on-bad-exit --label /cluster/work/users/mpet/noresm/n1850.ne30_tn14_gl4_testrestart1/bld/cesm.exe >> cesm.log.$LID 2>&1 ' failed with error '' from dir '/cluster/work/users/mpet/noresm/n1850.ne30_tn14_gl4_testrestart1/run'
2025-07-31 19:36:59: case.run error ERROR: RUN FAIL: Command 'srun --kill-on-bad-exit --label /cluster/work/users/mpet/noresm/n1850.ne30_tn14_gl4_testrestart1/bld/cesm.exe >> cesm.log.$LID 2>&1 ' failed See log file for details: /cluster/work/users/mpet/noresm/n1850.ne30_tn14_gl4_testrestart1/run/cesm.log.1182515.250731-150103

Here I tried to manually restart the run, but it failed with the same error message. So you're right Matvey, it crashes after restarting; I'll correct the description above.

2025-07-31 20:24:53: xmlchange success ./xmlchange CONTINUE_RUN=TRUE
2025-07-31 20:24:56: case.submit starting
2025-07-31 20:24:56: case.submit error ERROR: CONTINUE_RUN is true but this case does not appear to have restart files staged in /cluster/work/users/mpet/noresm/n1850.ne30_tn14_gl4_testrestart1/run rpointer.cpl
2025-07-31 20:53:13: xmlchange success ./xmlchange CONTINUE_RUN=TRUE
2025-07-31 20:53:39: case.submit starting 1182648
2025-07-31 20:53:40: case.submit success 1182648
2025-07-31 20:53:45: case.run starting 1182646
2025-07-31 20:53:59: model execution starting 1182646
2025-07-31 20:58:00: model execution error ERROR: Command: 'srun --kill-on-bad-exit --label /cluster/work/users/mpet/noresm/n1850.ne30_tn14_gl4_testrestart1/bld/cesm.exe >> cesm.log.$LID 2>&1 ' failed with error '' from dir '/cluster/work/users/mpet/noresm/n1850.ne30_tn14_gl4_testrestart1/run'

mpetrini-norce (Aug 06 '25 13:08)

Update on this bug: I found a temporary fix.

The model completed a 5-year run with GrIS active and methane turned off (use_lch4 = .false.); however, the same crash occurs when methane is subsequently turned on in another hybrid run restarting from GrIS_active-methane_off. @mvdebolskiy noted that the conservation errors become smaller when running for a longer time (15 years) with methane turned off, so one option could be to extend the GrIS_active-methane_off run and see if at some point the error goes away (steps sketched below).
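If we go that route, extending the methane-off run would just mean bumping the stop settings before resubmitting, along these lines (a sketch reusing the same xmlchange workflow as above; the STOP_N value here is only an example):

    ./xmlchange CONTINUE_RUN=TRUE
    ./xmlchange STOP_OPTION=nyears
    ./xmlchange STOP_N=10
    ./case.submit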

Another strategy that is working for now, but is probably not ideal in the long term, is to use a patch similar to the one for lake area changes (see https://github.com/ESCOMP/CTSM/issues/43): that is, to skip the methane conservation check when dynamic glaciers are on and we are at the beginning of a year or at the beginning of a simulation (code below, inserted around line 2327 in biogeochem/ch4Mod.F90):

[Image: screenshot of the patch to biogeochem/ch4Mod.F90]
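Since the patch itself is attached as a screenshot, the guard has roughly the following shape (a sketch only, not the exact patch; names such as dyn_glacier_active, is_beg_curr_year, is_first_step, errch4_grc, and ch4_tolerance are placeholders that may differ from the real ch4Mod.F90 code):

     ! Sketch: skip the gridcell-level CH4 conservation check when dynamic
     ! glaciers are active and column weights have just changed (beginning of
     ! a year or of the simulation), since the balance cannot close then.
     if (dyn_glacier_active .and. (is_beg_curr_year() .or. is_first_step())) then
        ! non-conservation is expected right after glacier area updates: do nothing
     else if (abs(errch4_grc(g)) > ch4_tolerance) then
        write(iulog,*) 'Gridcell-level CH4 Conservation Error in CH4Mod driver'
        call endrun(msg='Methane conservation error'//errMsg(sourcefile, __LINE__))
     end if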

With this patch, the model could complete a 5-year run with GrIS active and methane turned on, restarting from the n1850.ne30_tn14.nor3_b01-cplhist-noLU.20250716 run (https://github.com/NorESMhub/noresm3_dev_simulations/issues/194). More discussion will follow to understand whether we can find a cleaner fix (@mvdebolskiy will keep looking into that) and/or whether this patch is acceptable, as it was for the lake area changes case.

mpetrini-norce (Aug 22 '25 11:08)