NorESM alpha02 not reproducible
Created two cases with compset NF2000 and resolution ne30pg3_ne30pg3_mtn14 at tag noresm3_0_alpha02, using:
../cime/scripts/create_newcase --case nf2000_ne30pg3_ne30pg3_mtn14.fatessp_A2T1 --compset 2000_CAM70%LT%NORESM%CAMoslo_CLM60%FATES-SP_CICE%PRES_DOCN%DOM_MOSART_DGLC%NOEVOLVE_SWAV --res ne30pg3_ne30pg3_mtn14 --project nn9560k --run-unsupported --mach betzy
The first case, nf2000_ne30pg3_ne30pg3_mtn14.fatessp_A2T2, was run for 1+1 years (with a restart), and the second, nf2000_ne30pg3_ne30pg3_mtn14.fatessp_A2T1, was run for 2 years continuously. I checked the *_in files and they seem to be the same. I then compared history output with ncdiff (see also the sketch below).
One month diff (bitwise identical):
ncdiff nf2000_ne30pg3_ne30pg3_mtn14.fatessp_A2T1/atm/hist/nf2000_ne30pg3_ne30pg3_mtn14.fatessp_A2T1.cam.h0a.0001-01.nc nf2000_ne30pg3_ne30pg3_mtn14.fatessp_A2T2/atm/hist/nf2000_ne30pg3_ne30pg3_mtn14.fatessp_A2T2.cam.h0a.0001-01.nc -o diff_twostream_m1.nc
Two year diff (not bitwise identical):
ncdiff nf2000_ne30pg3_ne30pg3_mtn14.fatessp_A2T1/atm/hist/nf2000_ne30pg3_ne30pg3_mtn14.fatessp_A2T1.cam.h0a.0002-12.nc nf2000_ne30pg3_ne30pg3_mtn14.fatessp_A2T2/atm/hist/nf2000_ne30pg3_ne30pg3_mtn14.fatessp_A2T2.cam.h0a.0002-12.nc -o diff_twostream.nc
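To locate the first month where the two cases diverge, something like the following loop could be used. This is a minimal sketch, assuming the archive layout listed below and a cprnc executable in PATH; cprnc prints an "RMS <field>" line for each field with non-zero differences, as in the excerpts later in this thread.

```bash
#!/bin/bash
# Sketch only: scan monthly CAM history files and report how many fields differ.
ARCHIVE=/cluster/work/users/agu002/archive
C1=nf2000_ne30pg3_ne30pg3_mtn14.fatessp_A2T1
C2=nf2000_ne30pg3_ne30pg3_mtn14.fatessp_A2T2

for yr in 0001 0002; do
  for mo in 01 02 03 04 05 06 07 08 09 10 11 12; do
    f1=$ARCHIVE/$C1/atm/hist/$C1.cam.h0a.$yr-$mo.nc
    f2=$ARCHIVE/$C2/atm/hist/$C2.cam.h0a.$yr-$mo.nc
    [ -f "$f1" ] && [ -f "$f2" ] || continue
    # cprnc prints an "RMS <field>" line only for fields that differ
    ndiff=$(cprnc "$f1" "$f2" | grep -c '^ *RMS ')
    echo "$yr-$mo: $ndiff fields with non-zero differences"
  done
done
```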
Case folders:
/cluster/work/users/agu002/NorESM3_alpha2/cases/nf2000_ne30pg3_ne30pg3_mtn14.fatessp_A2T1
/cluster/work/users/agu002/NorESM3_alpha2/cases/nf2000_ne30pg3_ne30pg3_mtn14.fatessp_A2T2
Output data folders:
/cluster/work/users/agu002/archive/nf2000_ne30pg3_ne30pg3_mtn14.fatessp_A2T1
/cluster/work/users/agu002/archive/nf2000_ne30pg3_ne30pg3_mtn14.fatessp_A2T2
diff_twostream.nc and diff_twostream_m1.nc location:
/cluster/work/users/agu002/archive/
It would be nice if someone could check whether I am doing something wrong.
user_nl_cam:
use_aerocom = .true.
interpolate_nlat = 192
interpolate_nlon = 288
interpolate_output = .true.
history_aerosol = .true.
zmconv_c0_lnd = 0.0075D0
zmconv_c0_ocn = 0.0300D0
zmconv_ke = 5.0E-6
zmconv_ke_lnd = 1.0E-5
clim_modal_aero_top_press = 1.D-4
bndtvg = '/cluster/shared/noresm/inputdata/atm/cam/ggas/noaamisc.r8.nc'
micro_mg_dcs = 600.D-6
@monsieuralok - thanks for raising this issue. We are not testing 1+1 year restart as part of the standard test procedure, so it is possible that this error has gone undetected for some time.
@TomasTorsvik Today I have checked a CESM compset with alpha02, 2000_CAM70%LT_CLM60%SP_CICE%PRES_DOCN%DOM_MOSART_SGLC_SWAV, which seems to be bitwise reproducible for a 1+1 year run.
@monsieuralok - thanks for checking! So, does this mean that the error is connected either to CAMoslo or to FATES?
PASS ERS_Ld766.ne30pg3_tn14.N1850fates-nocomp.betzy_gnu CREATE_NEWCASE
PASS ERS_Ld766.ne30pg3_tn14.N1850fates-nocomp.betzy_gnu XML
PASS ERS_Ld766.ne30pg3_tn14.N1850fates-nocomp.betzy_gnu SETUP
PASS ERS_Ld766.ne30pg3_tn14.N1850fates-nocomp.betzy_gnu SHAREDLIB_BUILD time=128
PASS ERS_Ld766.ne30pg3_tn14.N1850fates-nocomp.betzy_gnu MODEL_BUILD time=124
PASS ERS_Ld766.ne30pg3_tn14.N1850fates-nocomp.betzy_gnu SUBMIT
PASS ERS_Ld766.ne30pg3_tn14.N1850fates-nocomp.betzy_gnu RUN time=72020
FAIL ERS_Ld766.ne30pg3_tn14.N1850fates-nocomp.betzy_gnu COMPARE_base_rest
PASS ERS_Ld766.ne30pg3_tn14.N1850fates-nocomp.betzy_gnu MEMLEAK
PASS ERS_Ld766.ne30pg3_tn14.N1850fates-nocomp.betzy_gnu SHORT_TERM_ARCHIVER
Year boundary seems to break things.
PASS ERS_Ld766.ne30pg3_tn14.I2000Clm60Fates.betzy_gnu.clm-FatesColdNoComp CREATE_NEWCASE
PASS ERS_Ld766.ne30pg3_tn14.I2000Clm60Fates.betzy_gnu.clm-FatesColdNoComp XML
PASS ERS_Ld766.ne30pg3_tn14.I2000Clm60Fates.betzy_gnu.clm-FatesColdNoComp SETUP
PASS ERS_Ld766.ne30pg3_tn14.I2000Clm60Fates.betzy_gnu.clm-FatesColdNoComp SHAREDLIB_BUILD time=114
PASS ERS_Ld766.ne30pg3_tn14.I2000Clm60Fates.betzy_gnu.clm-FatesColdNoComp MODEL_BUILD time=28
PASS ERS_Ld766.ne30pg3_tn14.I2000Clm60Fates.betzy_gnu.clm-FatesColdNoComp SUBMIT
PASS ERS_Ld766.ne30pg3_tn14.I2000Clm60Fates.betzy_gnu.clm-FatesColdNoComp RUN time=7299
PASS ERS_Ld766.ne30pg3_tn14.I2000Clm60Fates.betzy_gnu.clm-FatesColdNoComp COMPARE_base_rest
PASS ERS_Ld766.ne30pg3_tn14.I2000Clm60Fates.betzy_gnu.clm-FatesColdNoComp MEMLEAK
PASS ERS_Ld766.ne30pg3_tn14.I2000Clm60Fates.betzy_gnu.clm-FatesColdNoComp SHORT_TERM_ARCHIVER
Land standalone passes.
PASS ERS_Ld766.ne30pg3_tn14.N1850.betzy_gnu CREATE_NEWCASE
PASS ERS_Ld766.ne30pg3_tn14.N1850.betzy_gnu XML
PASS ERS_Ld766.ne30pg3_tn14.N1850.betzy_gnu SETUP
PASS ERS_Ld766.ne30pg3_tn14.N1850.betzy_gnu SHAREDLIB_BUILD time=140
PASS ERS_Ld766.ne30pg3_tn14.N1850.betzy_gnu MODEL_BUILD time=125
PASS ERS_Ld766.ne30pg3_tn14.N1850.betzy_gnu SUBMIT
PASS ERS_Ld766.ne30pg3_tn14.N1850.betzy_gnu RUN time=68238
FAIL ERS_Ld766.ne30pg3_tn14.N1850.betzy_gnu COMPARE_base_rest
PASS ERS_Ld766.ne30pg3_tn14.N1850.betzy_gnu MEMLEAK
PASS ERS_Ld766.ne30pg3_tn14.N1850.betzy_gnu SHORT_TERM_ARCHIVER
Just confirmed that FATES is not the one causing the test failure.
Thanks @mvdebolskiy!
@mvdebolskiy - in looking at the cprnc output for your failed test ERS_Ld766.ne30pg3_tn14.N1850.betzy_gnu, I see the following (see below). This points to a non-bfb difference in diagnostic output that does not affect the other components.
ERS_Ld766.ne30pg3_tn14.N1850.betzy_gnu.20250418_121629_bhe3k0.blom.hbgcd.0003-02-05.nc.base.cprnc.out: of which 0 had non-zero differences
ERS_Ld766.ne30pg3_tn14.N1850.betzy_gnu.20250418_121629_bhe3k0.blom.hbgcm.0003-01.nc.base.cprnc.out: of which 0 had non-zero differences
**ERS_Ld766.ne30pg3_tn14.N1850.betzy_gnu.20250418_121629_bhe3k0.blom.hbgcy.0002.nc.base.cprnc.out: of which 6 had non-zero differences**
ERS_Ld766.ne30pg3_tn14.N1850.betzy_gnu.20250418_121629_bhe3k0.blom.hd.0003-02-05.nc.base.cprnc.out: of which 0 had non-zero differences
ERS_Ld766.ne30pg3_tn14.N1850.betzy_gnu.20250418_121629_bhe3k0.blom.hm.0003-01.nc.base.cprnc.out: of which 0 had non-zero differences
ERS_Ld766.ne30pg3_tn14.N1850.betzy_gnu.20250418_121629_bhe3k0.blom.hy.0002.nc.base.cprnc.out: of which 0 had non-zero differences
ERS_Ld766.ne30pg3_tn14.N1850.betzy_gnu.20250418_121629_bhe3k0.cam.h0a.0003-01.nc.base.cprnc.out: of which 0 had non-zero differences
ERS_Ld766.ne30pg3_tn14.N1850.betzy_gnu.20250418_121629_bhe3k0.cam.h0i.0003-02-01-00000.nc.base.cprnc.out: of which 0 had non-zero differences
ERS_Ld766.ne30pg3_tn14.N1850.betzy_gnu.20250418_121629_bhe3k0.cam.i.0003-01-01-00000.nc.base.cprnc.out: of which 0 had non-zero differences
ERS_Ld766.ne30pg3_tn14.N1850.betzy_gnu.20250418_121629_bhe3k0.cice.h.0003-01.nc.base.cprnc.out: of which 0 had non-zero differences
ERS_Ld766.ne30pg3_tn14.N1850.betzy_gnu.20250418_121629_bhe3k0.clm2.h0.0003-01.nc.base.cprnc.out: of which 0 had non-zero differences
ERS_Ld766.ne30pg3_tn14.N1850.betzy_gnu.20250418_121629_bhe3k0.cpl.hi.0003-02-06-00000.nc.base.cprnc.out: of which 0 had non-zero differences
ERS_Ld766.ne30pg3_tn14.N1850.betzy_gnu.20250418_121629_bhe3k0.mosart.h0.0003-01.nc.base.cprnc.out: of which 0 had non-zero differences
@mvdebolskiy - several points:
- One thing I notice with your test is that you are not writing restarts at the year boundary for Ld766. Your restart is at 0002-01-20-00000. So if the issue is restarting exactly at the year boundary, the above ERS_Ld766 test does not capture that problem.
- The N1850 compset does not use FATES - it is 1850_CAM70%LT%NORESM%CAMoslo_CLM60%SP_CICE_BLOM%HYB%ECO_MOSART_DGLC%NOEVOLVE_SWAV_SESP. So apart from the fill diffs in BLOM diagnostics, the results are bfb across the year boundary, as my tests for NF2000 show below.
- My comment about the year boundary restart also applies to the I compset test - ERS_Ld766.ne30pg3_tn14.I2000Clm60Fates.betzy_gnu.clm-FatesColdNoComp
I have done a simple test using the following (see below) and obtained bfb results with restarts across the year boundary
./create_newcase --case /cluster/home/mvertens/noresm/nf2000_restart_test --compset 2000_CAM70%LT%NORESM%CAMoslo_CLM60%SP_CICE%PRES_DOCN%DOM_MOSART_DGLC%NOEVOLVE_SWAV --res ne30pg3_ne30pg3_mtn14 --project nn9560k --run-unsupported --mach betzy
with the following changes
./xmlchange RUN_STARTDATE=0001-12-01
./xmlchange REST_N=1
./xmlchange REST_OPTION=nmonths
./xmlchange HIST_N=1
./xmlchange HIST_OPTION=ndays
and modified user_nl_cam to have
interpolate_nlat = 192
interpolate_nlon = 288
interpolate_output = .true.
!history_aerosol = .true.
zmconv_c0_lnd = 0.0075D0
zmconv_c0_ocn = 0.0300D0
zmconv_ke = 5.0E-6
zmconv_ke_lnd = 1.0E-5
clim_modal_aero_top_press = 1.D-4
bndtvg = '/cluster/shared/noresm/inputdata/atm/cam/ggas/noaamisc.r8.nc'
micro_mg_dcs = 600.D-6
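For completeness, a minimal sketch of the remaining case workflow for this setup; the STOP_N/STOP_OPTION values for a 14-month initial run are my assumption based on the description just below.

```bash
# Sketch only: finish setting up and submit the initial run.
cd /cluster/home/mvertens/noresm/nf2000_restart_test
./xmlchange STOP_OPTION=nmonths   # assumed: 14-month initial run
./xmlchange STOP_N=14
./case.setup
./case.build
./case.submit
```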
I have also verified that with these settings I can do a 14 month run and write restarts every month - and then selectively restart at the yearly boundary and the results are bfb with the initial run.
From the above - I do not think that oslo-aero is causing the restart problems that @monsieuralok observed.
@monsieuralok @mvdebolskiy - I cannot duplicate the problem that @monsieuralok found. I used the exact same setup and the only difference I had was use_aerocom = .false. - however, this is only a diagnostic and should not affect the answers. I also wrote coupler history files monthly. A less expensive way to do this test is to do a 13 month run and write restarts at the year boundary (I actually wrote restarts monthly just as a sanity check). Then simply do a 1 month restart run starting from the year boundary. Just set the following variables:
./xmlchange DRV_RESTART_POINTER=rpointer.cpl.0002-01-01-00000
./xmlchange CONTINUE_RUN=TRUE
./xmlchange STOP_N=1
./xmlchange STOP_OPTION=nmonths
My restart at the year boundary was bfb with the initial run.
The case directory is /cluster/work/users/mvertens/noresm/nf2000_restart_test3
The run directory is /cluster/work/users/mvertens/noresm/nf2000_restart_test3/run
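One way to confirm the bfb claim is to run cprnc on matching history files from the initial run and the restart run. A sketch, with illustrative file names (the output_init2 subdirectory holding the initial-run history is described in a later comment):

```bash
# Sketch: compare the first post-restart month of CAM history between the
# initial run (assumed to be kept under output_init2) and the restart run.
# File names are illustrative; adjust to the actual history file pattern.
RUN=/cluster/work/users/mvertens/noresm/nf2000_restart_test3/run
cprnc $RUN/output_init2/nf2000_restart_test3.cam.h0a.0002-01.nc \
      $RUN/nf2000_restart_test3.cam.h0a.0002-01.nc | tail -n 10
# cprnc ends with a summary saying whether the two files are identical or differ.
```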
@monsieuralok @mvdebolskiy - I have also now verified that with use_aerocom = .true. restarts are bfb across the year boundary using the exact same case that @monsieuralok ran.
@mvertens I have executed three tests, one fully coupled and two ATM-LAND only (one with a NorESM and one with a CESM compset), on Betzy:
PATH:- /cluster/work/users/agu002/NorESM3_alpha2/cases/
Fully coupled:-
couple_year1_1
couple_year2
CESM:-
ne30pg3_ne30pg3_mtn14_cesm1
ne30pg3_ne30pg3_mtn14_cesm2
ne30pg3_ne30pg3_mtn14_cesm3 (clone of ne30pg3_ne30pg3_mtn14_cesm1 to test clone functionality)
NorESM:-
nf2000_ne30pg3_ne30pg3_mtn14.fatessp_A2T1
nf2000_ne30pg3_ne30pg3_mtn14.fatessp_A2T2
Except for CESM, the other two show quite large differences.
data in archive folder:- /cluster/work/users/agu002/archive/
@mvertens do you have the data from both runs? In the run folder there is only the data from the last run, extended for 2 months. It could then be some random initialization issue that by chance I am facing in every run.
@monsieuralok : I am only comparing the two ATM-LAND runs at this point - not the fully coupled. So the comparison should be with your cases:
/cluster/work/users/agu002/NorESM3_alpha2/cases/nf2000_ne30pg3_ne30pg3_mtn14.fatessp_A2T1
/cluster/work/users/agu002/NorESM3_alpha2/cases/nf2000_ne30pg3_ne30pg3_mtn14.fatessp_A2T2
I am also using more processors than you did. In comparing the CaseDocs in /cluster/work/users/agu002/NorESM3_alpha2/cases/nf2000_ne30pg3_ne30pg3_mtn14.fatessp_A2T1 with my CaseDocs, I see no differences other than those from the PE count and other history/restart settings.
The data from my initial 14 month run is in
/cluster/work/users/mvertens/noresm/nf2000_restart_test3/run/output_init2
The data from the restart run starting at the year boundary is in /cluster/work/users/mvertens/noresm/nf2000_restart_test3/run/
Since your restarts differ immediately after the new year, I don't see a reason to do two 1 year runs. It's easier just to restart at the year boundary, which I outlined above.
@gold2718 @TomasTorsvik @matsbn As I mentioned I would, I tried CESM on 2700 procs, the same as NorESM; CESM is BFB reproducible.
@mvertens
This test fails /cluster/work/users/mdeb/noresm/ERS_Ld62.ne30pg3_tn14.NF2000.betzy_gnu.20250424_101807_c9pvsj
PASS ERS_Ld62.ne30pg3_tn14.NF2000.betzy_gnu CREATE_NEWCASE
PASS ERS_Ld62.ne30pg3_tn14.NF2000.betzy_gnu XML
PASS ERS_Ld62.ne30pg3_tn14.NF2000.betzy_gnu SETUP
PASS ERS_Ld62.ne30pg3_tn14.NF2000.betzy_gnu SHAREDLIB_BUILD time=130
PASS ERS_Ld62.ne30pg3_tn14.NF2000.betzy_gnu MODEL_BUILD time=121
PASS ERS_Ld62.ne30pg3_tn14.NF2000.betzy_gnu SUBMIT
PASS ERS_Ld62.ne30pg3_tn14.NF2000.betzy_gnu RUN time=9050
FAIL ERS_Ld62.ne30pg3_tn14.NF2000.betzy_gnu COMPARE_base_rest
PASS ERS_Ld62.ne30pg3_tn14.NF2000.betzy_gnu MEMLEAK
PASS ERS_Ld62.ne30pg3_tn14.NF2000.betzy_gnu SHORT_TERM_ARCHIVER
cprnc.out for cice.h.:
RMS time 5.0000E-01 NORMALIZED 1.1050E-02
RMS time_bounds 7.0711E-01 NORMALIZED 1.5627E-02
RMS snowfrac 1.2747E-03 NORMALIZED 1.0925E-02
RMS atmspd 9.6272E-02 NORMALIZED 1.2542E-02
RMS atmdir 2.3337E+00 NORMALIZED 1.4146E-02
RMS fswup 1.8311E+00 NORMALIZED 9.8734E-03
cprnc.out for cpl.hi:
RMS iceImp_Fioi_swpen_vdf 1.2344E-02 NORMALIZED 1.4856E-01
RMS iceImp_Fioi_swpen_vdr 8.4754E-03 NORMALIZED 7.1541E-01
This is with alpha03, using the default PE layouts.
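For reference, a sketch of how the differing fields above can be pulled out of the test's cprnc output, assuming the *.cprnc.out files are written to the test's run directory, as for the ERS_Ld766 test earlier in the thread:

```bash
# Sketch: summarize cprnc results for the failed ERS_Ld62 test.
cd /cluster/work/users/mdeb/noresm/ERS_Ld62.ne30pg3_tn14.NF2000.betzy_gnu.20250424_101807_c9pvsj/run
grep 'had non-zero differences' *.cprnc.out   # per-file counts of differing fields
grep '^ *RMS ' *.cprnc.out                    # the individual fields that differ
```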
@mvdebolskiy - did this test pass with alpha02a? Does the test with intel pass?
@mvertens judging by the fact it's CICE, I think 2_5_alpha09 and 3_0_alpha01 should be checked. Will try later.
@mvdebolskiy - the normal NF test uses the ice/ocean on the atm/lnd grid - ne30pg3_ne30pg3_mtn14, NOT ne30pg3_tn14. You can see this in all of the NF tests above.
I think it would be important to run the normal test first.
Oh, ok.
That said - it is interesting that the ne30pg3_tn14 does not pass restart. I would guess a very short test would show this as well. I'm not sure it's the highest priority right now to investigate this - but it would be good to track this down at some point.
@mvdebolskiy - ERS_Ld11.ne30pg3_ne30pg3_mtn14.NF2000.betzy_gnu passed with alpha03. I don't think the time length difference is a problem.
@mvertens I think month/year boundary is the issue. Since tests less than a month are always passing for all the versions and are in the test-suite.
@mvdebolskiy - that was my initial thought. But I have verified restarts are bfb when I restart from the year boundary - so I'm confused as to how this could be a month/year boundary issue.
The alpha03 tag is still not reproducible for some simulations:
Experiment directory:- /cluster/projects/nn9560k/alok/cases_noresm3
Experiments:-
nf2000_ne30pg3_ne30pg3_mtn14_T1 (1+1 year)
nf2000_ne30pg3_ne30pg3_mtn14_T3 (2 year simulation)
I get a restart error for noresm3_0_beta03b. I suspect it is the same as reported here, although my setup is a bit different. I run versions of the setup from noresm3_dev_simulations#244, except that I check out noresm3_0_beta03b instead of beta03a.
- case directory: /cluster/projects/nn9560k/tomast/NorESM/cases/
  - n1850.ne16_tn14.noresm3_0_beta03b-run1_yr001.20251021: 3 year run
  - n1850.ne16_tn14.noresm3_0_beta03b-run2_yr001.20251021: 1+1+1 year run
  - n1850.ne16_tn14.noresm3_0_beta03b-run3_yr002.20251021: branch from run2, 1 year --> 2 year run
  - n1850.ne16_tn14.noresm3_0_beta03b-run4_yr002.20251021: branch from run2, 1 year --> 1+1 year
- run directory: /cluster/work/users/tomast/noresm/<casename>
- archive directory: /cluster/work/users/tomast/archive/<casename>

- run1 and run3 have IDENTICAL output
- run1 and run2 have IDENTICAL output in years 1 and 2, but DIFFERENT output in year 3
- run1 and run4 have IDENTICAL output in years 1 and 2, but DIFFERENT output in year 3
- run2 and run4 have IDENTICAL output
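The IDENTICAL/DIFFERENT classifications above could be reproduced with cprnc on the archived history files. A sketch for run1 vs run2 in year 3; the history file pattern (cam.h0a.YYYY-MM.nc) is assumed from the cesm.log excerpts further down, so adjust as needed:

```bash
# Sketch: check where run1 and run2 diverge in year 3.
ARCH=/cluster/work/users/tomast/archive
R1=n1850.ne16_tn14.noresm3_0_beta03b-run1_yr001.20251021
R2=n1850.ne16_tn14.noresm3_0_beta03b-run2_yr001.20251021
for mo in 01 02 03 04 05 06 07 08 09 10 11 12; do
  n=$(cprnc $ARCH/$R1/atm/hist/$R1.cam.h0a.0003-$mo.nc \
            $ARCH/$R2/atm/hist/$R2.cam.h0a.0003-$mo.nc | grep -c '^ *RMS ')
  echo "0003-$mo: $n differing fields"
done
```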
Now I get the error without a restart as well, at the year boundary between year 1 -> 2.
- case directory: /cluster/projects/nn9560k/tomast/NorESM/cases/
  - n1850.ne16_tn14.noresm3_0_beta03b-run5_yr001.20251022: 2 year run

run1 and run5 have IDENTICAL output until 0001-12, but DIFFERENT output from 0002-01.
A curious difference between run1 and run5 is that the file handle id changes at the year boundary.
run1: cesm.log
1: Opened file
1: n1850.ne16_tn14.noresm3_0_beta03b-run1_yr001.20251021.cam.h1i.0001-12-28-00000.
1: nc to write 285
1: Opened file
1: n1850.ne16_tn14.noresm3_0_beta03b-run1_yr001.20251021.cam.h1a.0001-12-28-00000.
1: nc to write 286
0: max rss=996560896.0 MB
...
1: Opened file
1: n1850.ne16_tn14.noresm3_0_beta03b-run1_yr001.20251021.cam.h0i.0002-01-01-00000.
1: nc to write 289
1: Opened file
1: n1850.ne16_tn14.noresm3_0_beta03b-run1_yr001.20251021.cam.h0a.0001-12.nc
1: to write 290
1: Opened file
1: n1850.ne16_tn14.noresm3_0_beta03b-run1_yr001.20251021.cam.i.0002-01-01-00000.nc
1: to write 291
1: Opened file
1: n1850.ne16_tn14.noresm3_0_beta03b-run1_yr001.20251021.cam.r.0002-01-01-00000.nc
1: to write 292
1: Opening existing file
1: n1850.ne16_tn14.noresm3_0_beta03b-run1_yr001.20251021.cam.h1i.0001-12-28-00000.
1: nc 285
1: Opening existing file
1: n1850.ne16_tn14.noresm3_0_beta03b-run1_yr001.20251021.cam.h1a.0001-12-28-00000.
1: nc 286
1: Opened file
1: n1850.ne16_tn14.noresm3_0_beta03b-run1_yr001.20251021.cam.rs.0002-01-01-00000.n
1: c to write 295
0: max rss=996560896.0 MB
run5: cesm.log
1: Opened file
1: n1850.ne16_tn14.noresm3_0_beta03b-run5_yr001.20251022.cam.h1i.0001-12-28-00000.
1: nc to write 285
1: Opened file
1: n1850.ne16_tn14.noresm3_0_beta03b-run5_yr001.20251022.cam.h1a.0001-12-28-00000.
1: nc to write 286
0: max rss=992931840.0 MB
...
1: Opened file
1: n1850.ne16_tn14.noresm3_0_beta03b-run5_yr001.20251022.cam.h0i.0002-01-01-00000.
1: nc to write 288
1: Opened file
1: n1850.ne16_tn14.noresm3_0_beta03b-run5_yr001.20251022.cam.h0a.0001-12.nc
1: to write 289
1: Opened file
1: n1850.ne16_tn14.noresm3_0_beta03b-run5_yr001.20251022.cam.i.0002-01-01-00000.nc
1: to write 290
0: max rss=992931840.0 MB
I have not been able to reproduce the errors from the first 4 runs. I have now run a total of 14 runs, which all give the same output as run5. I suppose there could have been some Betzy system issue on 2025-10-21, since this was the date when the runs were inconsistent. However, there was no error message to indicate that something went wrong in these runs, which is worrying.
I created some difference files in
/cluster/work/users/tomast/archive/n1850.ne16_tn14.diff_run1_run5_yr2/
Looking at blom.hd.0002-01.nc, I see changes in SST from January 1. In blom.hm.0002-01, templvl shows differences at the surface but not at depth. So from this it looks like the ocean-atmosphere interface could be a possible source of errors.
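For reference, a sketch of how such a difference file can be built and inspected with NCO. Variable, case, and path names follow the description above; the ocn/hist archive layout and the "depth" dimension name for templvl are my assumptions.

```bash
# Sketch: difference run1 vs run5 for the BLOM monthly file and look at templvl.
ARCH=/cluster/work/users/tomast/archive
R1=n1850.ne16_tn14.noresm3_0_beta03b-run1_yr001.20251021
R5=n1850.ne16_tn14.noresm3_0_beta03b-run5_yr001.20251022
DIFF=$ARCH/n1850.ne16_tn14.diff_run1_run5_yr2
mkdir -p $DIFF
ncdiff $ARCH/$R1/ocn/hist/$R1.blom.hm.0002-01.nc \
       $ARCH/$R5/ocn/hist/$R5.blom.hm.0002-01.nc \
       $DIFF/diff.blom.hm.0002-01.nc
# Inspect the temperature difference in the top layer vs a deep layer
# (dimension name "depth" is assumed; check with ncdump -h first).
ncks -H -v templvl -d depth,0  $DIFF/diff.blom.hm.0002-01.nc | head
ncks -H -v templvl -d depth,30 $DIFF/diff.blom.hm.0002-01.nc | head
```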