Problem with DMS couple flux when restarting NorESM2.3 with 1/4 BLOM.
Describe the bug: When restarting NorESM2.3 (noresm2_3_develop, 1/4-degree BLOM) from NorESM2.0.x restart files, the run aborts with NaN values reported in the DMS flux field Faoo_fdms_ocn exported from the ocean component.
- NorESM version: noresm2_3_develop (commits: #8e716d6)
- HPC platform: Betzy
- Compiler (if applicable): intel-compilers/2022.1.0, OpenMPI/4.1.4-intel-compilers-2022.1.0, iomkl/2022a, CMake/3.23.1-GCCc
- Compset (if applicable): NHISTfrc2
- Resolution (if applicable): f09_tn0254
- Error message (if applicable):
Opened existing file NHISTfrc2_OC25_20200107_hfreq.cam.rs.1990-01-01-00000.nc
2031616
ERROR:
component_mod:check_fields NaN found in OCN instance: 1 field Faoo_fdms_ocn
1d global index: 1541736
ERROR:
component_mod:check_fields NaN found in OCN instance: 1 field Faoo_fdms_ocn
1d global index: 1468295
ERROR:
component_mod:check_fields NaN found in OCN instance: 1 field Faoo_fdms_ocn
1d global index: 1509722
...
--------------------------------------------------------------------------
Image PC Routine Line Source
cesm.exe 000000000297C3A7 Unknown Unknown Unknown
cesm.exe 0000000002601FCD shr_abort_mod_mp_ 114 shr_abort_mod.F90
cesm.exe 000000000043BF0D component_type_mo 257 component_type_mod.F90
cesm.exe 0000000000437ABD component_mod_mp_ 731 component_mod.F90
cesm.exe 000000000041D1D2 cime_comp_mod_mp_ 3433 cime_comp_mod.F90
cesm.exe 00000000004372F7 MAIN__ 125 cime_driver.F90
cesm.exe 0000000000419622 Unknown Unknown Unknown
libc.so.6 000014D86BA3FEB0 Unknown Unknown Unknown
libc.so.6 000014D86BA3FF60 __libc_start_main Unknown Unknown
cesm.exe 0000000000419525 Unknown Unknown Unknown
--------------------------------------------------------------------------
To reproduce:
- set up a case with noresm2_3_develop (commits: #8e716d6), with the above-mentioned compset and resolution
- restart from the NorESM2.0.x case restart files: /nird/projects/NS9560K/noresm/cases/NHISTfrc2_OC25_20200107_hfreq/rest/1990-01-01-00000/
- the error shown above occurs during the restart
Expected behavior: The model should restart with the same atmospheric and ocean model resolutions.
Screenshots: see the error message above.
Additional context: I have made the following attempts to debug (a small check script is sketched after this list):
- All the values in Faoo_fdms_ocn seem OK, either 0 or of order 1e-18.
- The metadata of Faoo_fdms_ocn also seems OK: x2a_Faoo_fdms_ocn:_FillValue = 1.e+36.
- I wrote a separate Fortran script to check the values with the intrinsic function 'ieee_is_nan'; the values are OK.
- I set all values of x2a_Faoo_fdms_ocn to zero in the restart files, but the model reports the same error.
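For reference, a minimal Python sketch of the same kind of NaN check on the coupler restart field (not the Fortran script mentioned above; it assumes the netCDF4 and numpy packages, and the restart file name below is a hypothetical example to be replaced with the actual coupler restart of the case):

# Scan a coupler restart field for NaN/Inf values (sketch).
import numpy as np
from netCDF4 import Dataset

path = "NHISTfrc2_OC25_20200107_hfreq.cpl.r.1990-01-01-00000.nc"  # hypothetical file name
with Dataset(path) as ds:
    ds.set_auto_mask(False)                       # keep raw values, incl. the 1.e+36 fill value
    data = ds.variables["x2a_Faoo_fdms_ocn"][:]
    nan_count = int(np.isnan(data).sum())
    inf_count = int(np.isinf(data).sum())
    print(f"points: {data.size}, NaN: {nan_count}, Inf: {inf_count}")
    valid = data[np.isfinite(data) & (data < 1.0e30)]   # drop fill values for the range check
    if valid.size:
        print("min/max of valid values:", valid.min(), valid.max())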
The error is raised in these lines of cime/src/drivers/mct/main/component_type_mod.F90:
if(any(shr_infnan_isnan(comp%c2x_cc%rattr))) then
   do fld=1,nflds
      do n=1,lsize
         if(shr_infnan_isnan(comp%c2x_cc%rattr(fld,n))) then
            call mpi_comm_rank(comp%mpicom_compid, rank, ierr)
            call mct_gsMap_orderedPoints(comp%gsmap_cc, rank, gpts)
            write(msg,'(a,a,a,i4,a,a,a,i8)')'component_mod:check_fields NaN found in ',trim(comp%name),' instance: ',&
                 comp_index,' field ',trim(mct_avect_getRList2c(fld, comp%c2x_cc)), ' 1d global index: ',gpts(n)
            call shr_sys_abort(msg)
         endif
      enddo
   enddo
endif
endif
One speculation is that it might be related to the ocean-atmosphere mapping file.
The reported '1d global index' actually refers to the ocean grid, so the index can be much larger than the size of the field in the coupler restart file (Faoo_fdms_ocn has ~55k points, while the reported OCN instance global index is ~1500k). Could slight changes of the ocean-atmosphere mapping files (in BLOM) between different NorESM2.x versions lead to the problem?
@YanchunHe - I looked up commit #8e716d6; it updates BLOM to v1.6.6.
We made some changes to the BLOM default settings for the v1.6 tag, but in principle we are still able to reproduce CMIP6 runs bfb by including some namelist settings and changing units from MKS to CGS, see the wiki page:
https://github.com/NorESMhub/BLOM/wiki/new-BLOM-with-CMIP6-settings
This is probably not relevant for the bug you discovered, but you might need to do something similar to what is described in the BLOM wiki if you continue from a NorESM2 spinup.
Thank you, Tomas!
I manually changed the Externals entry so that BLOM points to v1.6.7.
Here the issue is not to reproduce CMIP6 experiments, but to re-run some CMIP6-type experiments, and we don't need them to be bit identical.
The problem here is that the model cannot restart from previous restart files. I guess the BLOM unit system does not matter, right?
@DirkOlivie mentioned he had a similar problem when setting up some experiments with NorESM2.0.x too.
@YanchunHe Have you tried different restart files, just to make sure that this particular set of restart files is not corrupted (the run seems to be from 2020, so the files have been copied back and forth probably)?
Hi @YanchunHe , I lately ran a couple of times into issues (with master) for reasons of units or wrong parameter values (for the units and coordinate system used). I am not entirely sure if this also holds for the v1.6.x versions, though (but I would guess so), so I feel it is worth checking for these things. MKS versus CGS can be figured out by checking sigma(r) in the restart files: if it is greater than 1, it is MKS, otherwise CGS. Isopycnic versus hybrid: check the number of vertical layers - 53 is likely isopycnic, 56 is likely hybrid (a small inspection sketch follows below). Then, for isopycnic runs, one has to adjust the parameters as Tomas already suggested (to be put in user_nl_blom) - note the different parameter values for different grid resolutions.
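A minimal Python sketch of such an inspection, assuming the restart is a netCDF file containing a variable named sigma (the exact variable and dimension names may differ between BLOM versions, so adjust them after looking at the printed lists):

# Inspect a BLOM restart: vertical dimension sizes and the range of sigma.
from netCDF4 import Dataset

path = "blom_restart.nc"   # hypothetical path to the BLOM restart file
with Dataset(path) as ds:
    print("dimensions:", {name: len(dim) for name, dim in ds.dimensions.items()})
    if "sigma" in ds.variables:
        sigma = ds.variables["sigma"][:]
        print("sigma min/max:", float(sigma.min()), float(sigma.max()))
        # > 1 suggests MKS, otherwise CGS (per the comment above);
        # a vertical size of 53 suggests isopycnic, 56 suggests hybrid.
    else:
        print("variables:", list(ds.variables))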
@TomasTorsvik , can we rule out via the regression testing that DMS fluxes have deteriorated (i.e., are the fluxes tested by default in the regression testing)? (I am asking since I recently had some issues with the NOy and NHx fluxes that seem to have persisted over quite some time - and Dirk also recently reported some issues with DMS fluxes, although I am not aware of the details.)
@jmaerz , @DirkOlivie , @gold2718 , @YanchunHe
For NorESM2.0/2.1/2.3 we are testing with DMS fluxes enabled.
For noresm2_3_develop we have
<alias>N1850frc2</alias>
<lname>1850_CAM60%NORESM%FRC2_CLM50%BGC-CROP_CICE%NORESM-CMIP6_BLOM%ECO_MOSART_SGLC_SWAV_BGC%BDRDDMS</lname>
The oldest baseline run is for 2.1.1 from February 2024. We also have some more recent baseline runs for 2.3 alpha tags.
It seems that we are not running with DMS fluxes by default for NorESM2.5/3 testing.
For noresm_develop we have
<alias>N1850</alias>
<lname>1850_CAM70%LT%NORESM%CAMoslo_CLM60%SP_CICE_BLOM%HYB%ECO_MOSART_DGLC%NOEVOLVE_SWAV_SESP</lname>
Is there any reason why we should not run with DMS for NorESM2.5/3 testing?
DMS fluxes are always on, there is no option to turn them off as far as I am aware.
@JorgSchwinger - DMS fluxes are always on in iHAMOCC, but I thought you needed the compset setting BGC%BDRDDMS to tell the mediator to transfer DMS fluxes to the atmosphere. Is this not the case?
Ok, yes that is probably correct (but I wouldn't know if there is something similar with the new mediator)
It seems there is no option BGC%BDRDDMS available for NorESM2.5/3.
@gold2718 - do you know, are DMS fluxes transferred by default without the additional BGC setting for NorESM2.5/3?
The new method is that if the NUOPC configuration parameter flds_dms is "on", then CAM will use that field as its DMS source. There is no longer a namelist switch in CAM to do this.
In turn, that NUOPC parameter is set if "ecosys" is in case.get_value("BLOM_TRACER_MODULES"), so as long as iHAMOCC is part of the BLOM configuration, the atmosphere will expect DMS from the ocean.
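A quick way to check this for an existing case should be to query the BLOM tracer modules from the case directory (using the standard CIME xmlquery tool, as also used further down in this thread):
./xmlquery BLOM_TRACER_MODULES
If the returned list contains ecosys, the mediator is expected to set dms_from_ocn to "on" and pass DMS from BLOM to CAM.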
I had a similar problem when I was running NorESM2.0.2 (with updated CIME to be able to run on Betzy) two weeks ago: an N1850 simulation (2x2 degrees in the atmosphere) stopped after 3 model hours (6 atmospheric time steps) with a similar error message:
ERROR:
component_mod:check_fields NaN found in OCN instance: 1 field Faoo_fdms_ocn
1d global index: 60701
It may be worth mentioning that the DMS flux issue Dirk mentioned seems not to appear when using the current master of BLOM (I recently used Dirk's setup and plugged in master, where this issue didn't appear) - but that is run with the MCT coupler, as far as I understand.
I have not seen these errors when running system tests. I'm now running tests for noresm2_3_alpha03 against the noresm2.1.1 baseline; it is taking a while to get these jobs started.
@DirkOlivie - is the error you found reproducible? Did you do anything to mitigate/bypass the error?
@TomasTorsvik
The error was reproducible. By commenting out the call shr_sys_abort(msg) (see Yanchun's code excerpt above), the model could bypass the check, but it still stopped after around half a month of simulation (801 atmospheric time steps).
@YanchunHe - I looked up commit #8e716d6; it updates BLOM to v1.6.6. We made some changes to the BLOM default settings for the v1.6 tag, but in principle we are still able to reproduce CMIP6 runs bfb by including some namelist settings and changing units from MKS to CGS, see the wiki page: https://github.com/NorESMhub/BLOM/wiki/new-BLOM-with-CMIP6-settings Probably this is not relevant for the bug you discovered, but you might need to do something similar to what is described in the BLOM wiki if you continue from a NorESM2 spinup.
again, I will rerun the case with 'cgs' on, and see if it helps.
@YanchunHe Have you tried different restart files, just to make sure that this particular set of restart files is not corrupted (the run seems to be from 2020, so the files have been copied back and forth probably)?
Thanks, and yes, I tried to restart from different years. The same error.
Hi @YanchunHe , I lately ran a couple of times into issues (with master) for reasons of units or wrong parameter values (for the units and coordinate system used). I am not entirely sure if this also holds for the v1.6.x versions, though (but I would guess so), so I feel it is worth checking for these things. MKS versus CGS can be figured out by checking sigma(r) in the restart files: if it is greater than 1, it is MKS, otherwise CGS. Isopycnic versus hybrid: check the number of vertical layers - 53 is likely isopycnic, 56 is likely hybrid. Then, for isopycnic runs, one has to adjust the parameters as Tomas already suggested (to be put in user_nl_blom) - note the different parameter values for different grid resolutions.
How can I check whether it is using isopycnal or hybrid coordinates before any output data are written?
NorESM2.3 is by default isopycnal, right?
@YanchunHe , yes, 2.3 should be isopycnal. You can try:
./xmlquery BLOM_VCOORD
in your case directory, which should result in:
BLOM_VCOORD: isopyc_bulkml
(otherwise you can set it, see below). The CMIP6 input data you're using should be isopycnic.
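If it is not set to isopyc_bulkml, my understanding is that the standard CIME xmlchange command from the case directory should do it (untested sketch):
./xmlchange BLOM_VCOORD=isopyc_bulkml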
I tried with the latest release of noresm2_3_develop (alpha03, compared to the alpha01 I used before, with some reported issues fixed, e.g., BLOM #474, air-land masks, ocean topography, etc.), and the restart now seems to work.
I also made some changes in the user_nl_* files to reduce the output, as I mainly need some daily ocean transport fields. It seems these changes also have an impact on the crash of the model, although I did not expect that. I will need to check this further later.
I will update it here later.
Thanks all!
@YanchunHe, when you say it impacts the crash, is the crash always similar to what is posted in the issue header (i.e., NaNs found in a field)?
Ok, yes that is probably correct (but I wouldn't know if there is something similar with the new mediator)
@JorgSchwinger - the way the new nuopc mediator works with DMS is as follows:
- There is a driver config variable, flds_dms, that is in nuopc.runconfig and is available to all components. As a result both CAM and BLOM will know if DMS is passed from BLOM to CAM. This config variable is set as follows:
- flds_dms is defined in components/cmeps/cime_config/namelist_definition_drv.xml:
<entry id="flds_dms">
<type>logical</type>
<category>flds</category>
<group>ALLCOMP_attributes</group>
<desc>
Pass DMS from OCN to ATM component
</desc>
<values>`
<value>.false.</value>`
<value dms_from_ocn="on">.true.</value>
</values>
</entry>
- components/cmeps/cime_config/buildnml sets the config variable dms_from_ocn as follows:
if config["COMP_OCN"] == "blom":
    if "ecosys" in case.get_value("BLOM_TRACER_MODULES"):
        config["dms_from_ocn"] = "on"
    else:
        config["dms_from_ocn"] = "off"
- both CAM and BLOM then have access to flds_dms
- in BLOM it is set in ocn_comp_nuopc.F90 as follows:
! Determine if will export dms
call NUOPC_CompAttributeGet(gcomp, name='flds_dms', value=cvalue, &
     ispresent=ispresent, isset=isset, rc=rc)
if (ChkErr(rc, __LINE__, u_FILE_u)) return
if (isPresent .and. isSet) then
   read(cvalue,*) flds_dms
   if (.not. hamocc_defined) then
      ! if not defined HAMOCC and request to export dms, abort
      if (flds_dms) then
         write(lp,'(a)') subname//' cannot export dms with out HAMOCC defined'
         call xchalt(subname)
         stop subname
      end if
   end if
else
   flds_dms = .false.
end if
write(msg,'(a,l1)') subname//': export dms ', flds_dms
call blom_logwrite(msg)
- in CAM it is used in atm_import_export as follows:
call NUOPC_CompAttributeGet(gcomp, name='flds_dms', value=cvalue, ispresent=ispresent, isset=isset, rc=rc)
if (ChkErr(rc,__LINE__,u_FILE_u)) return
if (ispresent .and. isset) then
   read(cvalue,*) dms_from_ocn
else
   dms_from_ocn = .false.
end if
if (masterproc) write(iulog,'(a,l)') trim(subname)//'dms_from_ocn = ',dms_from_ocn
write(6,'(a,l)')trim(subname)//'dms_from_ocn = ',dms_from_ocn
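As a practical cross-check of what a particular case ended up with, a minimal Python sketch that greps the generated nuopc.runconfig for the relevant attributes (it assumes nuopc.runconfig is a plain-text file, and the path below must be adjusted to the run directory of the case):

# Print any lines mentioning flds_dms or dms_from_ocn from a case's nuopc.runconfig.
from pathlib import Path

runconfig = Path("nuopc.runconfig")  # adjust to the run directory of the case
for line in runconfig.read_text().splitlines():
    if "flds_dms" in line or "dms_from_ocn" in line:
        print(line.strip())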
Should this be documented somewhere to clarify how optional variable transfer between components occurs with CMEPS?
@YanchunHe, when you say it impacts the crash, is the crash always similar to what is posted in the issue header (i.e., NaNs found in a field)?
Yes, it is the same, NaN found in Fa2x_dms_xxx.
I have some work at hand right now, but will return to this soon.
Thanks a lot!
It turns out that changing the output frequency of the vertical mass flux in BLOM causes the problem.
If I add in user_nl_blom:
LYR_WFLX = 4, 4, 0
I get the 'NaN found in Fa2x_dms_xxx' error!
By default, LYR_WFLX has monthly output; here I want to turn on daily output.
This worked (with LYR_WFLX on) for the 1-deg ocean version (NorESM2-MM) of noresm2.0.8.
So it sounds like this is not related to the DMS flux itself. I am not sure how this affects restarting the DMS from the restart file.
@YanchunHe, is this variable saved in the diagnostic file or the restart file?
It is not in the restart file. But by default there is monthly output for it if we don't change it in user_nl_blom (i.e., keep LYR_WFLX = 0, 4, 0).
@YanchunHe I am testing this setup and will update you.
Are you saving the daily LYR_WFLX in monthly files (as is the standard)? At 1/4 degree resolution that must be a huge file. Are you using PNETCDF? Otherwise everything would be gathered on one single processor, and that could be an issue memory-wise.
Yes, we should save the transport field to diagnose the water parcel trajectories. WFLX can be an output, but it is not a must. It is indeed quite large: a single daily variable in a monthly file is about 11 GB, and we have several such variables. But it looks like the model and storage can handle this.
Just related to the daily files, if you want to avoid such large files: You could change the output settings to
GLB_FNAMETAG = 'hd','hm','hy'
GLB_AVEPERIO = 1, 30, 365
GLB_FILEFREQ = 1, 30, 365
(instead of using GLB_FILEFREQ = 30, 30, 365). This should write the daily output in daily files (although I have not used this setting myself).