Problem with DMS coupled flux when restarting NorESM2.3 with 1/4-degree BLOM.

Open YanchunHe opened this issue 10 months ago • 34 comments

Describe the bug

  • NorESM version: noresm2_3_develop (commits: #8e716d6)
  • HPC platform: Betzy
  • Compiler (if applicable): intel-compilers/2022.1.0, OpenMPI/4.1.4-intel-compilers-2022.1.0, iomkl/2022a, CMake/3.23.1-GCCc
  • Compset (if applicable): NHISTfrc2
  • Resolution (if applicable): f09_tn0254
  • Error message (if applicable):
Opened existing file NHISTfrc2_OC25_20200107_hfreq.cam.rs.1990-01-01-00000.nc
     2031616
 ERROR:
 component_mod:check_fields NaN found in OCN instance:    1 field Faoo_fdms_ocn
 1d global index:  1541736
 ERROR:
 component_mod:check_fields NaN found in OCN instance:    1 field Faoo_fdms_ocn
 1d global index:  1468295
 ERROR:
 component_mod:check_fields NaN found in OCN instance:    1 field Faoo_fdms_ocn
 1d global index:  1509722
...
--------------------------------------------------------------------------
Image              PC                Routine            Line        Source
cesm.exe           000000000297C3A7  Unknown               Unknown  Unknown
cesm.exe           0000000002601FCD  shr_abort_mod_mp_         114  shr_abort_mod.F90
cesm.exe           000000000043BF0D  component_type_mo         257  component_type_mod.F90
cesm.exe           0000000000437ABD  component_mod_mp_         731  component_mod.F90
cesm.exe           000000000041D1D2  cime_comp_mod_mp_        3433  cime_comp_mod.F90
cesm.exe           00000000004372F7  MAIN__                    125  cime_driver.F90
cesm.exe           0000000000419622  Unknown               Unknown  Unknown
libc.so.6          000014D86BA3FEB0  Unknown               Unknown  Unknown
libc.so.6          000014D86BA3FF60  __libc_start_main     Unknown  Unknown
cesm.exe           0000000000419525  Unknown               Unknown  Unknown
--------------------------------------------------------------------------

To Reproduce Steps to reproduce the behavior:

  1. Set up a case with NorESM2_3_develop (commits: #8e716d6), with the compset and resolution mentioned above.
  2. Restart from the restart files of a NorESM2.0.x case: /nird/projects/NS9560K/noresm/cases/NHISTfrc2_OC25_20200107_hfreq/rest/1990-01-01-00000/
  3. The error shown above occurs during the restart.

Expected behavior The model should restart with the same atmospheric and ocean model resolutions.

Screenshots See the error message above.

Additional context I have made the following attempts to debug:

  1. All the values in Faoo_fdms_ocn seem OK: they are either 0 or of order 1e-18.
  2. The metadata of Faoo_fdms_ocn also seems OK: x2a_Faoo_fdms_ocn:_FillValue = 1.e+36.
  3. I wrote a separate Fortran program that checks the values with the intrinsic function 'ieee_is_nan'; the values are OK (a minimal sketch of such a check is shown after this list).
  4. I set all values of x2a_Faoo_fdms_ocn to zero in the restart files, but the model reports the same error.
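The following is a minimal sketch of such a NaN check, not the exact script used above. It assumes the netCDF Fortran library, a hypothetical restart file name, and that x2a_Faoo_fdms_ocn is stored as a 1-D variable; adjust the file name, variable name and rank to the actual coupler restart file.

program check_dms_nan
  ! Count NaNs in x2a_Faoo_fdms_ocn and print its range (sketch only).
  use netcdf
  use, intrinsic :: ieee_arithmetic, only: ieee_is_nan
  implicit none
  integer :: ncid, varid, dimids(1), npts
  real(8), allocatable :: fld(:)
  ! Assumed example file name, not the actual restart file.
  character(*), parameter :: fname = 'cpl_restart_example.nc'
  character(*), parameter :: vname = 'x2a_Faoo_fdms_ocn'

  call check( nf90_open(fname, nf90_nowrite, ncid) )
  call check( nf90_inq_varid(ncid, vname, varid) )
  call check( nf90_inquire_variable(ncid, varid, dimids=dimids) )
  call check( nf90_inquire_dimension(ncid, dimids(1), len=npts) )
  allocate(fld(npts))
  call check( nf90_get_var(ncid, varid, fld) )
  print *, 'points:', npts, '  NaNs:', count(ieee_is_nan(fld)), &
           '  min/max:', minval(fld), maxval(fld)
  call check( nf90_close(ncid) )
contains
  subroutine check(status)
    integer, intent(in) :: status
    if (status /= nf90_noerr) then
       print *, trim(nf90_strerror(status))
       stop 1
    end if
  end subroutine check
end program check_dms_nan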

The error is raised in the following lines of cime/src/drivers/mct/main/component_type_mod.F90:

       if(any(shr_infnan_isnan(comp%c2x_cc%rattr))) then
          do fld=1,nflds
             do n=1,lsize
                if(shr_infnan_isnan(comp%c2x_cc%rattr(fld,n))) then
                   call mpi_comm_rank(comp%mpicom_compid, rank, ierr)
                   call mct_gsMap_orderedPoints(comp%gsmap_cc, rank, gpts)
                   write(msg,'(a,a,a,i4,a,a,a,i8)')'component_mod:check_fields NaN found in ',trim(comp%name),' instance: ',&
                        comp_index,' field ',trim(mct_avect_getRList2c(fld, comp%c2x_cc)), ' 1d global index: ',gpts(n)
                   call shr_sys_abort(msg)
                endif
             enddo
          enddo
       endif
    endif

One speculation is that it might be related to the ocean-atmosphere mapping file.

The reported '1d global index' actually refers to the ocean grid, so the index can be much larger than the number of points in the coupler restart field (Faoo_fdms_ocn has ~55k points, while the reported OCN instance global index is ~1500k). Could slight changes of the ocean-atmosphere mapping files (in BLOM) between different NorESM2.x versions lead to this problem?
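As a hedged illustration (the ordering and grid size below are assumptions, not taken from the model code), the flagged 1-d global indices can be translated back to (i,j) positions on the ocean grid if the gsmap uses simple row-major ordering i + (j-1)*nx:

program idx2ij
  ! Translate a reported 1-d global index to (i,j), assuming row-major ordering.
  implicit none
  integer, parameter :: nx = 1440   ! placeholder: set to the actual tnx0.25 grid width
  integer :: idx, i, j
  idx = 1541736                     ! first index from the error message
  j = (idx - 1) / nx + 1
  i = mod(idx - 1, nx) + 1
  print '(a,i8,a,i5,a,i5)', 'global index ', idx, ' -> i =', i, ', j =', j
end program idx2ij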

YanchunHe avatar Feb 28 '25 10:02 YanchunHe

@YanchunHe - I looked up the commit #8e716d6, it is updating BLOM to v1.6.6.

We made some changes in the BLOM default settings for the v1.6 tag, but in principle we are still able to reproduce CMIP6 runs bfb by including some namelist settings and changing the units from MKS to CGS, see the wiki page: https://github.com/NorESMhub/BLOM/wiki/new-BLOM-with-CMIP6-settings

Probably this is not relevant for the bug you discovered, but you might need to do something similar to what is described in the BLOM wiki if you continue from a NorESM2 spinup.

TomasTorsvik avatar Feb 28 '25 12:02 TomasTorsvik

@YanchunHe - I looked up the commit #8e716d6, it is updating BLOM to v1.6.6.

We made some changes in the BLOM default settings for the v1.6 tag, but in principle we are still able to reproduce CMIP6 runs bfb by including some namelist settings and changing the units from MKS to CGS, see the wiki page: https://github.com/NorESMhub/BLOM/wiki/new-BLOM-with-CMIP6-settings

Probably this is not relevant for the bug you discovered, but you might need to do something similar to what is described in the BLOM wiki if you continue from a NorESM2 spinup.

Thank you, Tomas!

I manually changed the Externals entry so that BLOM points to v1.6.7.

Here the goal is not to reproduce CMIP6 experiments but to re-run a CMIP6-type experiment, and we don't need it to be bit identical.

The problem here is that the model cannot restart from the previous restart files. I guess the BLOM unit system does not matter, right?

@DirkOlivie mentioned he had a similar problem when setting up some experiments with NorESM2.0.x as well.

YanchunHe avatar Mar 03 '25 20:03 YanchunHe

@YanchunHe Have you tried different restart files, just to make sure that this particular set of restart files is not corrupted (the run seems to be from 2020, so the files have been copied back and forth probably)?

JorgSchwinger avatar Mar 04 '25 08:03 JorgSchwinger

Hi @YanchunHe , I have lately run a couple of times into issues (with master) because of units or wrong parameter values (for the units and coordinate system used). I am not entirely sure whether this also holds for the v1.6.x versions (but I would guess so), so I feel it is worth checking for these things. MKS versus CGS can be figured out by checking sigma(r) in the restart files: if it is greater than 1, it is MKS, otherwise CGS. Isopycnic versus hybrid: check the number of vertical layers; 53 is likely isopycnic, 56 is likely hybrid. Then, for isopycnic runs, one has to adjust the parameters as Tomas already suggested (to be put in user_nl_blom); note the different parameter values for different grid resolutions.

jmaerz avatar Mar 04 '25 09:03 jmaerz

@TomasTorsvik , can we rule out via the regression testing that the DMS fluxes have deteriorated (as in: are the fluxes tested by default in the regression testing)? I am asking since I recently had some issues with the NOy and NHx fluxes that seem to have persisted for quite some time, and Dirk also reported some issues with DMS fluxes recently, though I am not aware of the details.

jmaerz avatar Mar 04 '25 10:03 jmaerz

@jmaerz , @DirkOlivie , @gold2718 , @YanchunHe

For NorESM2.0/2.1/2.3 we are testing with DMS fluxes enabled.

For noresm2_3_develop we have

    <alias>N1850frc2</alias>
    <lname>1850_CAM60%NORESM%FRC2_CLM50%BGC-CROP_CICE%NORESM-CMIP6_BLOM%ECO_MOSART_SGLC_SWAV_BGC%BDRDDMS</lname>

The oldest baseline run is for 2.1.1 from February 2024. We also have some more recent baseline runs for 2.3 alpha tags.

It seems that we are not running with DMS fluxes by default for NorESM2.5/3 testing. For noresm_develop we have

    <alias>N1850</alias>
    <lname>1850_CAM70%LT%NORESM%CAMoslo_CLM60%SP_CICE_BLOM%HYB%ECO_MOSART_DGLC%NOEVOLVE_SWAV_SESP</lname>

Is there any reason why we should not run with DMS for NorESM2.5/3 testing?

TomasTorsvik avatar Mar 04 '25 11:03 TomasTorsvik

DMS fluxes are always on, there is no option to turn them off as far as I am aware.

JorgSchwinger avatar Mar 04 '25 11:03 JorgSchwinger

@JorgSchwinger - DMS fluxes are always on in iHAMOCC, but I thought you needed the compset setting BGC%BDRDDMS to tell the mediator to transfer DMS fluxes to the atmosphere. Is this not the case?

TomasTorsvik avatar Mar 04 '25 12:03 TomasTorsvik

Ok, yes that is probably correct (but I wouldn't know if there is something similar with the new mediator)

JorgSchwinger avatar Mar 04 '25 12:03 JorgSchwinger

It seems there is no option BGC%BDRDDMS available for NorESM2.5/3. @gold2718 - do you know, are DMS fluxes transferred by default without the additional BGC setting for NorESM2.5/3?

TomasTorsvik avatar Mar 04 '25 12:03 TomasTorsvik

It seems there is no option BGC%BDRDDMS available for NorESM2.5/3. @gold2718 - do you know, are DMS fluxes transferred by default without the additional BGC setting for NorESM2.5/3?

The new method is that if the NUOPC configuration parameter, flds_dms = "on", then CAM will use that field as its DMS source. There is no longer a namelist switch in CAM to do this.

In turn, that NUOPC parameter is set if "ecosys" in case.get_value("BLOM_TRACER_MODULES"): so it sounds like as long as iHAMOCC is part of the BLOM configuration, the atmosphere will expect DMS from the ocean.
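A quick way to check this for an existing case (assuming the standard CIME tools are run from the case directory) is:

./xmlquery BLOM_TRACER_MODULES

If the returned list contains ecosys, then dms_from_ocn should end up as "on" and flds_dms as .true. in nuopc.runconfig.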

gold2718 avatar Mar 04 '25 20:03 gold2718

I had a similar problem when I was running NorESM2.0.2 (with updated CIME to be able to run on Betzy) two weeks ago: an N1850 simulation (2x2 degrees in the atmosphere) stopped after 3 model hours (6 atmospheric time steps) with a similar error message:

ERROR:
 component_mod:check_fields NaN found in OCN instance:    1 field Faoo_fdms_ocn
 1d global index:    60701

DirkOlivie avatar Mar 05 '25 09:03 DirkOlivie

It may be worth mentioning that the DMS flux issue Dirk reported does not seem to appear when using the current master of BLOM (I recently used Dirk's setup and plugged in master, and the issue did not appear there), but that run uses the MCT coupler, as far as I understand.

jmaerz avatar Mar 05 '25 10:03 jmaerz

I have not seen these errors when running system tests. I am now trying to run tests for noresm2_3_alpha03 against the noresm2.1.1 baseline; it is taking a while to get these jobs started.

@DirkOlivie - is the error you found reproducible? Did you do anything to mitigate/bypass the error?

TomasTorsvik avatar Mar 05 '25 12:03 TomasTorsvik

@TomasTorsvik

The error was reproducible. By commenting out the call to shr_sys_abort(msg) (see Yanchun's code excerpt above), the model could bypass the check, but it still stopped after around half a month of simulation (801 atmospheric time steps).

DirkOlivie avatar Mar 05 '25 13:03 DirkOlivie

@YanchunHe - I looked up the commit #8e716d6, it is updating BLOM to v1.6.6.

We made some changes in the BLOM default settings for the v1.6 tag, but in principle we are still able to reproduce CMIP6 runs bfb by including some namelist settings and changing the units from MKS to CGS, see the wiki page: https://github.com/NorESMhub/BLOM/wiki/new-BLOM-with-CMIP6-settings

Probably this is not relevant for the bug you discovered, but you might need to do something similar to what is described in the BLOM wiki if you continue from a NorESM2 spinup.

Again, I will rerun the case with 'cgs' on and see if it helps.

YanchunHe avatar Mar 06 '25 20:03 YanchunHe

@YanchunHe Have you tried different restart files, just to make sure that this particular set of restart files is not corrupted (the run seems to be from 2020, so the files have been copied back and forth probably)?

Thanks, and yes, I tried to restart from different years. The same error.

YanchunHe avatar Mar 06 '25 20:03 YanchunHe

Hi @YanchunHe , I have lately run a couple of times into issues (with master) because of units or wrong parameter values (for the units and coordinate system used). I am not entirely sure whether this also holds for the v1.6.x versions (but I would guess so), so I feel it is worth checking for these things. MKS versus CGS can be figured out by checking sigma(r) in the restart files: if it is greater than 1, it is MKS, otherwise CGS. Isopycnic versus hybrid: check the number of vertical layers; 53 is likely isopycnic, 56 is likely hybrid. Then, for isopycnic runs, one has to adjust the parameters as Tomas already suggested (to be put in user_nl_blom); note the different parameter values for different grid resolutions.

How can I check whether it is using isopycnic or hybrid coordinates before any output data is written?

NorESM2.3 is isopycnic by default, right?

YanchunHe avatar Mar 06 '25 20:03 YanchunHe

@YanchunHe , yes, 2.3 should be isopycnic. You can try:

./xmlquery BLOM_VCOORD

in your case setup, which should result in:

BLOM_VCOORD: isopyc_bulkml

(otherwise you can set it). The CMIP6 input data you're using should be/are isopycnic.
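If BLOM_VCOORD is set to something else, it can presumably be changed with the usual CIME command (assuming the standard xmlchange syntax):

./xmlchange BLOM_VCOORD=isopyc_bulkml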

jmaerz avatar Mar 06 '25 20:03 jmaerz

I tried with the latest release of NorESM2_3_develop (alpha03, compared to the alpha01 I used before, with some reported issues fixed, e.g. BLOM #474, and the air-land masks, ocean topography, etc.), and the restart now seems to work.

I also made some changes in the user_nl_* files to reduce the output, as I mainly need some daily ocean transport fields. It seems these changes also have an impact on the crash of the model, although I did not expect that. I will need to check this further later.

I will update it here later.

Thanks all!

YanchunHe avatar Mar 12 '25 11:03 YanchunHe

@YanchunHe, when you say it impacts the crash, is the crash always similar to what is posted in the issue header (i.e., NaNs found in a field)?

gold2718 avatar Mar 12 '25 12:03 gold2718

Ok, yes that is probably correct (but I wouldn't know if there is something similar with the new mediator)

@JorgSchwinger - the way the new nuopc mediator works with DMS is as follows:

  • there is a driver config variable - flds_dms - that is in nuopc.runconfig and is available to all components. As a result both CAM and BLOM will know if DMS is passed from BLOM to CAM. This config variable is set as follows:
  • flds_dms is defined in components/cmeps/cime_config/namelist_definition_drv.xml
<entry id="flds_dms">
  <type>logical</type>
  <category>flds</category>
  <group>ALLCOMP_attributes</group>
  <desc>
    Pass DMS from OCN to ATM component
  </desc>
  <values>
    <value>.false.</value>
    <value dms_from_ocn="on">.true.</value>
  </values>
</entry>
  • components/cmeps/cime_config/buildnml sets the config variable dms_from_ocn as follows
if config["COMP_OCN"] == "blom":
    if "ecosys" in case.get_value("BLOM_TRACER_MODULES"):
        config["dms_from_ocn"] = "on"
    else:
        config["dms_from_ocn"] = "off"
  • both CAM and BLOM then have access to flds_dms
  • in BLOM it is set in ocn_comp_nuopc.F90 as follows:
! Determine if will export dms
     call NUOPC_CompAttributeGet(gcomp, name='flds_dms', value=cvalue, &
          ispresent=ispresent, isset=isset, rc=rc)
     if (ChkErr(rc, __LINE__, u_FILE_u)) return
     if (isPresent .and. isSet) then
        read(cvalue,*) flds_dms
        if (.not. hamocc_defined) then
           ! if not defined HAMOCC and request to export dms, abort
           if (flds_dms) then
              write(lp,'(a)') subname//' cannot export dms with out HAMOCC defined'
              call xchalt(subname)
              stop subname
           end if
        end if
     else
        flds_dms = .false.
     end if
     write(msg,'(a,l1)') subname//': export dms ', flds_dms
     call blom_logwrite(msg)
  • in CAM it is used in atm_import_export as follows:
call NUOPC_CompAttributeGet(gcomp, name='flds_dms', value=cvalue, ispresent=ispresent, isset=isset, rc=rc)
    if (ChkErr(rc,__LINE__,u_FILE_u)) return
    if (ispresent .and. isset) then
       read(cvalue,*) dms_from_ocn
    else
       dms_from_ocn = .false.
    end if
    if (masterproc) write(iulog,'(a,l)') trim(subname)//'dms_from_ocn = ',dms_from_ocn
    write(6,'(a,l)')trim(subname)//'dms_from_ocn = ',dms_from_ocn

Should this be documented somewhere to clarify how optional variable transfer between components occurs with CMEPS?

mvertens avatar Mar 12 '25 13:03 mvertens

@YanchunHe, when you say it impacts the crash, is the crash always similar to what is posted in the issue header (i.e., NaNs found in a field)?

Yes, it is the same: NaN found in Fa2x_dms_xxx.

I have some work at hand right now, but will return to this soon.

Thanks a lot!

YanchunHe avatar Mar 12 '25 16:03 YanchunHe

It turns out that changing the output frequency of the vertical mass flux in BLOM causes the problem.

If I add the following in user_nl_blom:

LYR_WFLX     = 4, 4, 0

the run fails with the same "NaN found in Fa2x_dms_xxx" error!

By default, LYR_WFLX has monthly output; here I want to turn on daily output as well.

This worked (with LYR_WFLX on) for the 1-degree ocean version (NorESM2-MM) of noresm2.0.8.

This does not sound related to the DMS flux itself; I am not sure how it affects restarting the DMS field from the restart files.

YanchunHe avatar Mar 24 '25 08:03 YanchunHe

@YanchunHe is this variable saved in the diagnostic file or the restart file?

monsieuralok avatar Mar 24 '25 09:03 monsieuralok

@YanchunHe is this variable saved in the diagnostic file or the restart file?

It is not in the restart file. But by default there is monthly output for it if we don't change it in user_nl_blom (i.e., keep LYR_WFLX = 0, 4, 0).

YanchunHe avatar Mar 24 '25 12:03 YanchunHe

@YanchunHe I am testing this setup and will update you.

monsieuralok avatar Mar 24 '25 12:03 monsieuralok

Are you saving the daily LYR_WFLX in monthly files (as is the standard)? At 1/4-degree resolution that must be a huge file. Are you using PNETCDF? Otherwise everything would be gathered on a single processor, and that could be an issue memory-wise.

JorgSchwinger avatar Mar 24 '25 14:03 JorgSchwinger

Are you saving the daily LYR_WFLX in monthly files (as is the standard)? At 1/4-degree resolution that must be a huge file. Are you using PNETCDF? Otherwise everything would be gathered on a single processor, and that could be an issue memory-wise.

Yes, we need to save the transport fields to diagnose the water parcel trajectories. WFLX can be an output but is not a must. It is indeed quite large: a single daily variable in a monthly file is about 11 GB, and we have several variables. But it looks like the model and storage can handle this.
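As a rough cross-check of that number (the grid dimensions below are assumptions, since the exact tnx0.25 size is not stated in this thread): a 3-D field of roughly 1440 x 1152 points with 53 layers, written daily for 31 days as 4-byte floats, is about 1440 * 1152 * 53 * 31 * 4 bytes ≈ 10.9e9 bytes, i.e. about 11 GB per variable per monthly file, consistent with the size reported above.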

YanchunHe avatar Mar 24 '25 21:03 YanchunHe

Just regarding the daily files: if you want to avoid such large files, you could change the output settings to

GLB_FNAMETAG = 'hd','hm','hy'
GLB_AVEPERIO = 1, 30, 365
GLB_FILEFREQ = 1, 30, 365

(instead of using GLB_FILEFREQ = 30, 30, 365). This should write the daily output in daily files (although I have not used this setting myself).
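Putting the pieces from this thread together, the relevant part of user_nl_blom might then look roughly like this (an untested sketch that only combines the settings already quoted above; the LYR_WFLX line is the daily setting Yanchun is trying to enable):

GLB_FNAMETAG = 'hd','hm','hy'
GLB_AVEPERIO = 1, 30, 365
GLB_FILEFREQ = 1, 30, 365
LYR_WFLX     = 4, 4, 0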

JorgSchwinger avatar Mar 25 '25 06:03 JorgSchwinger