Master is not BFB in radiation surface fluxes for any new run on Weaver

Open AaronDonahue opened this issue 2 years ago • 14 comments

The following fields from rrtmgp are non-BFB when comparing outputs from two identical runs: sfc_flux_dif_nir, sfc_flux_dir_nir, sfc_flux_dif_vis and sfc_flux_dir_vis.

This has been confirmed by running the monolithic_vs_restarted test (which FAILS) and the homme_shoc_cld_spa_p3_rrtmgp test. For the latter, I ran it, renamed the output, ran it again, and compared the new output with the original one, so the two runs were identical. That result confirms that this isn't a restart problem; it means we are non-BFB for every new run.
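
As a minimal sketch of what a bit-for-bit (BFB) comparison between two runs means, the snippet below compares two sequences of double-precision values by their raw bytes rather than with a tolerance. The function names and the sample values are hypothetical and purely illustrative; this is not the comparison tool used by the SCREAM tests.

```python
import struct


def bits(x: float) -> bytes:
    """Return the raw IEEE-754 double-precision bytes of x."""
    return struct.pack("<d", x)


def non_bfb_indices(a, b):
    """Return the indices where two equal-length float sequences differ in any bit."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return [i for i, (x, y) in enumerate(zip(a, b)) if bits(x) != bits(y)]


# Hypothetical values standing in for one field from two "identical" runs.
run1 = [0.1 + 0.2, 1.0, 250.0]
run2 = [0.3, 1.0, 250.0]
print(non_bfb_indices(run1, run2))  # [0]
```

Comparing bytes instead of using `==` means that a NaN matches a NaN with the same bit pattern, and that `0.0` and `-0.0` count as a difference, which is the stricter notion usually intended by BFB.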

To reproduce, one only has to make these changes:

[asdonah@weaver11 scream]$ git diff
diff --git a/components/scream/tests/coupled/dynamics_physics/model_restart/model_output.yaml b/components/scream/tests/coupled/dynamics_physics/model_restart/model_output.yaml
index 441dace..829228a 100644
--- a/components/scream/tests/coupled/dynamics_physics/model_restart/model_output.yaml
+++ b/components/scream/tests/coupled/dynamics_physics/model_restart/model_output.yaml
@@ -26,6 +26,16 @@ Fields:
       - LW_flux_up
       - SW_flux_dn
       - SW_flux_up
+      - sfc_alb_dif_nir
+      - sfc_alb_dif_vis
+      - sfc_alb_dir_nir
+      - sfc_alb_dir_vis
+      - sfc_flux_dif_nir
+      - sfc_flux_dif_vis
+      - sfc_flux_dir_nir
+      - sfc_flux_dir_vis
+      - sfc_flux_lw_dn
+      - sfc_flux_sw_net
   Dynamics:
     Field Names:
       - Qdp_dyn
diff --git a/components/scream/tests/coupled/dynamics_physics/model_restart/model_restart_output.yaml b/components/scream/tests/coupled/dynamics_physics/model_restart/model_restart_output.yaml
index e2c32a7..7a50200 100644
--- a/components/scream/tests/coupled/dynamics_physics/model_restart/model_restart_output.yaml
+++ b/components/scream/tests/coupled/dynamics_physics/model_restart/model_restart_output.yaml
@@ -26,6 +26,16 @@ Fields:
       - LW_flux_up
       - SW_flux_dn
       - SW_flux_up
+      - sfc_alb_dif_nir
+      - sfc_alb_dif_vis
+      - sfc_alb_dir_nir
+      - sfc_alb_dir_vis
+      - sfc_flux_dif_nir
+      - sfc_flux_dif_vis
+      - sfc_flux_dir_nir
+      - sfc_flux_dir_vis
+      - sfc_flux_lw_dn
+      - sfc_flux_sw_net
   Dynamics:
     Field Names:
       - Qdp_dyn

AaronDonahue avatar Jun 06 '22 23:06 AaronDonahue

This probably points to some memory not being initialized properly. @brhillman do you have any thoughts?

AaronDonahue avatar Jun 06 '22 23:06 AaronDonahue

I'll also add that I don't see this problem on blake but do see it on weaver, so it could be a CPU vs GPU thing. Or it is possibly sensitive to the compiler.

AaronDonahue avatar Jun 06 '22 23:06 AaronDonahue

Are these diagnostic or prognostic outputs?

ambrad avatar Jun 06 '22 23:06 ambrad

> Are these diagnostic or prognostic outputs?

Prognostic. I believe they are the forcing passed from rrtmgp to the surface.

AaronDonahue avatar Jun 07 '22 01:06 AaronDonahue

@AaronDonahue can you summarize the SCREAM process ordering for me? I assume the surface fluxes of T/q/u are applied in SHOC, so I'm particularly interested in where radiation and SHOC sit in the loop relative to dynamics and surface coupling, and how it might differ from E3SM.

whannah1 avatar Jun 07 '22 14:06 whannah1

@whannah1, for sure. The process order is SRFC-Import -> Dynamics -> SHOC -> CloudFraction -> SPA -> P3 -> RRTMGP -> SRFC-Export

*Note: CloudFraction is similar to what is done in SHOC for EAM, but we made calculating the total and ice cld fraction its own process.

AaronDonahue avatar Jun 07 '22 15:06 AaronDonahue

I wrote the same thing as Aaron but forgot to hit send. Note as well that this info is all encoded in namelist_scream.xml, but perhaps in a not-easy-to-parse way:

    <physics>
      <atm_procs_list type="string">(mac_aero_mic,rrtmgp)</atm_procs_list>
    <Type type="string">Group</Type>
      <Schedule__Type type="string">Sequential</Schedule__Type>
    <Number__of__Subcycles constraints="gt 0" type="integer">1</Number__of__Subcycles>
      <Enable__Precondition__Checks type="logical">true</Enable__Precondition__Checks>
      <Enable__Postcondition__Checks type="logical">true</Enable__Postcondition__Checks>
    <mac_aero_mic>
      <atm_procs_list type="string">(shoc,cldFraction,spa,p3)</atm_procs_list>
      <Number__of__Subcycles type="integer">3</Number__of__Subcycles>
      <Type type="string">Group</Type>
      <Schedule__Type type="string">Sequential</Schedule__Type>
    <Enable__Precondition__Checks type="logical">true</Enable__Precondition__Checks>
      <Enable__Postcondition__Checks type="logical">true</Enable__Postcondition__Checks>
    <shoc><Grid type="string">Physics GLL</Grid>
    <Number__of__Subcycles constraints="gt 0" type="integer">1</Number__of__Subcycles>
      <Enable__Precondition__Checks type="logical">true</Enable__Precondition__Checks>
      <Enable__Postcondition__Checks type="logical">true</Enable__Postcondition__Checks>
    </shoc>

    
    <cldFraction><Grid type="string">Physics GLL</Grid>
    <Number__of__Subcycles constraints="gt 0" type="integer">1</Number__of__Subcycles>
      <Enable__Precondition__Checks type="logical">true</Enable__Precondition__Checks>
      <Enable__Postcondition__Checks type="logical">true</Enable__Postcondition__Checks>
    </cldFraction>

    
    <spa>
      <SPA__Remap__File type="string">/global/cfs/cdirs/e3sm/inputdata/atm/scream/init/map_ne30np4_to_ne120np4_mono_20220502.nc</SPA__Remap__File>
      <SPA__Data__File type="string">/global/cfs/cdirs/e3sm/inputdata/atm/scream/init/spa_file_unified_and_complete_ne30_20220428.nc</SPA__Data__File>
    <Grid type="string">Physics GLL</Grid>
    <Number__of__Subcycles constraints="gt 0" type="integer">1</Number__of__Subcycles>
      <Enable__Precondition__Checks type="logical">true</Enable__Precondition__Checks>
      <Enable__Postcondition__Checks type="logical">true</Enable__Postcondition__Checks>
    </spa>

    
    <p3><Grid type="string">Physics GLL</Grid>
    <Number__of__Subcycles constraints="gt 0" type="integer">1</Number__of__Subcycles>
      <Enable__Precondition__Checks type="logical">true</Enable__Precondition__Checks>
      <Enable__Postcondition__Checks type="logical">true</Enable__Postcondition__Checks>
    </p3>

    
    </mac_aero_mic>

    <rrtmgp>
      <active_gases type="array(string)">h2o, co2, o3, n2o, co, ch4, o2, n2</active_gases>
      <Orbital__Year type="integer">-9999</Orbital__Year>
      <Orbital__Eccentricity type="integer">-9999</Orbital__Eccentricity>
      <Orbital__Obliquity type="integer">-9999</Orbital__Obliquity>
      <Orbital__MVELP type="integer">-9999</Orbital__MVELP>
      <rad_frequency type="integer">4</rad_frequency>
      <Grid type="string">Physics GLL</Grid>
    <Number__of__Subcycles constraints="gt 0" type="integer">1</Number__of__Subcycles>
      <Enable__Precondition__Checks type="logical">true</Enable__Precondition__Checks>
      <Enable__Postcondition__Checks type="logical">true</Enable__Postcondition__Checks>
    </rrtmgp>

    </physics>
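
One way to make this easier to read is to walk the nested `atm_procs_list` entries and print the ordering as a tree. The sketch below does that for a trimmed-down copy of the fragment above (only the tags needed for ordering and subcycling are kept). The `describe` helper is hypothetical and is not part of SCREAM.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the <physics> block above; all other settings are omitted.
FRAGMENT = """
<physics>
  <atm_procs_list type="string">(mac_aero_mic,rrtmgp)</atm_procs_list>
  <Number__of__Subcycles type="integer">1</Number__of__Subcycles>
  <mac_aero_mic>
    <atm_procs_list type="string">(shoc,cldFraction,spa,p3)</atm_procs_list>
    <Number__of__Subcycles type="integer">3</Number__of__Subcycles>
    <shoc><Number__of__Subcycles type="integer">1</Number__of__Subcycles></shoc>
    <cldFraction><Number__of__Subcycles type="integer">1</Number__of__Subcycles></cldFraction>
    <spa><Number__of__Subcycles type="integer">1</Number__of__Subcycles></spa>
    <p3><Number__of__Subcycles type="integer">1</Number__of__Subcycles></p3>
  </mac_aero_mic>
  <rrtmgp><Number__of__Subcycles type="integer">1</Number__of__Subcycles></rrtmgp>
</physics>
"""


def describe(node, depth=0):
    """Return indented lines giving the process order below `node`."""
    subcycles = node.findtext("Number__of__Subcycles", default="1").strip()
    procs = node.findtext("atm_procs_list")
    kind = "group" if procs else "process"
    lines = [f"{'  ' * depth}{node.tag} ({kind}, subcycles={subcycles})"]
    if procs:
        # The list looks like "(a,b,c)"; each name is also a child tag.
        for name in procs.strip().strip("()").split(","):
            child = node.find(name.strip())
            if child is not None:
                lines.extend(describe(child, depth + 1))
    return lines


print("\n".join(describe(ET.fromstring(FRAGMENT))))
```

For this fragment the tree shows `mac_aero_mic` as a group with 3 subcycles containing shoc, cldFraction, spa and p3 in that order, followed by rrtmgp.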

PeterCaldwell avatar Jun 07 '22 16:06 PeterCaldwell

It's on a light todo list to print this information to the atm.log file for reference in a human-readable way.

AaronDonahue avatar Jun 07 '22 16:06 AaronDonahue

@AaronDonahue thanks, can you also clarify where the restart files are written? Also, as you know, E3SM starts with a "redundant" call to tphysbc upon restart. Does SCREAM do something similar?

whannah1 avatar Jun 07 '22 16:06 whannah1

My understanding is that they are written after rrtmgp, at the end of the atmosphere step. We do not have the redundant tphysbc that EAM has, but recall our process order is different too. In EAM we have IMPORT -> tphysac -> dynamics -> tphysbc -> EXPORT.

But maybe more importantly, this issue is unrelated to restarts. I happened to catch it because a restart test failed, but if I run a freshly initialized run twice, i.e., two completely new and clean runs independent of each other, the output is non-BFB in those four variables. Hence this probably points to uninitialized memory being used somewhere.

Although, since we are on the topic, I am also open to discussing whether or not we are handling restarts correctly. My intuition says yes, because we have a restart test that runs a baseline 2-step simulation and compares it against a 1-step run followed by a 1-step restarted run, and they match (besides the issue being discussed here).
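
The logic of that kind of restart test can be sketched with a toy model. Everything below (the `step` function, the state fields `T` and `flux`, and the restart helpers) is hypothetical and only illustrates the monolithic-vs-restarted comparison; it is not the actual SCREAM test.

```python
def step(state):
    """Advance a toy model one step; the new flux depends on the previous flux."""
    new_flux = 0.5 * state["flux"] + 0.01 * state["T"]
    return {"T": state["T"] + 0.1 * new_flux, "flux": new_flux}


def write_restart(state, fields):
    """Keep only the listed fields, as a restart file would."""
    return {name: state[name] for name in fields}


def read_restart(restart, defaults):
    """Rebuild a state; fields missing from the restart fall back to defaults."""
    state = dict(defaults)
    state.update(restart)
    return state


initial = {"T": 300.0, "flux": 100.0}

# Monolithic baseline: two steps in one run.
monolithic = step(step(initial))


def restarted_run(fields):
    """One step, write a restart with `fields`, read it back, one more step."""
    first = step(initial)
    restart = write_restart(first, fields)
    resumed = read_restart(restart, defaults={"T": 0.0, "flux": 0.0})
    return step(resumed)


print(restarted_run(["T", "flux"]) == monolithic)  # True
print(restarted_run(["T"]) == monolithic)          # False
```

The second case shows how a field that the next step depends on, but that is missing from the restart, makes the restarted run diverge from the monolithic one.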

AaronDonahue avatar Jun 07 '22 16:06 AaronDonahue

From my understanding of E3SM, the redundant tphysbc call is needed for several reasons, one of which is that the downward radiative fluxes are not saved to the restart files, so they need to be "re-populated" before doing the surface coupling. I assume in SCREAM you guys have avoided this problem by simply saving those rad fluxes to the restart file.

This seems closely related to the initialization problem because E3SM calls tphysbc before surface coupling for similar reasons, i.e. the surface components need those rad flux values to have a reasonable start up. So how does SCREAM handle this issue? Do you also call parts of the atmosphere physics to be able to export proper values prior to the first call to the surface components?

EDIT - after discussing this stuff further with Aaron the initialization details seem orthogonal to the current issue, so we don't have to discuss this stuff any further.

whannah1 avatar Jun 07 '22 17:06 whannah1

Could this be related to what we are seeing in #2201? @brhillman, @ndkeen, bringing this issue back up since it looks like we have been running into some other non-BFB stuff lately that may be related to radiation.

AaronDonahue avatar Mar 13 '23 15:03 AaronDonahue

  - sfc_alb_dir_vis
  - sfc_flux_lw_dn
  - sfc_flux_sw_net

are in our default output, and our nightly tests pass on pm-gpu/crusher. The monolithic_vs_restarted test passes on weaver as well. Is this still an issue?

bartgol avatar Mar 13 '23 15:03 bartgol

@AaronDonahue can we close this?

bartgol avatar Sep 16 '23 02:09 bartgol