E3SM icon indicating copy to clipboard operation
E3SM copied to clipboard

SCM uninitialized memory fixes

Open bogensch opened this issue 2 months ago • 8 comments

This PR addresses an issue that arises when reading variables from an IOP file with fill_ends = false (the default for many fields). In these cases, values near the model top often remain uninitialized because the model only fills pressure levels for which data exist. To ensure consistent initialization, this PR explicitly zero-initializes all variables before reading each field where fill_ends are not applied. Required state variables (T, Q, U, and V) remain unchanged in behavior, as they always apply fill_ends automatically.

In addition, this PR removes the obsolete code block that computed large-scale vertical velocity on the interface grid. That logic was only needed for the Eulerian dy-core version of the SCM and is no longer required.

Also, fix a naming issue where q1obs was being read in where it should be q2obs. This bug has no effect on any of our cases since q2obs is only read in as a diagnostic.

Closes issue #6189

bogensch avatar Nov 14 '25 23:11 bogensch

With this branch, I tested:

SMS_D_Ln9_P24x1.ne4_ne4.FDPSCREAM-ARM97.pm-cpu_intel

but see error:

17:  (shr_scam_getCloseLatLonPIO) WARNING: Cant find appropriate latitude or longitu
17:  de coordinate variables

17: forrtl: error (65): floating invalid
17: Image              PC                Routine            Line        Source             
17: libpthread-2.31.s  0000148CC7A52910  Unknown               Unknown  Unknown
17: e3sm.exe           00000000038E7DCC  apply_iop_forcing          94  apply_iop_forcing.F90
17: e3sm.exe           0000000003627E0E  se_iop_intr_mod_m         366  se_iop_intr_mod.F90
17: e3sm.exe           00000000035C9161  dyn_comp_mp_dyn_r         470  dyn_comp.F90
17: e3sm.exe           0000000002323687  stepon_mp_stepon_         587  stepon.F90
17: e3sm.exe           0000000000990A21  cam_comp_mp_cam_r         352  cam_comp.F90
17: e3sm.exe           0000000000948A65  atm_comp_mct_mp_a         721  atm_comp_mct.F90
17: e3sm.exe           00000000004CD41F  component_mod_mp_         784  component_mod.F90
17: e3sm.exe           0000000000489918  cime_comp_mod_mp_        3233  cime_comp_mod.F90
17: e3sm.exe           00000000004B5687  MAIN__                    153  cime_driver.F90

I do see that SMS_D_R_Ld5.ne4_ne4.FSCM-ARM97.pm-cpu_intel.eam-scm seems to run OK

ndkeen avatar Nov 20 '25 18:11 ndkeen

Rooviewer Clock   Follow along on Roo Cloud

Reviewing your PR now. I'll share feedback in a few minutes!

roomote[bot] avatar Dec 05 '25 19:12 roomote[bot]

Rooviewer Clock   See task on Roo Cloud

Review complete. The approach to move initialization inside getinterpncdata is sound for most variables, but missed a critical edge case for variables read with fill_ends=.true.. This likely caused the reported runtime crashes.

  • [ ] Restore explicit zero-initialization for uls, vls, and wfld since they use fill_ends=.true..

Mention @roomote in a comment to request specific changes to this pull request or fix all unresolved issues.

roomote[bot] avatar Dec 05 '25 19:12 roomote[bot]

Confirmed that SMS_D_Ln9_P24x1.ne4_ne4.FDPSCREAM-ARM97.pm-cpu_intel is now passing on this branch.

rljacob avatar Dec 10 '25 22:12 rljacob

merged to next.

rljacob avatar Dec 11 '25 02:12 rljacob

Baseline diff with SMS_R_Ld5.ne4_ne4.FSCM-ARM97.chrysalis_intel.eam-scm expected?

rljacob avatar Dec 11 '25 17:12 rljacob

No, I did not expect that. Let me look into this.

bogensch avatar Dec 11 '25 19:12 bogensch

I can reproduce these differences and I think I understand it and how to fix; but I'm going to need more time to sort through this properly when I'm back from vacation. Basically, I'm not comfortable merging this PR as is.

bogensch avatar Dec 12 '25 01:12 bogensch