E3SM ELM is hanging writing output for ne120/ne256

@ndkeen found that the model hangs when trying to write elm.h0 files at ne120/ne256.

Sep 22 '22 15:09 bishtgautam

Thanks. I have been reporting this on other issues, but it's good to have a separate one. So far this has only happened with scream cases. For me, ne30 seems OK, but at ne120/ne256 the run hangs when it tries to write the elm.h0 files, which are written after one simulated month. I'm able to work around the issue by asking that the files not be written:

cat <<EOF >> user_nl_elm
hist_nhtfrq = -999999999  ! Output frequency
hist_mfilt = 1            ! History file has 1 time sample
hist_empty_htapes = .true.
EOF

It might also happen at ne1024, but we have not yet run a full simulated month at that resolution.

Also, so far this has only happened on pm-gpu, but that's only because I haven't yet tried it on pm-cpu, where the simulation is much slower. Can we change settings to force the elm.h0 write to happen sooner, to see whether the hang also occurs on pm-cpu (or another machine)?
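One way this could be tried, as a sketch only: in ELM/CLM history namelists, a negative hist_nhtfrq is interpreted as a number of hours (0 means monthly, positive means time steps), so a daily elm.h0 write could be requested as below. The value -24 is an illustrative assumption and was not tested in this thread.

```shell
# Sketch: ask ELM to write elm.h0 once per simulated day instead of monthly,
# so that a hang at write time would show up after one day of simulation.
# The value -24 is an assumption chosen for illustration.
cat <<EOF >> user_nl_elm
hist_nhtfrq = -24   ! negative values are hours: write every 24 model hours
hist_mfilt = 1      ! one time sample per history file
EOF
```

This replaces the "never write" workaround above rather than complementing it, so hist_empty_htapes should not be set to .true. in the same case.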

I see that @AaronDonahue is able to write elm.h1 files on summit, but my scream cases do not seem to be writing these files at all.

@ndkeen

Sep 22 '22 16:09 ndkeen

@ndkeen @bishtgautam I think the runs by @lee1046 that also showed this issue were using ne30pg2.

Sep 22 '22 16:09 whannah1

Oh, actually I already had an issue for this: https://github.com/E3SM-Project/scream/issues/1920

Sep 22 '22 16:09 ndkeen

Closing this issue; the discussion will continue in https://github.com/E3SM-Project/scream/issues/1920

Sep 22 '22 16:09 bishtgautam

I was able to reproduce this issue with a Python script provided by @lee1046:

compset = F2010-MMF1
res = ne30pg2_r05_oECv3
arch = GNUGPU
num_nodes = 128

The latest E3SM master and the latest SCORPIO master were used, and the case wall-time limit was set to 30 minutes.

The first run shows an incomplete elm.h0 file and a possibly corrupted file name (c.2000-01.nc):

73402172 Sep 13 18:50 E3SM.GNUGPU.ne30pg2_r05_oECv3.F2010-MMF1.NXY_64x1_DX_1000.MOMFB.BVT.02.scorpio.c.2000-01.nc
   65808 Sep 13 18:50 E3SM.GNUGPU.ne30pg2_r05_oECv3.F2010-MMF1.NXY_64x1_DX_1000.MOMFB.BVT.02.scorpio.elm.h0.2000-01.nc
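A quick way to spot a suspiciously small history file like the elm.h0 one above is to list NetCDF files under a size threshold. This is a minimal sketch: the function name list_small_nc, the RUNDIR variable and the 1 MiB default threshold are all hypothetical choices for illustration, not part of E3SM or CIME.

```shell
# Sketch: print .nc files in a run directory that are smaller than a threshold.
# list_small_nc, RUNDIR and the 1 MiB default are illustrative assumptions.
list_small_nc() {
  dir="$1"
  min_bytes="${2:-1048576}"
  # "-size -Nc" matches files strictly smaller than N bytes. The "c" suffix
  # matters: larger units such as "M" round sizes up before comparing.
  find "$dir" -maxdepth 1 -type f -name '*.nc' -size -"${min_bytes}"c
}

# Usage: check the current directory unless RUNDIR is set.
list_small_nc "${RUNDIR:-.}"
```

With the default threshold, the 65808-byte elm.h0 file in the listing above would be reported, while the 73402172-byte file would not.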

The second run has no output .nc files in the run directory, indicating that it might hang somewhere while reading variables from input files.

Not reproducible with SCORPIO classic (after applying the patch from https://github.com/E3SM-Project/scorpio/pull/479), which completed with a run time of less than 15 minutes:

  73391948 Sep  9 11:46 E3SM.GNUGPU.ne30pg2_r05_oECv3.F2010-MMF1.NXY_64x1_DX_1000.MOMFB.BVT.00.cice.h.2000-01.nc
  65846848 Sep  9 11:47 E3SM.GNUGPU.ne30pg2_r05_oECv3.F2010-MMF1.NXY_64x1_DX_1000.MOMFB.BVT.00.cice.r.2000-02-03-00000.nc
 298812204 Sep  9 11:48 E3SM.GNUGPU.ne30pg2_r05_oECv3.F2010-MMF1.NXY_64x1_DX_1000.MOMFB.BVT.00.cpl.r.2000-02-03-00000.nc
 246544976 Sep  9 11:47 E3SM.GNUGPU.ne30pg2_r05_oECv3.F2010-MMF1.NXY_64x1_DX_1000.MOMFB.BVT.00.eam.h0.2000-01.nc
4844169904 Sep  9 11:48 E3SM.GNUGPU.ne30pg2_r05_oECv3.F2010-MMF1.NXY_64x1_DX_1000.MOMFB.BVT.00.eam.r.2000-02-03-00000.nc
 496762848 Sep  9 11:48 E3SM.GNUGPU.ne30pg2_r05_oECv3.F2010-MMF1.NXY_64x1_DX_1000.MOMFB.BVT.00.eam.rh0.2000-02-03-00000.nc
  11585280 Sep  9 11:48 E3SM.GNUGPU.ne30pg2_r05_oECv3.F2010-MMF1.NXY_64x1_DX_1000.MOMFB.BVT.00.eam.rs.2000-02-03-00000.nc
 462483440 Sep  9 11:46 E3SM.GNUGPU.ne30pg2_r05_oECv3.F2010-MMF1.NXY_64x1_DX_1000.MOMFB.BVT.00.elm.h0.2000-01.nc
4371917364 Sep  9 11:48 E3SM.GNUGPU.ne30pg2_r05_oECv3.F2010-MMF1.NXY_64x1_DX_1000.MOMFB.BVT.00.elm.r.2000-02-03-00000.nc
1030090044 Sep  9 11:48 E3SM.GNUGPU.ne30pg2_r05_oECv3.F2010-MMF1.NXY_64x1_DX_1000.MOMFB.BVT.00.elm.rh0.2000-02-03-00000.nc

I suspect that ELM might have memory issues that cause the hang even during reads (and that are also responsible for the corrupted output file name mentioned above).

Sep 23 '22 19:09 dqwu

I'm noting this on several issues that describe hanging runs on Perlmutter: I've just verified a fix/workaround suggested by @jayeshkrishna with several different compsets (F2010, WCYCL, MMF). You just need to add these environment variables in config_machines.xml:

<env name="MPICH_COLL_OPT_OFF">1</env>
<env name="MPICH_SHARED_MEM_COLL_OPT">0</env>
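For a quick experiment outside of config_machines.xml, for example in a hand-written batch script, the same two settings can be expressed as shell exports. This is only a sketch of the equivalent environment. Whether exported variables reach the model processes depends on how the job is launched, which is an assumption here, not something verified in this thread.

```shell
# Sketch: the same two Cray MPICH settings as the XML entries above,
# set as environment variables before launching the model.
export MPICH_COLL_OPT_OFF=1
export MPICH_SHARED_MEM_COLL_OPT=0

# Confirm that both variables are set in the current environment.
env | grep -E '^MPICH_(COLL_OPT_OFF|SHARED_MEM_COLL_OPT)='
```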

Oct 28 '22 22:10 whannah1
