eamxx: Problem restarting after writing "new" yaml outputs that use horiz remapping file with ne1024 on frontier
For our CESS runs at ne1024 on frontier, we are trying to use some new yaml outputs. Two of them contain a horiz remap file, which may be the issue here. The repo I'm using should be the machines/frontier branch with Luca's remapping-fix branch, bartgol/fix-coarsening-remapper-mask-handling, merged in. The case will actually run 1 completed day (even 2 days) and write restarts, but each time I've tried to restart from those, it hangs.
The new yaml outputs:
./atmchange output_yaml_files="/lustre/orion/cli115/proj-shared/terai/Cess/v1_output/scream_output.Cess.23hourly_QcQiNcNi.yaml"
./atmchange output_yaml_files+="/lustre/orion/cli115/proj-shared/terai/Cess/v1_output/scream_output.Cess.23hourly_QrNrQmBm.yaml"
./atmchange output_yaml_files+="/lustre/orion/cli115/proj-shared/terai/Cess/v1_output/scream_output.Cess.3hourlyAVG_ne120.yaml"
./atmchange output_yaml_files+="/lustre/orion/cli115/proj-shared/terai/Cess/v1_output/scream_output.Cess.3hourlyINST_ne120.yaml"
./atmchange output_yaml_files+="/lustre/orion/cli115/proj-shared/terai/Cess/v1_output/scream_output.Cess.hourly_2Dvars.yaml"
./atmchange output_yaml_files+="/lustre/orion/cli115/proj-shared/terai/Cess/v1_output/scream_output.Cess.monthly_ne1024.yaml"
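For context, the two ne120 streams are the ones that go through the horizontal remapper. A rough sketch of what such an output spec looks like is below; the field names and paths are hypothetical, and the keys are from memory of the eamxx output yaml interface, so treat them as approximate rather than a copy of the actual files:

```yaml
# Illustrative sketch of an eamxx output stream remapped to ne120 -- not the real file.
filename_prefix: output.scream.3hourlyAVG_ne120
Averaging Type: Average
Max Snapshots Per File: 8
Fields:
  Physics PG2:
    Field Names: [T_mid, qv, ps]
# The key that routes the stream through the horizontal remapper
# (hypothetical path):
horiz_remap_file: /path/to/map_ne1024pg2_to_ne120pg2.nc
output_control:
  Frequency: 3
  frequency_units: nhours
```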
Last files written to:
-rw-rw-r-- 1 noel cli115 16411 Jun 29 11:23 homme_atm.log.1365799.230629-112302
-rw-r--r-- 1 noel cli115 47 Jun 29 11:25 mass.out
-rwxr-xr-t 1 noel cli115 105545467928 Jun 29 11:49 output.scream.23hourly_QcQiNcNi.INSTANT.nhours_x23.2019-08-01-00000.nc*
-rw-rw-r-- 1 noel cli115 218903 Jun 29 11:49 e3sm.log.1365799.230629-112302
Last lines in e3sm log:
0: Note: nsplit=-1, while nsplit must be >=1. We know SCREAM does not know nsplit until runtime, so this is fine.
0: Make sure nsplit is set to a valid value before calling prim_advance_subcycle!
0: gfr> nelemd 384 qsize 10
0: compose> nelemd 384 qsize 10 hv_q 1 hv_subcycle_q 6 lim 9 independent_time_steps 1
0: P3_INIT (reading/creating look-up tables) ...
0:
If I log in to a compute node while the job is "hung", this is where I see it:
#0 0x00007fc8c9c400ef in pwrite64 () from /lib64/libpthread.so.0
#1 0x00007fc8cd817fc3 in ADIOI_CRAY_WriteContig () from /opt/cray/pe/lib64/libmpi_cray.so.12
#2 0x00007fc8cd81d4bc in ADIOI_CRAY_WriteStridedColl () from /opt/cray/pe/lib64/libmpi_cray.so.12
#3 0x00007fc8cd7ede59 in MPIOI_File_write_all () from /opt/cray/pe/lib64/libmpi_cray.so.12
#4 0x00007fc8cd7ef791 in PMPI_File_write_at_all () from /opt/cray/pe/lib64/libmpi_cray.so.12
#5 0x00007fc8cf7918a7 in move_file_block () from /opt/cray/pe/parallel-netcdf/1.12.3.1/crayclang/14.0/lib/libpnetcdf_crayclang.so.4
#6 0x00007fc8cf791403 in move_record_vars () from /opt/cray/pe/parallel-netcdf/1.12.3.1/crayclang/14.0/lib/libpnetcdf_crayclang.so.4
#7 0x00007fc8cf790d6f in ncmpio_enddef () from /opt/cray/pe/parallel-netcdf/1.12.3.1/crayclang/14.0/lib/libpnetcdf_crayclang.so.4
#8 0x00007fc8cf6d4f43 in ncmpi_enddef () from /opt/cray/pe/parallel-netcdf/1.12.3.1/crayclang/14.0/lib/libpnetcdf_crayclang.so.4
#9 0x0000000001c4196d in pioc_change_def ()
#10 0x0000000001e6eb72 in eam_pio_enddef$scream_scorpio_interface_ ()
#11 0x0000000001e8cce6 in eam_pio_enddef_c2f ()
#12 0x0000000001e8a380 in scream::scorpio::eam_pio_enddef(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) ()
#13 0x0000000001e94e90 in scream::OutputManager::setup_file(scream::IOFileSpecs&, scream::IOControl const&) ()
#14 0x0000000001e90adf in scream::OutputManager::setup(ekat::Comm const&, ekat::ParameterList const&, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::shared_ptr<scream::FieldManager>, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::shared_ptr<scream::FieldManager> > > > const&, std::shared_ptr<scream::GridsManager const> const&, scream::util::TimeStamp const&, scream::util::TimeStamp const&, bool) ()
#15 0x0000000001ceaa02 in scream::control::AtmosphereDriver::initialize_output_managers() ()
#16 0x00000000006199eb in scream_init_atm ()
#17 0x0000000000614a4a in atm_init_mct$atm_comp_mct_ ()
#18 0x000000000046ade0 in component_init_cc$component_mod_ ()
#19 0x0000000000437cde in cime_init$cime_comp_mod_ ()
#20 0x0000000000468963 in main ()
/lustre/orion/cli115/proj-shared/noel/e3sm_scratch/maf-jun26/t.maf-jun26.F2010-SCREAMv1.ne1024pg2_ne1024pg2.frontier-scream-gpu.n2048t8x6.vth200.era2019.SST.newo.cice0
Luca B, Chris T, and I have been trying to debug this. We have been unable to find a reproducer at lower resolution than ne1024. We have tried a few other things without success and have 2 experiments in the queue.
Last night, one of those experiments worked out; it was a suggestion from Luca to add this to the output yaml:
Restart:
  force_new_file: true
I understand this will write more files. It must be writing data to an output file instead of trying to save it in a restart for the next job.
Note that for a recent Cess run (using the cess branch) on frontier, we forgot to include the restart-force hack for some yaml files; the error seen in e3sm.log is below, in case someone else hits this and it serves as a clue. Adding the restart-force hack allowed it to run.
2148: terminate called after throwing an instance of 'std::logic_error'
2148: what(): /global/cfs/cdirs/e3sm/ndk/repos/se70-jul19/components/eamxx/src/share/io/scream_io_utils.cpp:66: FAIL:
2148: found
2148: Error! Restart requested, but no restart file found in 'rpointer.atm'.
2148: restart case name: output.scream.timestepINST
2148: restart file type: history restart
2148: rpointer content:
2148: ./t.gnu.F2010-SCREAMv1.ne30pg2_ne30pg2.pm-cpu.n022.om.scream.r.INSTANT.nyears_x1.0006-01-01-00000.nc
2148: t.gnu.F2010-SCREAMv1.ne30pg2_ne30pg2.pm-cpu.n022.om.scream.monthly.rhist.AVERAGE.nyears_x1.0006-01-01-00000.nc
For clarity: the default upon restart is to resume filling the last nc file (assuming we did not already reach the max number of snapshots per file). All that force_new_file does is start a new nc file, regardless of how much data was written in the last output file.
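In yaml terms, the workaround amounts to this block in each affected stream (other keys omitted; this just restates the setting discussed above, with the default behavior noted in comments):

```yaml
# Default on restart: the stream resumes filling its last nc file until
# Max Snapshots Per File is reached. With force_new_file set, a fresh nc
# file is opened at every restart instead -- more files, but it avoided
# the hang in our runs.
Restart:
  force_new_file: true
```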
@ndkeen I forgot whether we fixed this or not. Are we still using force_new_file: true in our yaml files?
We used it for the Cess runs on frontier. I am not sure whether there has been a change to master that addressed it. It was used in these files:
frontier% grep force *yaml
scream_output.Cess.3hourlyAVG_ne120.yaml: force_new_file: true
scream_output.Cess.3hourlyINST_ne120.yaml: force_new_file: true
scream_output.Cess.3hourly_ne1024.yaml: force_new_file: true
scream_output.Cess.6hourlyAVG_ne30.yaml: force_new_file: true
scream_output.Cess.6hourlyINST_ne30.yaml: force_new_file: true
scream_output.Cess.ACI_regions_2D.yaml: force_new_file: true
scream_output.Cess.ARM_sites_2D.yaml: force_new_file: true
scream_output.Cess.ARM_sites_3D.yaml: force_new_file: true
scream_output.Cess.hourly_2Dvars.yaml: force_new_file: true
Ok, thanks. I remember we found some issue with remapping, and I didn't recall whether it was fixed. I hope to find the time to get to this at some point...
Revisiting this issue ...
@ndkeen would it be possible to submit one of your cess-v2-like runs like the above, but without the force_new_file: true option? If not, I can try to diagnose it in other setups (e.g., decadal/aerosol), or we might as well run in both setups for more info...?
I'm just going to transfer this to e3sm for now. It may be that someone already knows this is resolved. Otherwise it could take some work to reproduce.