Simplify PIO calls for interface restartvar in ELM
This PR removes some redundant PIO calls in the ELM restartvar interface.
Fixes #6384
[BFB]
PR Preview Action v1.4.7: :rocket: Deployed preview to https://E3SM-Project.github.io/E3SM/pr-preview/pr-6394/ on branch gh-pages at 2024-05-07 15:56 UTC
@bishtgautam @jayeshkrishna Some CI tests failed (e.g. ERS_P4.ne4pg2_oQU480.F2010.singularity_gnu) and I could not reproduce the failures on ANL workstations, so I will add a temporary commit with more DEBUG info to track down the issue. That commit will be dropped before we merge this PR to E3SM next.
@dqwu usually, testing with gnu on any machine will yield similar results to the singularity_gnu tests. Let me know if you cannot reproduce them after your updates and I can take a look
Thanks for the info. It might be related to some unsynchronized MPI calls on gh/ci. I have added some mpi_barrier calls to confirm.
@mahf708 It seems that gh/ci needs to redownload the missing input files each time it is launched? I think it should be configured to download the input files to a persistent directory.
...
errput: File not found: domainfile = /github/home/projects/e3sm/cesm-inputdata/share/domains/domain.ocn.oQU480.151209.nc, will attempt to download in check_input_data phase
File not found: atm2ocn_fmapname = /github/home/projects/e3sm/cesm-inputdata/cpl/gridmaps/ne4pg2/map_ne4pg2_to_oQU480_mono.200527.nc, will attempt to download in check_input_data phase
...
@mahf708 Even with MPI barriers, "NetCDF: Variable not found" is still reproducible. Could you please manually run ERS_P4.ne4pg2_oQU480.F2010.singularity_gnu on the gh/ci machine to confirm? If possible, I can provide a scorpio feature branch for debugging.
[DEBUG] 3: varid returned by ncd_defvar = 61 , varname = DZSNO
[DEBUG] 3: varid returned by ncd_defvar = 61 , varname = DZSNO
[DEBUG] 3: varid returned by ncd_defvar = 61 , varname = DZSNO
[DEBUG] 3: varid returned by ncd_defvar = 61 , varname = DZSNO
PIO: FATAL ERROR: Aborting... FATAL ERROR: NetCDF: Variable not found (file = ./ERS_P4.ne4pg2_oQU480.F2010.singularity_gnu.20240507_101016_dyjvdw.elm.r.0001-01-07-00000.nc) (/__w/E3SM/E3SM/externals/scorpio/src/clib/pio_getput_int.c: 490)
PIO: FATAL ERROR: Aborting... FATAL ERROR: NetCDF: Variable not found (file = ./ERS_P4.ne4pg2_oQU480.F2010.singularity_gnu.20240507_101016_dyjvdw.elm.r.0001-01-07-00000.nc) (/__w/E3SM/E3SM/externals/scorpio/src/clib/pio_getput_int.c: 490)
Obtained 10 stack frames.
@mahf708 I checked the other failed tests. They all appear to fail to find the variable DZSNO when putting an attribute on it.
[DEBUG] 3: varid returned by ncd_defvar = 61 , varname = DZSNO
[DEBUG] 3: varid returned by ncd_defvar = 61 , varname = DZSNO
[DEBUG] 3: varid returned by ncd_defvar = 61 , varname = DZSNO
[DEBUG] 3: varid returned by ncd_defvar = 61 , varname = DZSNO
PIO: FATAL ERROR: Aborting... FATAL ERROR: NetCDF: Variable not found (file = ./ERP_P4.ne4pg2_oQU480.F2010.singularity_gnu.20240507_101019_ic5h2b.elm.r.0001-01-07-00000.nc) (/__w/E3SM/E3SM/externals/scorpio/src/clib/pio_getput_int.c: 490)
PIO: FATAL ERROR: Aborting... FATAL ERROR: NetCDF: Variable not found (file = ./ERP_P4.ne4pg2_oQU480.F2010.singularity_gnu.20240507_101019_ic5h2b.elm.r.0001-01-07-00000.nc) (/__w/E3SM/E3SM/externals/scorpio/src/clib/pio_getput_int.c: 490)
Obtained 10 stack frames.
@mahf708 @jayeshkrishna @bishtgautam Never mind, I think I found the issue. The following code uses vardesc%varid, which is an uninitialized value:
    if (switchdim) then
       status = PIO_put_att(ncid, vardesc%varid, 'switchdim_flag', 1)
    else
       status = PIO_put_att(ncid, vardesc%varid, 'switchdim_flag', 0)
    end if
    status = PIO_put_att(ncid, vardesc%varid, 'switchdim_flag_values', (/0,1/))
    status = PIO_put_att(ncid, vardesc%varid, 'switchdim_flag_is_0', &
         "1st and 2nd dims are same as model representation")
    status = PIO_put_att(ncid, vardesc%varid, 'switchdim_flag_is_1', &
         "1st and 2nd dims are switched from model representation")
@mahf708 I will update this PR to fix the failed tests. It seems that the uninitialized vardesc%varid happens to hold a valid value on ANL workstations but an invalid one on the gh/ci machine. The tests run by gh/ci helped catch this issue, which might not be reproducible on some other machines.
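For illustration only, here is a minimal, self-contained sketch of the failure mode and one possible way to avoid it. It does not call PIO: `var_desc_t`, `toy_defvar`, and `toy_put_att` are hypothetical stand-ins, the valid id range is made up, and the error value is only meant to mimic netCDF's "Variable not found" code. The actual fix in this PR may differ.

```fortran
program varid_sketch
  implicit none

  ! Hypothetical stand-in for a variable descriptor (not the real PIO type).
  ! The default initializer makes an unset id easy to detect; without it the
  ! component would hold whatever happened to be in memory.
  type :: var_desc_t
     integer :: varid = -1
  end type var_desc_t

  ! Toy file: assume ids 0..61 are defined, so 61 is the newest variable.
  integer, parameter :: nvars_defined = 62

  type(var_desc_t) :: vardesc
  integer :: varid, status

  ! Stand-in for ncd_defvar: returns the id of the newly defined variable.
  varid = toy_defvar()

  ! Problematic pattern: vardesc was never filled in by the define call,
  ! so vardesc%varid does not refer to the variable that was just defined.
  status = toy_put_att(vardesc%varid)
  print *, 'status using vardesc%varid:  ', status

  ! Safer pattern: pass the id that the define call actually returned.
  status = toy_put_att(varid)
  print *, 'status using returned varid: ', status

contains

  ! Hypothetical define routine: hands back the last id in the toy file.
  integer function toy_defvar()
    toy_defvar = nvars_defined - 1
  end function toy_defvar

  ! Hypothetical attribute routine: rejects ids outside the defined range.
  integer function toy_put_att(id)
    integer, intent(in) :: id
    if (id < 0 .or. id >= nvars_defined) then
       toy_put_att = -49   ! assumed to mirror netCDF's "Variable not found"
    else
       toy_put_att = 0
    end if
  end function toy_put_att

end program varid_sketch
```

In the sketch the first call is rejected only because the descriptor has a sentinel default. With a truly uninitialized component the stale value can land inside the valid range, in which case the attribute is silently attached to the wrong variable, which would be consistent with the tests passing on ANL workstations.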
The uninitialized vardesc%varid takes different values on different machines.
gh/ci runners:
[DEBUG]: varid returned by ncd_defvar = 61 , varname = DZSNO, vardesc%varid = 1
[DEBUG]: varid returned by ncd_defvar = 61 , varname = DZSNO, vardesc%varid = 52
[DEBUG]: varid returned by ncd_defvar = 61 , varname = DZSNO, vardesc%varid = 102
[DEBUG]: varid returned by ncd_defvar = 61 , varname = DZSNO, vardesc%varid = 152
...
[DEBUG]: varid returned by ncd_defvar = 139 , varname = sabs_roof_dir, vardesc%varid = 1707001376
[DEBUG]: varid returned by ncd_defvar = 139 , varname = sabs_roof_dir, vardesc%varid = 728158864
[DEBUG]: varid returned by ncd_defvar = 139 , varname = sabs_roof_dir, vardesc%varid = -130156320
[DEBUG]: varid returned by ncd_defvar = 139 , varname = sabs_roof_dir, vardesc%varid = -826678848
...
ANL workstations:
[0] [DEBUG]: varid returned by ncd_defvar = 61 , varname = DZSNO, vardesc%varid = 20
[1] [DEBUG]: varid returned by ncd_defvar = 61 , varname = DZSNO, vardesc%varid = 20
[2] [DEBUG]: varid returned by ncd_defvar = 61 , varname = DZSNO, vardesc%varid = 20
[3] [DEBUG]: varid returned by ncd_defvar = 61 , varname = DZSNO, vardesc%varid = 20
...
[0] [DEBUG]: varid returned by ncd_defvar = 139 , varname = sabs_roof_dir, vardesc%varid = 73
[1] [DEBUG]: varid returned by ncd_defvar = 139 , varname = sabs_roof_dir, vardesc%varid = 73
[2] [DEBUG]: varid returned by ncd_defvar = 139 , varname = sabs_roof_dir, vardesc%varid = 73
[3] [DEBUG]: varid returned by ncd_defvar = 139 , varname = sabs_roof_dir, vardesc%varid = 73
...