Simplify PIO calls for interface restartvar in ELM

Open dqwu opened this issue 1 year ago • 10 comments

This PR reduces the number of redundant PIO calls in the ELM restartvar interface.

Fixes #6384

[BFB]

dqwu avatar May 03 '24 21:05 dqwu

PR Preview Action v1.4.7: Deployed preview to https://E3SM-Project.github.io/E3SM/pr-preview/pr-6394/ on branch gh-pages at 2024-05-07 15:56 UTC

github-actions[bot] avatar May 03 '24 22:05 github-actions[bot]

@bishtgautam @jayeshkrishna Some CI tests failed (e.g. ERS_P4.ne4pg2_oQU480.F2010.singularity_gnu) and I could not reproduce the failures on ANL workstations, so I will add a temporary commit with more DEBUG output to track down the issue. That commit will be dropped before this PR is merged to E3SM next.

dqwu avatar May 06 '24 20:05 dqwu

@dqwu Usually, testing with gnu on any machine yields results similar to the singularity_gnu tests. Let me know if you cannot reproduce the failures after your updates, and I can take a look.

mahf708 avatar May 07 '24 17:05 mahf708

> @dqwu usually, testing with gnu on any machine will yield similar results to the singularity_gnu tests. Let me know if you cannot reproduce them after your updates and I can take a look

Thanks for the info. The failures might be related to unsynchronized MPI calls on gh/ci. I have added some mpi_barrier calls to confirm.

dqwu avatar May 07 '24 17:05 dqwu

@mahf708 It seems that gh/ci re-downloads the missing input files every time it is launched. I think it should be configured to download the input files to a persistent directory.

...
  errput: File not found: domainfile = /github/home/projects/e3sm/cesm-inputdata/share/domains/domain.ocn.oQU480.151209.nc, will attempt to download in check_input_data phase
File not found: atm2ocn_fmapname = /github/home/projects/e3sm/cesm-inputdata/cpl/gridmaps/ne4pg2/map_ne4pg2_to_oQU480_mono.200527.nc, will attempt to download in check_input_data phase
...

dqwu avatar May 07 '24 17:05 dqwu

@mahf708 Even with MPI barriers, "NetCDF: Variable not found" is still reproducible. Could you please manually run ERS_P4.ne4pg2_oQU480.F2010.singularity_gnu on the gh/ci machine to confirm? If possible, I can provide a scorpio feature branch for debugging.

 [DEBUG] 3: varid returned by ncd_defvar =           61 , varname = DZSNO
 [DEBUG] 3: varid returned by ncd_defvar =           61 , varname = DZSNO
 [DEBUG] 3: varid returned by ncd_defvar =           61 , varname = DZSNO
 [DEBUG] 3: varid returned by ncd_defvar =           61 , varname = DZSNO
PIO: FATAL ERROR: Aborting... FATAL ERROR: NetCDF: Variable not found (file = ./ERS_P4.ne4pg2_oQU480.F2010.singularity_gnu.20240507_101016_dyjvdw.elm.r.0001-01-07-00000.nc) (/__w/E3SM/E3SM/externals/scorpio/src/clib/pio_getput_int.c: 490)
PIO: FATAL ERROR: Aborting... FATAL ERROR: NetCDF: Variable not found (file = ./ERS_P4.ne4pg2_oQU480.F2010.singularity_gnu.20240507_101016_dyjvdw.elm.r.0001-01-07-00000.nc) (/__w/E3SM/E3SM/externals/scorpio/src/clib/pio_getput_int.c: 490)
Obtained 10 stack frames.

dqwu avatar May 07 '24 17:05 dqwu

@mahf708 I checked the other failed tests. They all seem to fail to find the variable DZSNO when putting an attribute on it.

 [DEBUG] 3: varid returned by ncd_defvar =           61 , varname = DZSNO
 [DEBUG] 3: varid returned by ncd_defvar =           61 , varname = DZSNO
 [DEBUG] 3: varid returned by ncd_defvar =           61 , varname = DZSNO
 [DEBUG] 3: varid returned by ncd_defvar =           61 , varname = DZSNO
PIO: FATAL ERROR: Aborting... FATAL ERROR: NetCDF: Variable not found (file = ./ERP_P4.ne4pg2_oQU480.F2010.singularity_gnu.20240507_101019_ic5h2b.elm.r.0001-01-07-00000.nc) (/__w/E3SM/E3SM/externals/scorpio/src/clib/pio_getput_int.c: 490)
PIO: FATAL ERROR: Aborting... FATAL ERROR: NetCDF: Variable not found (file = ./ERP_P4.ne4pg2_oQU480.F2010.singularity_gnu.20240507_101019_ic5h2b.elm.r.0001-01-07-00000.nc) (/__w/E3SM/E3SM/externals/scorpio/src/clib/pio_getput_int.c: 490)
Obtained 10 stack frames.

dqwu avatar May 07 '24 17:05 dqwu

@mahf708 @jayeshkrishna @bishtgautam Never mind, I think I found the issue. The following code uses vardesc%varid, which is an uninitialized value:

       if (switchdim) then
          status = PIO_put_att(ncid, vardesc%varid, 'switchdim_flag', 1)
       else
          status = PIO_put_att(ncid, vardesc%varid, 'switchdim_flag', 0)
       end if
       status = PIO_put_att(ncid, vardesc%varid, 'switchdim_flag_values', (/0,1/))
       status = PIO_put_att(ncid, vardesc%varid, 'switchdim_flag_is_0', &
            "1st and 2nd dims are same as model representation")
       status = PIO_put_att(ncid, vardesc%varid, 'switchdim_flag_is_1', &
            "1st and 2nd dims are switched from model representation")

dqwu avatar May 07 '24 17:05 dqwu
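The snippet above reads vardesc%varid before anything has set it. One way to avoid that class of bug is to take the variable ID straight from the define step and pass it to the attribute call. The program below is only a minimal sketch of that pattern under stated assumptions: mock_restart_file, mock_defvar, and mock_put_att are hypothetical stand-ins written for this illustration, not PIO or ELM routines, and the range check is an assumed simplification of how an unknown variable ID gets rejected.

```fortran
! Hypothetical mock, not the PIO or ELM API: a "file" is only a count of
! defined variables, and an attribute write checks the variable ID first.
module mock_restart_file
  implicit none
  private
  public :: mock_file_t, mock_defvar, mock_put_att
  public :: MOCK_OK, MOCK_VAR_NOT_FOUND

  integer, parameter :: MOCK_OK = 0
  integer, parameter :: MOCK_VAR_NOT_FOUND = -1   ! stand-in error code

  type :: mock_file_t
     integer :: nvars = 0                          ! variables defined so far
  end type mock_file_t

contains

  ! Define one variable and hand its ID back to the caller.
  subroutine mock_defvar(file, varid)
    type(mock_file_t), intent(inout) :: file
    integer, intent(out) :: varid
    file%nvars = file%nvars + 1
    varid = file%nvars
  end subroutine mock_defvar

  ! Pretend to attach an attribute; fail if varid names no defined variable.
  integer function mock_put_att(file, varid) result(status)
    type(mock_file_t), intent(in) :: file
    integer, intent(in) :: varid
    if (varid >= 1 .and. varid <= file%nvars) then
       status = MOCK_OK
    else
       status = MOCK_VAR_NOT_FOUND
    end if
  end function mock_put_att

end module mock_restart_file

program use_returned_varid
  use mock_restart_file
  implicit none
  type(mock_file_t) :: file
  integer :: varid, status, i

  ! Define a few variables; varid ends up holding the ID of the last one.
  do i = 1, 3
     call mock_defvar(file, varid)
  end do

  ! Use the ID the define step returned, not a separate descriptor
  ! that nothing has filled in yet.
  status = mock_put_att(file, varid)
  if (status /= MOCK_OK) error stop 'attribute write failed'
  print '(a,i0,a,i0)', 'put_att on varid ', varid, ' returned ', status
end program use_returned_varid
```

Because the ID comes from the same call that created the variable, there is no window in which the attribute write can see an unset value.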

@mahf708 I will update this PR to fix the failed tests. It seems that the uninitialized vardesc%varid happens to hold a valid value on ANL workstations but an invalid value on the gh/ci machine. The tests run by gh/ci are helpful for catching this kind of issue, which might not be reproducible on some other machines.

dqwu avatar May 07 '24 18:05 dqwu

The uninitialized vardesc%varid holds different values on different machines.

gh/ci runners:

 [DEBUG]: varid returned by ncd_defvar =           61 , varname = DZSNO, vardesc%varid =            1
 [DEBUG]: varid returned by ncd_defvar =           61 , varname = DZSNO, vardesc%varid =           52
 [DEBUG]: varid returned by ncd_defvar =           61 , varname = DZSNO, vardesc%varid =          102
 [DEBUG]: varid returned by ncd_defvar =           61 , varname = DZSNO, vardesc%varid =          152
...
 [DEBUG]: varid returned by ncd_defvar =          139 , varname = sabs_roof_dir, vardesc%varid =   1707001376
 [DEBUG]: varid returned by ncd_defvar =          139 , varname = sabs_roof_dir, vardesc%varid =    728158864
 [DEBUG]: varid returned by ncd_defvar =          139 , varname = sabs_roof_dir, vardesc%varid =   -130156320
 [DEBUG]: varid returned by ncd_defvar =          139 , varname = sabs_roof_dir, vardesc%varid =   -826678848
...

ANL workstations:

[0]  [DEBUG]: varid returned by ncd_defvar =           61 , varname = DZSNO, vardesc%varid =           20
[1]  [DEBUG]: varid returned by ncd_defvar =           61 , varname = DZSNO, vardesc%varid =           20
[2]  [DEBUG]: varid returned by ncd_defvar =           61 , varname = DZSNO, vardesc%varid =           20
[3]  [DEBUG]: varid returned by ncd_defvar =           61 , varname = DZSNO, vardesc%varid =           20
...
[0]  [DEBUG]: varid returned by ncd_defvar =          139 , varname = sabs_roof_dir, vardesc%varid =           73
[1]  [DEBUG]: varid returned by ncd_defvar =          139 , varname = sabs_roof_dir, vardesc%varid =           73
[2]  [DEBUG]: varid returned by ncd_defvar =          139 , varname = sabs_roof_dir, vardesc%varid =           73
[3]  [DEBUG]: varid returned by ncd_defvar =          139 , varname = sabs_roof_dir, vardesc%varid =           73
...

dqwu avatar May 07 '24 19:05 dqwu
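The [DEBUG] values above can be checked against a simple rule of thumb. The sketch below assumes that variable IDs are handed out in increasing order, so at the moment a variable is defined, no ID above the one just returned exists yet. The lower bound of 0 is chosen to cover both 0-based and 1-based numbering. The program name and the in_range helper are hypothetical; the numbers are copied from the logs.

```fortran
program classify_logged_varids
  implicit none

  ! IDs returned by ncd_defvar, copied from the [DEBUG] lines.
  integer, parameter :: dzsno_id = 61, roof_id = 139

  ! vardesc%varid as printed by each of the four ranks.
  integer, parameter :: ci_dzsno(4)  = [1, 52, 102, 152]
  integer, parameter :: ci_roof(4)   = [1707001376, 728158864, -130156320, -826678848]
  integer, parameter :: anl_dzsno(4) = [20, 20, 20, 20]
  integer, parameter :: anl_roof(4)  = [73, 73, 73, 73]

  integer :: n_ci_dzsno, n_ci_roof, n_anl_dzsno, n_anl_roof

  ! Count how many ranks hold a value that could name an existing variable.
  n_ci_dzsno  = count(in_range(ci_dzsno,  dzsno_id))
  n_ci_roof   = count(in_range(ci_roof,   roof_id))
  n_anl_dzsno = count(in_range(anl_dzsno, dzsno_id))
  n_anl_roof  = count(in_range(anl_roof,  roof_id))

  print '(a,i0,a)', 'gh/ci DZSNO:         ', n_ci_dzsno,  ' of 4 ranks in range'
  print '(a,i0,a)', 'gh/ci sabs_roof_dir: ', n_ci_roof,   ' of 4 ranks in range'
  print '(a,i0,a)', 'ANL   DZSNO:         ', n_anl_dzsno, ' of 4 ranks in range'
  print '(a,i0,a)', 'ANL   sabs_roof_dir: ', n_anl_roof,  ' of 4 ranks in range'

contains

  ! Assumed validity rule: an ID is plausible only if it is non-negative
  ! and no larger than the most recently defined variable's ID.
  elemental logical function in_range(varid, newest_id)
    integer, intent(in) :: varid, newest_id
    in_range = (varid >= 0 .and. varid <= newest_id)
  end function in_range

end program classify_logged_varids
```

Under that assumption the counts are 2, 0, 4, and 4: two of the gh/ci ranks hold an out-of-range value for DZSNO, which is consistent with the "Variable not found" abort, while every ANL value falls in range. Note that an in-range value such as 20 still differs from the 61 returned for DZSNO, so on those machines the attribute would presumably have been attached to a different variable rather than triggering an error.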