PIO: FATAL ERROR when opening symlink files with cases on GCP (older netCDF-4 format causes errors on GCP)

Open ndkeen opened this issue 2 years ago • 16 comments

Is it known what this error message means? On GCP, the following test fails: SMS_D_Ld1.ne30pg2_EC30to60E2r2.WCYCLSSP370.gcp_gnu.allactive-wcprodssp

But the file named in the message seems OK:

  0: PIO: FATAL ERROR: Aborting... FATAL ERROR: Permission denied (file = v2.LR.historical_0101.eam.i.2015-01-01-00000.nc) (/home/noel/wacmy/m06-jul14/externals/scorpio/src/clib/pioc_support.c: 3622)
  0: Obtained 10 stack frames.
  0: /home/noel/e3sm/scratch/m06-jul14/SMS_D_Ld1.ne30pg2_EC30to60E2r2.WCYCLSSP370.gcp_gnu.allactive-wcprodssp.20220714_174642_enc0no/bld/e3sm.exe() [0x5f7f328]
  0: /home/noel/e3sm/scratch/m06-jul14/SMS_D_Ld1.ne30pg2_EC30to60E2r2.WCYCLSSP370.gcp_gnu.allactive-wcprodssp.20220714_174642_enc0no/bld/e3sm.exe() [0x5f7f4ec]
  0: /home/noel/e3sm/scratch/m06-jul14/SMS_D_Ld1.ne30pg2_EC30to60E2r2.WCYCLSSP370.gcp_gnu.allactive-wcprodssp.20220714_174642_enc0no/bld/e3sm.exe() [0x5f7f8aa]
  0: /home/noel/e3sm/scratch/m06-jul14/SMS_D_Ld1.ne30pg2_EC30to60E2r2.WCYCLSSP370.gcp_gnu.allactive-wcprodssp.20220714_174642_enc0no/bld/e3sm.exe() [0x5f864ff]
  0: /home/noel/e3sm/scratch/m06-jul14/SMS_D_Ld1.ne30pg2_EC30to60E2r2.WCYCLSSP370.gcp_gnu.allactive-wcprodssp.20220714_174642_enc0no/bld/e3sm.exe() [0x5f86a48]
  0: /home/noel/e3sm/scratch/m06-jul14/SMS_D_Ld1.ne30pg2_EC30to60E2r2.WCYCLSSP370.gcp_gnu.allactive-wcprodssp.20220714_174642_enc0no/bld/e3sm.exe() [0x5f7d644]
  0: /home/noel/e3sm/scratch/m06-jul14/SMS_D_Ld1.ne30pg2_EC30to60E2r2.WCYCLSSP370.gcp_gnu.allactive-wcprodssp.20220714_174642_enc0no/bld/e3sm.exe() [0x5eda43e]
  0: /home/noel/e3sm/scratch/m06-jul14/SMS_D_Ld1.ne30pg2_EC30to60E2r2.WCYCLSSP370.gcp_gnu.allactive-wcprodssp.20220714_174642_enc0no/bld/e3sm.exe() [0x766200]
  0: /home/noel/e3sm/scratch/m06-jul14/SMS_D_Ld1.ne30pg2_EC30to60E2r2.WCYCLSSP370.gcp_gnu.allactive-wcprodssp.20220714_174642_enc0no/bld/e3sm.exe() [0x6c08ce]
  0: /home/noel/e3sm/scratch/m06-jul14/SMS_D_Ld1.ne30pg2_EC30to60E2r2.WCYCLSSP370.gcp_gnu.allactive-wcprodssp.20220714_174642_enc0no/bld/e3sm.exe() [0x6014cd]
  0: --------------------------------------------------------------------------
  0: MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
  0: with errorcode -1.
  0:
  0: NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
  0: You may or may not see output from other processes, depending on
  0: exactly when Open MPI kills them.
  0: --------------------------------------------------------------------------
  0: PIO: WARNING: Opening file (v2.LR.historical_0101.eam.i.2015-01-01-00000.nc) with iotype=1 (PIO_IOTYPE_PNETCDF) failed (ierr=-77, NetCDF: Access failure). Retrying with iotype=PIO_IOTYPE_NETCDF
  0: slurmstepd: error: *** STEP 5143.0 ON gcp-e3sm10-compute-0-0 CANCELLED AT 2022-07-14T17:52:22 ***
gcp-e3sm10-login0% ls -l /home/inputdata/e3sm_init/V2.SSP370_SSP585.ne30pg2_EC30to60E2r2/v2.LR.historical_0101/2015-01-01-00000/v2.LR.historical_0101.eam.i.2015-01-01-00000.nc
-rw-r----- 1 jason_sarich climate 1207252416 Jun 27 16:13 /home/inputdata/e3sm_init/V2.SSP370_SSP585.ne30pg2_EC30to60E2r2/v2.LR.historical_0101/2015-01-01-00000/v2.LR.historical_0101.eam.i.2015-01-01-00000.nc
gcp-e3sm10-login0% ncdump -k /home/inputdata/e3sm_init/V2.SSP370_SSP585.ne30pg2_EC30to60E2r2/v2.LR.historical_0101/2015-01-01-00000/v2.LR.historical_0101.eam.i.2015-01-01-00000.nc
64-bit offset

ndkeen avatar Jul 14 '22 18:07 ndkeen

Looks like a file access permission issue to me. @sarich, can this file's permissions be changed to 755 to see if that helps?

jayeshkrishna avatar Jul 14 '22 18:07 jayeshkrishna

I did notice that the file permissions differ for some files in that directory. But as you can see, I can ncdump (and copy) the file.

gcp-e3sm10-login0% ls -l /home/inputdata/e3sm_init/V2.SSP370_SSP585.ne30pg2_EC30to60E2r2/v2.LR.historical_0101/2015-01-01-00000/
total 7196560
drwxrwsr-x 2 jason_sarich climate       4096 Jul 14 17:59 ./
drwxrwsr-x 3 jason_sarich climate         30 Jun 27 14:42 ../
-rw-rw-r-- 1 jason_sarich climate        541 Jun 27 14:50 rpointer.atm
-rw-rw-r-- 1 jason_sarich climate        257 Jun 27 14:50 rpointer.drv
-rw-rw-r-- 1 jason_sarich climate         21 Jun 27 14:50 rpointer.ice
-rw-rw-r-- 1 jason_sarich climate        257 Jun 27 14:50 rpointer.lnd
-rw-rw-r-- 1 jason_sarich climate         21 Jun 27 14:50 rpointer.ocn
-rw-rw-r-- 1 jason_sarich climate        257 Jun 27 14:50 rpointer.rof
-rw-r----- 1 jason_sarich climate  319453040 Jun 27 16:05 v2.LR.historical_0101.cpl.r.2015-01-01-00000.nc
-rw-r----- 1 jason_sarich climate 1207252416 Jun 27 16:13 v2.LR.historical_0101.eam.i.2015-01-01-00000.nc
-rw-r----- 1 jason_sarich climate 3400387940 Jun 27 16:32 v2.LR.historical_0101.eam.r.2015-01-01-00000.nc
-rw-r----- 1 jason_sarich climate   11583232 Jun 27 16:32 v2.LR.historical_0101.eam.rs.2015-01-01-00000.nc
-rw-r----- 1 jason_sarich climate    1649512 Jun 27 16:32 v2.LR.historical_0101.elm.h1.2015-01-01-00000.nc
-rw-r----- 1 jason_sarich climate  389069336 Jun 27 16:35 v2.LR.historical_0101.elm.r.2015-01-01-00000.nc
-rw-r----- 1 jason_sarich climate     468796 Jun 27 16:35 v2.LR.historical_0101.elm.rh0.2015-01-01-00000.nc
-rw-r----- 1 jason_sarich climate     468796 Jun 27 16:35 v2.LR.historical_0101.elm.rh1.2015-01-01-00000.nc
-rw-rw---- 1 jason_sarich climate   45625888 Jun 27 15:19 v2.LR.historical_0101.mosart.r.2015-01-01-00000.nc
-rw-rw---- 1 jason_sarich climate     119856 Jun 27 15:27 v2.LR.historical_0101.mosart.rh0.2015-01-01-00000.nc
-rw-rw---- 1 jason_sarich climate     113712 Jun 27 15:27 v2.LR.historical_0101.mosart.rh1.2015-01-01-00000.nc
-rw-rw---- 1 jason_sarich climate 1238410072 Jun 27 15:11 v2.LR.historical_0101.mpaso.rst.2015-01-01_00000.nc
-rw-rw---- 1 jason_sarich climate  754605984 Jun 27 15:01 v2.LR.historical_0101.mpassi.rst.2015-01-01_00000.nc

ndkeen avatar Jul 14 '22 18:07 ndkeen

I don't think it's a file permission problem; this is failing for me as well. I am getting a lock file for that .nc file in the run directory, but I don't know whether that's a cause or an effect of this error.

sarich avatar Jul 14 '22 19:07 sarich

I did notice those .lock files in one of the inputdata directories. I removed them (I doubt that helps), then added a printf just before the line of code that aborts, and now I get a different message:

  0: ndk filename=v2.LR.historical_0101.eam.i.2015-01-01-00000.nc
  0: ndk filename=/home/inputdata/atm/cam/topo/USGS-gtopo30_ne30np4pg2_16xdel2.c20200108.nc
  0:
  0:  getMetaSchedule: tmpP:           12           1          11          35         135          36           1          70         136         360           2         361         105         449         104         459
  0:  initializing elements...
  0:  ERROR: HANDLE_NCERR
  0: #0  0x5c10e39 in __shr_abort_mod_MOD_shr_abort_backtrace
  0:    at /home/noel/wacmy/m06-jul14/share/util/shr_abort_mod.F90:104
  0: #1  0x5c1101b in __shr_abort_mod_MOD_shr_abort_abort
  0:    at /home/noel/wacmy/m06-jul14/share/util/shr_abort_mod.F90:61
  0: #2  0x72ef35 in __cam_abortutils_MOD_endrun
  0:    at /home/noel/wacmy/m06-jul14/components/eam/src/utils/cam_abortutils.F90:59
  0: #3  0x19b9cad in __error_messages_MOD_handle_ncerr
  0:    at /home/noel/wacmy/m06-jul14/components/eam/src/control/error_messages.F90:113
  0: #4  0x1046594 in ghg_ramp_read
  0:    at /home/noel/wacmy/m06-jul14/components/eam/src/physics/cam/chem_surfvals.F90:309
  0: #5  0x1047ac9 in __chem_surfvals_MOD_chem_surfvals_init
  0:    at /home/noel/wacmy/m06-jul14/components/eam/src/physics/cam/chem_surfvals.F90:206
  0: #6  0x1705869 in __inital_MOD_cam_initial
  0:    at /home/noel/wacmy/m06-jul14/components/eam/src/dynamics/se/inital.F90:58
  0: #7  0x60150f in __cam_comp_MOD_cam_init
  0:    at /home/noel/wacmy/m06-jul14/components/eam/src/control/cam_comp.F90:159
  0: #8  0x5efc55 in __atm_comp_mct_MOD_atm_init_mct
  0:    at /home/noel/wacmy/m06-jul14/components/eam/src/cpl/atm_comp_mct.F90:320
  0: #9  0x45009a in __component_mod_MOD_component_init_cc
  0:    at /home/noel/wacmy/m06-jul14/driver-mct/main/component_mod.F90:248
  0: #10  0x43765b in __cime_comp_mod_MOD_cime_init
  0:    at /home/noel/wacmy/m06-jul14/driver-mct/main/cime_comp_mod.F90:1438
  0: #11  0x448ed3 in cime_driver
  0:    at /home/noel/wacmy/m06-jul14/driver-mct/main/cime_driver.F90:122
  0: #12  0x449013 in main
  0:    at /home/noel/wacmy/m06-jul14/driver-mct/main/cime_driver.F90:23

That file seems OK as well:

gcp-e3sm10-login0% ls -l /home/inputdata/atm/cam/topo/USGS-gtopo30_ne30np4pg2_16xdel2.c20200108.nc
-rw-rw-r-- 1 noel climate 1600324 Jan 24  2020 /home/inputdata/atm/cam/topo/USGS-gtopo30_ne30np4pg2_16xdel2.c20200108.nc
gcp-e3sm10-login0% ncdump -k /home/inputdata/atm/cam/topo/USGS-gtopo30_ne30np4pg2_16xdel2.c20200108.nc
64-bit offset

ndkeen avatar Jul 14 '22 20:07 ndkeen

OK, the lock files might just be a symptom, not the cause, of the issue.

One thing I do notice is that the filename does not include the full path.

PIO: WARNING: Opening file (v2.LR.historical_0101.eam.i.2015-01-01-00000.nc) with iotype=1 ...

@ndkeen what are the two debug outputs in your run (at the beginning of the output, " ndk filename= ...") that print out the filename?

jayeshkrishna avatar Jul 14 '22 21:07 jayeshkrishna

The source change I made is in src/clib/pioc_support.c:

@@ -3619,6 +3619,7 @@ int PIOc_openfile_retry(int iosysid, int *ncidp, int *iotype, const char *filena
         return check_mpi(NULL, file, mpierr, __FILE__, __LINE__);
     }
 
+    printf("ndk filename=%s\n", filename);
     ierr = check_netcdf(ios, file, ierr, __FILE__, __LINE__);
     /* If there was an error, free allocated memory and deal with the error. */

ndkeen avatar Jul 14 '22 21:07 ndkeen

@wlin7 : Do you know how the above filename (v2.LR.historical_0101.eam.i.2015-01-01-00000.nc) gets picked up and why it does not have the full path?

jayeshkrishna avatar Jul 14 '22 21:07 jayeshkrishna

Note that in the run directory there are multiple soft links that point to full paths (which seem OK), including one for the netCDF file noted above:

lrwxrwxrwx  1 noel noel        150 Jul 14 17:46 v2.LR.historical_0101.cpl.r.2015-01-01-00000.nc -> /home/inputdata/e3sm_init/V2.SSP370_SSP585.ne30pg2_EC30to60E2r2/v2.LR.historical_0101/2015-01-01-00000/v2.LR.historical_0101.cpl.r.2015-01-01-00000.nc
lrwxrwxrwx  1 noel noel        150 Jul 14 17:46 v2.LR.historical_0101.eam.i.2015-01-01-00000.nc -> /home/inputdata/e3sm_init/V2.SSP370_SSP585.ne30pg2_EC30to60E2r2/v2.LR.historical_0101/2015-01-01-00000/v2.LR.historical_0101.eam.i.2015-01-01-00000.nc
-rw-r--r--  1 noel noel          8 Jul 14 19:55 v2.LR.historical_0101.eam.i.2015-01-01-00000.nc-337969152-4968.lock
lrwxrwxrwx  1 noel noel        150 Jul 14 17:46 v2.LR.historical_0101.eam.r.2015-01-01-00000.nc -> /home/inputdata/e3sm_init/V2.SSP370_SSP585.ne30pg2_EC30to60E2r2/v2.LR.historical_0101/2015-01-01-00000/v2.LR.historical_0101.eam.r.2015-01-01-00000.nc
lrwxrwxrwx  1 noel noel        151 Jul 14 17:46 v2.LR.historical_0101.eam.rs.2015-01-01-00000.nc -> /home/inputdata/e3sm_init/V2.SSP370_SSP585.ne30pg2_EC30to60E2r2/v2.LR.historical_0101/2015-01-01-00000/v2.LR.historical_0101.eam.rs.2015-01-01-00000.nc
lrwxrwxrwx  1 noel noel        151 Jul 14 17:46 v2.LR.historical_0101.elm.h1.2015-01-01-00000.nc -> /home/inputdata/e3sm_init/V2.SSP370_SSP585.ne30pg2_EC30to60E2r2/v2.LR.historical_0101/2015-01-01-00000/v2.LR.historical_0101.elm.h1.2015-01-01-00000.nc
lrwxrwxrwx  1 noel noel        150 Jul 14 17:46 v2.LR.historical_0101.elm.r.2015-01-01-00000.nc -> /home/inputdata/e3sm_init/V2.SSP370_SSP585.ne30pg2_EC30to60E2r2/v2.LR.historical_0101/2015-01-01-00000/v2.LR.historical_0101.elm.r.2015-01-01-00000.nc
lrwxrwxrwx  1 noel noel        152 Jul 14 17:46 v2.LR.historical_0101.elm.rh0.2015-01-01-00000.nc -> /home/inputdata/e3sm_init/V2.SSP370_SSP585.ne30pg2_EC30to60E2r2/v2.LR.historical_0101/2015-01-01-00000/v2.LR.historical_0101.elm.rh0.2015-01-01-00000.nc
lrwxrwxrwx  1 noel noel        152 Jul 14 17:46 v2.LR.historical_0101.elm.rh1.2015-01-01-00000.nc -> /home/inputdata/e3sm_init/V2.SSP370_SSP585.ne30pg2_EC30to60E2r2/v2.LR.historical_0101/2015-01-01-00000/v2.LR.historical_0101.elm.rh1.2015-01-01-00000.nc
lrwxrwxrwx  1 noel noel        153 Jul 14 17:46 v2.LR.historical_0101.mosart.r.2015-01-01-00000.nc -> /home/inputdata/e3sm_init/V2.SSP370_SSP585.ne30pg2_EC30to60E2r2/v2.LR.historical_0101/2015-01-01-00000/v2.LR.historical_0101.mosart.r.2015-01-01-00000.nc
lrwxrwxrwx  1 noel noel        155 Jul 14 17:46 v2.LR.historical_0101.mosart.rh0.2015-01-01-00000.nc -> /home/inputdata/e3sm_init/V2.SSP370_SSP585.ne30pg2_EC30to60E2r2/v2.LR.historical_0101/2015-01-01-00000/v2.LR.historical_0101.mosart.rh0.2015-01-01-00000.nc
lrwxrwxrwx  1 noel noel        155 Jul 14 17:46 v2.LR.historical_0101.mosart.rh1.2015-01-01-00000.nc -> /home/inputdata/e3sm_init/V2.SSP370_SSP585.ne30pg2_EC30to60E2r2/v2.LR.historical_0101/2015-01-01-00000/v2.LR.historical_0101.mosart.rh1.2015-01-01-00000.nc
lrwxrwxrwx  1 noel noel        154 Jul 14 17:46 v2.LR.historical_0101.mpaso.rst.2015-01-01_00000.nc -> /home/inputdata/e3sm_init/V2.SSP370_SSP585.ne30pg2_EC30to60E2r2/v2.LR.historical_0101/2015-01-01-00000/v2.LR.historical_0101.mpaso.rst.2015-01-01_00000.nc
lrwxrwxrwx  1 noel noel        155 Jul 14 17:46 v2.LR.historical_0101.mpassi.rst.2015-01-01_00000.nc -> /home/inputdata/e3sm_init/V2.SSP370_SSP585.ne30pg2_EC30to60E2r2/v2.LR.historical_0101/2015-01-01-00000/v2.LR.historical_0101.mpassi.rst.2015-01-01_00000.nc

ndkeen avatar Jul 14 '22 21:07 ndkeen

@ndkeen : can you ncdump on the file using the soft link?

jayeshkrishna avatar Jul 14 '22 21:07 jayeshkrishna

Yes.

gcp-e3sm10-login0% ncdump -k run/v2.LR.historical_0101.eam.i.2015-01-01-00000.nc
64-bit offset
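The manual checks above (list the link, look at the target's permissions, try to read it) can be collected into one small helper. This is only a sketch in Python: `describe_input` is a hypothetical name, not part of E3SM or CIME, and it reports what the current user on the current node can see, which may differ from what a compute-node process sees.

```python
import os
import stat

def describe_input(path):
    """Summarise a run-directory entry: link status, resolved target, mode, readability."""
    target = os.path.realpath(path)          # follow the soft link(s) to the real file
    info = {
        "is_symlink": os.path.islink(path),
        "target": target,
        "exists": os.path.exists(target),    # False for a dangling link
        "mode": None,
        "readable": False,
    }
    if info["exists"]:
        # Same permission string that `ls -l` prints, e.g. "-rw-r-----"
        info["mode"] = stat.filemode(os.stat(target).st_mode)
        # Checked for the calling user, like an open-for-read would be
        info["readable"] = os.access(target, os.R_OK)
    return info
```

Running it over every `*.nc` entry in the run directory would show at a glance whether any link is dangling or points at a file the job's user cannot read.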

ndkeen avatar Jul 14 '22 22:07 ndkeen

It does seem like the actual error might involve ndk filename=/home/inputdata/atm/cam/topo/USGS-gtopo30_ne30np4pg2_16xdel2.c20200108.nc, as that's the last file it tries to read before the abort.

ndkeen avatar Jul 14 '22 22:07 ndkeen

I just tried SMS_D_Ld1.ne30pg2_EC30to60E2r2.WCYCLSSP370.gcp_gnu (i.e., without the mods). I think this test does not try to make the symlinks. It also fails, with an error similar to the one above. I think the file at issue is now filename=/home/inputdata/atm/cam/topo/USGS-gtopo30_ne30np4pg2_16xdel2.c20200108.nc

  0:  initializing elements...
  0:  ERROR: HANDLE_NCERR
  0: #0  0x598dcfc in __shr_abort_mod_MOD_shr_abort_backtrace
  0:    at /home/noel/wacmy/m07-jul20/share/util/shr_abort_mod.F90:104
  0: #1  0x598dede in __shr_abort_mod_MOD_shr_abort_abort
  0:    at /home/noel/wacmy/m07-jul20/share/util/shr_abort_mod.F90:61
  0: #2  0x72f43a in __cam_abortutils_MOD_endrun
  0:    at /home/noel/wacmy/m07-jul20/components/eam/src/utils/cam_abortutils.F90:59
  0: #3  0x191b5a5 in __error_messages_MOD_handle_ncerr
  0:    at /home/noel/wacmy/m07-jul20/components/eam/src/control/error_messages.F90:113
  0: #4  0x1046a99 in ghg_ramp_read
  0:    at /home/noel/wacmy/m07-jul20/components/eam/src/physics/cam/chem_surfvals.F90:309
  0: #5  0x1047fce in __chem_surfvals_MOD_chem_surfvals_init
  0:    at /home/noel/wacmy/m07-jul20/components/eam/src/physics/cam/chem_surfvals.F90:206
  0: #6  0x1667162 in __inital_MOD_cam_initial
  0:    at /home/noel/wacmy/m07-jul20/components/eam/src/dynamics/se/inital.F90:58
  0: #7  0x601a14 in __cam_comp_MOD_cam_init
  0:    at /home/noel/wacmy/m07-jul20/components/eam/src/control/cam_comp.F90:159
  0: #8  0x5f015a in __atm_comp_mct_MOD_atm_init_mct
  0:    at /home/noel/wacmy/m07-jul20/components/eam/src/cpl/atm_comp_mct.F90:320
  0: #9  0x44ffea in __component_mod_MOD_component_init_cc
  0:    at /home/noel/wacmy/m07-jul20/driver-mct/main/component_mod.F90:248
  0: #10  0x4375ab in __cime_comp_mod_MOD_cime_init
  0:    at /home/noel/wacmy/m07-jul20/driver-mct/main/cime_comp_mod.F90:1438
  0: #11  0x448e23 in cime_driver
  0:    at /home/noel/wacmy/m07-jul20/driver-mct/main/cime_driver.F90:122
  0: #12  0x448f63 in main
  0:    at /home/noel/wacmy/m07-jul20/driver-mct/main/cime_driver.F90:23
  0: --------------------------------------------------------------------------

ndkeen avatar Jul 20 '22 21:07 ndkeen

OK, I think I figured this out. The place where it's stopping is not right before a check_netcdf() call, but in the GHG routine (ghg_ramp_read). Adding a print to see the file it's actually trying to open, and then checking that file's format, yields:

gcp-e3sm10-login0% ncdump -k /home/inputdata/atm/cam/ggas/GHG_CMIP_SSP370-1-2-1_Annual_Global_2015-2500_c20210509.nc
netCDF-4

Another netCDF-4 formatted file. What is the best way to get these files changed to the correct format? And, more importantly, how do we ensure that no new files are added to inputdata with old formats?

Here is where the file is opened: components/eam/src/physics/cam/chem_surfvals.F90

    if (masterproc) then
      call getfil (bndtvghg, locfn, 0)
+     print*, "ndk ghg_ramp_read opening file=", trim(locfn) 
      call handle_ncerr( nf90_open (trim(locfn), NF90_NOWRITE, ncid),subname,__LINE__)

Could we perhaps have a standard routine to open files that, when DEBUG (or some other flag) is set, prints the name of each file before it is opened, to reduce debugging time?
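To make the idea concrete, here is a minimal sketch of such a wrapper. It is written in Python only for brevity; in E3SM the equivalent would be a Fortran routine wrapping nf90_open or the PIO open call. The name `traced_open` and the `DEBUG_FILE_OPEN` environment variable are hypothetical and do not exist in the code base.

```python
import os
import sys

def traced_open(path, mode="rb", opener=open, log=sys.stderr):
    """Open a file through one common entry point, announcing it first when tracing is on."""
    # DEBUG_FILE_OPEN is a made-up switch standing in for a DEBUG build flag
    if os.environ.get("DEBUG_FILE_OPEN"):
        # Flush immediately so the message is not lost if the open aborts the run
        print(f"opening file={path} mode={mode}", file=log, flush=True)
    return opener(path, mode)
```

The point of the design is that the message is emitted before the open is attempted, so the last "opening file=" line in the log names the file that triggered the failure.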

Same issue as https://github.com/E3SM-Project/E3SM/issues/4767

In fact I see I had already cheated by overwriting one of the files in this dir to use cdf5:

gcp-e3sm10-login0% ls -l /home/inputdata/atm/cam/ggas/
total 271312
drwxrwsr-x  2 noel         climate       324 Jun 27 15:41 ./
drwxrwsr-x 15 noel         climate       225 Jan 28 21:17 ../
-rw-rw-r--  1 noel         climate    104972 Jan 31 19:18 GHG_CMIP-1-2-0_Annual_Global_0000-2014_c20180105.nc
-rw-rw-r--  1 noel         climate    862994 Jan 31 18:57 GHG_CMIP-1-2-0_Annual_Global_0000-2014_c20180105.nc-original
-rw-rw-r--  1 jason_sarich climate     89531 Oct  7  2021 GHG_CMIP_SSP370-1-2-1_Annual_Global_2015-2500_c20210509.nc
-rw-rw-r--  1 jason_sarich climate 276760640 Nov 12  2020 ne30pg2_eam_CO2-em-anthro_input4MIPs_emissions_CMIP_CEDS-2017-05-18_gr_175001-201412_c20201111.nc


gcp-e3sm10-login0% ncdump -k /home/inputdata/atm/cam/ggas/GHG_CMIP-1-2-0_Annual_Global_0000-2014_c20180105.nc-original
netCDF-4 classic model
gcp-e3sm10-login0% ncdump -k /home/inputdata/atm/cam/ggas/GHG_CMIP-1-2-0_Annual_Global_0000-2014_c20180105.nc
cdf5

Not sure why I originally saw the permission error; it seems to have gone away after I removed certain .lock files.

ndkeen avatar Jul 20 '22 23:07 ndkeen

Thanks @ndkeen ! We might need a script that scans (weekly? daily?) the file format of files in the inputdata directory to ensure we don't have these issues with NetCDF4 files.
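A minimal sketch of such a scan, offered as an illustration and not as an existing E3SM tool: it classifies each `.nc` file by its leading bytes instead of calling `ncdump -k`. Classic-family files start with `CDF` followed by a version byte (1 classic, 2 64-bit offset, 5 cdf5), while netCDF-4 files are HDF5 files and start with the HDF5 signature. The function names are hypothetical. Two known limits: it cannot tell "netCDF-4" from "netCDF-4 classic model", and it only checks for the HDF5 signature at offset 0.

```python
import os
from pathlib import Path

HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"   # netCDF-4 files are HDF5 underneath
NETCDF4 = "netCDF-4 (HDF5)"
CLASSIC_KINDS = {
    b"CDF\x01": "classic",
    b"CDF\x02": "64-bit offset",
    b"CDF\x05": "cdf5",
}

def netcdf_kind(path):
    """Return a format label from the file's first bytes, or None if unrecognised."""
    with open(path, "rb") as f:
        head = f.read(8)
    if head == HDF5_SIGNATURE:
        return NETCDF4
    return CLASSIC_KINDS.get(head[:4])

def find_netcdf4_files(root):
    """Yield paths of .nc files under root whose contents look like netCDF-4."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            if not name.endswith(".nc"):
                continue
            path = Path(dirpath) / name
            try:
                kind = netcdf_kind(path)
            except OSError:
                continue  # unreadable file or dangling link: skip, keep scanning
            if kind == NETCDF4:
                yield path
```

A cron job could call `find_netcdf4_files("/home/inputdata")` and mail the resulting list; reading only eight bytes per file keeps the scan cheap even on a large tree.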

jayeshkrishna avatar Jul 21 '22 14:07 jayeshkrishna

Well, several years ago I tabulated the file formats of all files in inputdata. I think we just need to fix the commonly used files for now (how?) and find a way to make sure any new files added have the right format (I'm sure they get added in a variety of ways, but especially those that are generated with a script). But if my goal were to make GCP better for more general use, I would try again to find a way for it to read those older formatted files. I did try previously and found that netcdf was configured in the same way as on cori...

ndkeen avatar Jul 21 '22 17:07 ndkeen

@sarich had gone through the input files and converted them to 64-bit offset a while back (https://acme-climate.atlassian.net/wiki/spaces/EIDMG/pages/921436224/Netcdf+file+conversion). So you can use the ncks (netCDF Kitchen Sink) command to do the conversion, as described on that page.

One of the reasons we avoid the NetCDF4 file format is the issues (hangs, crashes) that we have had in the past with the NetCDF software stack. So you might be better off just converting the files.
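As a rough sketch of what such a conversion step could look like when scripted, the helper below builds an `ncks` command line using NCO's `--fl_fmt` option. The exact options used on the wiki page are not shown in this thread, so treat the choice of `--fl_fmt` as an assumption; the function names are hypothetical. Converting away from netCDF-4 only works for files that do not rely on netCDF-4-only features.

```python
import shlex
import subprocess

# Map the labels printed by `ncdump -k` to values of NCO's --fl_fmt option
NCKS_FORMATS = {
    "classic": "classic",
    "64-bit offset": "64bit_offset",
    "cdf5": "64bit_data",
}

def ncks_convert_command(src, dst, kind="64-bit offset"):
    """Build the argv list for converting src into dst in the requested format."""
    return ["ncks", f"--fl_fmt={NCKS_FORMATS[kind]}", str(src), str(dst)]

def convert(src, dst, kind="64-bit offset", dry_run=True):
    """Print the command (dry run) or execute it; returns the argv list either way."""
    cmd = ncks_convert_command(src, dst, kind)
    if dry_run:
        print(shlex.join(cmd))
    else:
        subprocess.run(cmd, check=True)  # raises CalledProcessError if ncks fails
    return cmd
```

Writing to a new file and keeping the original alongside (as was done with the `-original` copy in the ggas directory) makes it easy to back out if a converted file turns out to be unusable.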

jayeshkrishna avatar Jul 21 '22 17:07 jayeshkrishna

We may be beyond the original errors for this issue and have some new ones.

For the test SMS_PMx1_D_Ld1.ne30pg2_EC30to60E2r2.WCYCLSSP370.gcp_gnu.allactive-wcprodssp (which likely behaves like the same test without PMx1), I see that the test completes if I set

finidat=' '

in user_nl_elm.

Also, if I leave that line out (i.e., the original setup) and change this flag to true: CHECK_FINIDAT_FSURDAT_CONSISTENCY = .true.

I get the following error:


 ERROR: Initial conditions file (finidat) was generated from a different surface dataset
 than the one being used for the current simulation (fsurdat).
 Current fsurdat: surfdata_ne30np4.pg2_SSP3_RCP70_simyr2015_c220420.nc
 Surface dataset used to generate initial conditions file: surfdata_ne30np4.pg2_simyr1850_c201210.nc

 Possible solutions to this problem:
 (1) Make sure you are using the correct surface dataset and initial conditions file
 (2) If you generated the surface dataset and/or initial conditions file yourself,
     then you may need to manually change the surface_dataset global attribute on the
     initial conditions file (e.g., using ncatted)
 (3) If you are confident that you are using the correct surface dataset and initial conditions file,
     yet are still experiencing this error, then you can bypass this check by setting:
       check_finidat_fsurdat_consistency = .false.
     in user_nl_elm
  
 ENDRUN:ERROR in /home/noel/wacmy/nexty-sep6/components/elm/src/main/restFileMod.F90 at line 1336      

Which may just be saying "turn it back to false" ...

Note that SMS_D_Ld1.ne30pg2_EC30to60E2r2.WCYCL1850.gcp_gnu.allactive-wcprod completes on GCP

The actual error message I see with the original test is now:

127: #0  0x2ad1dabd13ff in ???
127: #1  0x2ad1d9fc6659 in get_float_string
127:    at /tmp/root/spack-stage/spack-stage-gcc-11.1.0-bm2qpp7qjfzn3evldu54qpltiwaqh2ue/spack-src/libgfortran/io/write_float.def:1114
127: #2  0x2ad1d9fc8e55 in list_formatted_write_scalar
127:    at /tmp/root/spack-stage/spack-stage-gcc-11.1.0-bm2qpp7qjfzn3evldu54qpltiwaqh2ue/spack-src/libgfortran/io/write.c:1903
127: #3  0x268c5b5 in hist_update_hbuf_field_1d
127:    at /home/noel/wacmy/nexty-sep6/components/elm/src/main/histFileMod.F90:1216
127: #4  0x268e9be in __histfilemod_MOD_hist_update_hbuf
127:    at /home/noel/wacmy/nexty-sep6/components/elm/src/main/histFileMod.F90:1009
127: #5  0x2521a9c in __elm_driver_MOD_elm_drv
127:    at /home/noel/wacmy/nexty-sep6/components/elm/src/main/elm_driver.F90:1449
127: #6  0x24e91fb in __lnd_comp_mct_MOD_lnd_run_mct
127:    at /home/noel/wacmy/nexty-sep6/components/elm/src/cpl/lnd_comp_mct.F90:508
127: #7  0x44c8f9 in __component_mod_MOD_component_run

I've added some print* statements and see that it is indeed the field() value that is the issue.

With the prints, I see the failure occurs when trying to write data for this field: name=TWS_MONTH_BEGIN

ndkeen avatar Sep 06 '22 20:09 ndkeen

Hi @ndkeen, for this SSP test we do intend to use the finidat and fsurdat as specified; to make it work, we need CHECK_FINIDAT_FSURDAT_CONSISTENCY = .false.

wlin7 avatar Sep 06 '22 20:09 wlin7

Wuyin: OK, that's fine, and the error message (which points to user_nl_elm) suggests that as well. I thought it might help in debugging this issue, but apparently not.

ndkeen avatar Sep 06 '22 21:09 ndkeen

Clearly we are hitting a different error than the original one. It's not obvious how we got beyond that original error, but I will make a new issue and close this one.

ndkeen avatar Sep 08 '22 23:09 ndkeen