
Driver dies with a seg-fault rather than a graceful abort if DRV_RESTART_POINTER file does not exist

Open ekluzek opened this issue 1 year ago • 1 comments

If the file pointed to by DRV_RESTART_POINTER does not exist, the driver fails with a seg-fault rather than exiting gracefully with an error message saying that the file does not exist.

This is in what will be ctsm5.3.016 with cime6.1.49 and cmeps1.0.32

The full description is here:

https://github.com/ESCOMP/CTSM/issues/2914

The tests that fail are:

ERP_P64x2_Ld765.f10_f10_mg37.I2000Clm60BgcCrop.derecho_intel.clm-monthly
ERS_P128x1_Ld765.f10_f10_mg37.I2000Clm60Fates.derecho_intel.clm-FatesColdNoComp

For the first test, cesm.log is essentially the only log generated (drv.log contains a single line):

cesm.log

cat /glade/derecho/scratch/erik/tests_ctsm5316acl/ERP_P64x2_Ld765.f10_f10_mg37.I2000Clm60BgcCrop.derecho_intel.clm-monthly.GC.ctsm5316acl_int/run/case2run/cesm.log.7269007.desched1.241218-154730
dec2343.hsn.de.hpc.ucar.edu 0:  (t_initf) Read in prof_inparm namelist from: drv_in
dec2343.hsn.de.hpc.ucar.edu 0:  (t_initf) Using profile_disable=          F
dec2343.hsn.de.hpc.ucar.edu 0:  (t_initf)       profile_timer=                      4
dec2343.hsn.de.hpc.ucar.edu 0:  (t_initf)       profile_depth_limit=               12
dec2343.hsn.de.hpc.ucar.edu 0:  (t_initf)       profile_detail_limit=               2
dec2343.hsn.de.hpc.ucar.edu 0:  (t_initf)       profile_barrier=          F
dec2343.hsn.de.hpc.ucar.edu 0:  (t_initf)       profile_outpe_num=                  1
dec2343.hsn.de.hpc.ucar.edu 0:  (t_initf)       profile_outpe_stride=               0
dec2343.hsn.de.hpc.ucar.edu 0:  (t_initf)       profile_single_file=      F
dec2343.hsn.de.hpc.ucar.edu 0:  (t_initf)       profile_global_stats=     T
dec2343.hsn.de.hpc.ucar.edu 0:  (t_initf)       profile_ovhd_measurement= F
dec2343.hsn.de.hpc.ucar.edu 0:  (t_initf)       profile_add_detail=       F
dec2343.hsn.de.hpc.ucar.edu 0:  (t_initf)       profile_papi_enable=      F
dec2343.hsn.de.hpc.ucar.edu 0:  ESMF_Finalize: Error closing trace stream
dec2343.hsn.de.hpc.ucar.edu 0: MPICH ERROR [Rank 0] [job id 2dd16cc6-e949-427e-bb59-48726c16f9fa] [Wed Dec 18 15:47:41 2024] [dec2343] - Abort(1) (rank 0 in comm 496): application called MPI_Abort(comm=0x84000002, 1) - process 0
dec2343.hsn.de.hpc.ucar.edu 0: 
dec2343.hsn.de.hpc.ucar.edu 0: forrtl: severe (174): SIGSEGV, segmentation fault occurred
dec2343.hsn.de.hpc.ucar.edu 0: Image              PC                Routine            Line        Source             
dec2343.hsn.de.hpc.ucar.edu 0: libpthread-2.31.s  000015004133C8C0  Unknown               Unknown  Unknown
dec2343.hsn.de.hpc.ucar.edu 0: libmpi_intel.so.1  000015003F2FBE7E  Unknown               Unknown  Unknown
dec2343.hsn.de.hpc.ucar.edu 0: libmpi_intel.so.1  000015003F10A22F  Unknown               Unknown  Unknown
dec2343.hsn.de.hpc.ucar.edu 0: libmpi_intel.so.1  000015003D7376A8  MPI_Abort             Unknown  Unknown
dec2343.hsn.de.hpc.ucar.edu 0: libesmf.so         0000150049332277  _ZN5ESMCI3VMK5abo     Unknown  Unknown
dec2343.hsn.de.hpc.ucar.edu 0: libesmf.so         0000150049330814  _ZN5ESMCI2VM5abor     Unknown  Unknown
dec2343.hsn.de.hpc.ucar.edu 0: libesmf.so         00001500493476E5  c_esmc_vmabort_       Unknown  Unknown
dec2343.hsn.de.hpc.ucar.edu 0: libesmf.so         0000150049B5C7A8  esmf_vmmod_mp_esm     Unknown  Unknown
dec2343.hsn.de.hpc.ucar.edu 0: libesmf.so         00001500499CC1EE  esmf_initmod_mp_e     Unknown  Unknown
dec2343.hsn.de.hpc.ucar.edu 0: cesm.exe           0000000000433ADA  MAIN__                    132  esmApp.F90
dec2343.hsn.de.hpc.ucar.edu 0: cesm.exe           00000000004230FD  Unknown               Unknown  Unknown
dec2343.hsn.de.hpc.ucar.edu 0: libc-2.31.so       000015003C7E129D  __libc_start_main     Unknown  Unknown

drv.log:

cat /glade/derecho/scratch/erik/tests_ctsm5316acl/ERP_P64x2_Ld765.f10_f10_mg37.I2000Clm60BgcCrop.derecho_intel.clm-monthly.GC.ctsm5316acl_int/run/case2run/drv.log.7269007.desched1.241218-154730
  read rpointer file = rpointer.cpl.2001-01-18-00000

ekluzek, Dec 19 '24 00:12

Looking at the code, there is error handling for this as follows:

cesm/driver/esm_time_mod.F90:

          call NUOPC_CompAttributeGet(instance_driver, name='drv_restart_pointer', value=restart_pfile, rc=rc)
          if (ChkErr(rc,__LINE__,u_FILE_u)) return

          if (trim(restart_pfile) /= 'none') then

             if (maintask) then
                write(logunit,*) " read rpointer file = "//trim(restart_pfile)
                inquire( file=trim(restart_pfile), exist=exists)
                if (.not. exists) then
                   rc = ESMF_FAILURE
                   call ESMF_LogWrite(trim(subname)//' ERROR rpointer file '//trim(restart_pfile)//' not found', &
                        ESMF_LOGMSG_ERROR, line=__LINE__, file=__FILE__)
                   return
                endif

So the error message is written only to the ESMF PET log files, but no PET files were created for this case.
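Until the driver reports this on its own, a pre-run check can catch the missing file. The following is only a minimal sketch of such a check, not part of CMEPS or CIME: the function name `check_rpointer` is hypothetical, and it assumes the pointer file is looked up relative to the run directory, as the drv.log line above suggests. It mirrors the driver's logic of skipping the check when the value is 'none'.

```shell
#!/bin/sh
# Hypothetical helper (not part of CMEPS/CIME): verify that the restart
# pointer file exists before the model is launched.
# Usage: check_rpointer RUNDIR POINTER_NAME
check_rpointer() {
    rundir=$1
    pointer=$2

    # 'none' means no restart pointer is requested, as in the driver code.
    if [ "$pointer" = "none" ]; then
        return 0
    fi

    # Assumption: the driver resolves the name relative to the run directory.
    if [ ! -f "$rundir/$pointer" ]; then
        echo "ERROR: rpointer file $rundir/$pointer not found" >&2
        return 1
    fi

    echo "found rpointer file $rundir/$pointer"
}
```

For the failing test above, this would be called with the case2run directory and rpointer.cpl.2001-01-18-00000, and it would return a non-zero status with a readable message instead of leaving the failure to surface as a seg-fault.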

ekluzek, Dec 19 '24 00:12