
spurious error on pm-cpu

Open ndkeen opened this issue 1 year ago • 3 comments

With master (Aug 7th), I've been running various short cases to look at pelayout performance for several configurations. I'm seeing one error frequently enough to warrant an issue. I don't yet see any pattern, except that it does not happen without ATM. These are runs on pm-cpu with Intel.

With this repo, I've hit the error 17 times out of 127 attempts. Some are F cases, others WC; some are ne30, others ne120.

The source files named in the error message may not be a good hint about the cause, so I don't want to assume it's an issue with Kokkos. Note that I also got this error when I tried an incorrect resolution.

 129: *** ma_convproc_tend error, massbal2    59 so4_a5           -- maxflux, sumflux, relerr =  4.283824E-14 -3.235764E-20 -7.553448E-07
 129: *** ma_convproc_tend error, massbal2    63 soa_a1           -- maxflux, sumflux, relerr =  8.537513E-13 -7.612242E-19 -8.916229E-07
 129: *** ma_convproc_tend error, massbal2    67 bc_a3            -- maxflux, sumflux, relerr =  1.892707E-15  4.001221E-23  2.114020E-08
 129: *** ma_convproc_tend error, massbal4    69 dst_a1           -- maxflux, sumflux, relerr =  1.071177E-12 -1.206128E-20 -1.125985E-08
...
 129: MPICH Notice [Rank 129] [job id 13886964.0] [Mon Aug 14 08:34:57 2023] [nid004324] - Abort(0) (rank 129 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 0) - process 129
 129:
 129: aborting job:
 129: application called MPI_Abort(MPI_COMM_WORLD, 0) - process 129
 129: forrtl: severe (174): SIGSEGV, segmentation fault occurred
 129: Image              PC                Routine            Line        Source
 129: libpthread-2.31.s  00001529989738C0  Unknown               Unknown  Unknown
 129: e3sm.exe           00000000047F0334  Unknown               Unknown  Unknown
 129: e3sm.exe           00000000048089E5  Unknown               Unknown  Unknown
 129: e3sm.exe           0000000004808BF1  Unknown               Unknown  Unknown
 129: e3sm.exe           0000000001FB9AC1  ~SharedAllocation         309  Kokkos_SharedAlloc.hpp
 129: e3sm.exe           00000000048118C1  Unknown               Unknown  Unknown
 129: e3sm.exe           0000000001FDE1D6  ~IslMpi                   447  compose_slmm_islmpi.hpp
 129: e3sm.exe           0000000001FD9D9C  ~__shared_ptr             154  shared_ptr_base.h
 129: libc-2.31.so       000015299804BAE9  Unknown               Unknown  Unknown
 129: libc-2.31.so       000015299804BC7A  Unknown               Unknown  Unknown
 129: libpmi2.so.0       00001529959EF9E4  PMI_CRAY_Get_base     Unknown  Unknown
 129: libmpi_intel.so.1  000015299AF906E2  Unknown               Unknown  Unknown
 129: libmpi_intel.so.1  000015299969C478  MPI_Abort             Unknown  Unknown
 129: libmpifort_intel.  000015299BB3E69D  MPI_ABORT             Unknown  Unknown
 129: e3sm.exe           000000000394734A  mpas_log_mp_mpas_         844  mpas_log.f90
 129: e3sm.exe           00000000036286D8  seaice_error_mp_s         124  mpas_seaice_error.f90
 129: e3sm.exe           000000000358CBC6  seaice_column_mp_        2092  mpas_seaice_column.f90
 129: e3sm.exe           0000000003577EEA  seaice_column_mp_        1103  mpas_seaice_column.f90
 129: e3sm.exe           00000000034EC899  seaice_time_integ         135  mpas_seaice_time_integration.f90
 129: e3sm.exe           0000000003434976  ice_comp_mct_mp_i        1135  ice_comp_mct.f90
 129: e3sm.exe           000000000044639E  component_mod_mp_         757  component_mod.F90
 129: e3sm.exe           0000000000427C87  cime_comp_mod_mp_        2899  cime_comp_mod.F90
 129: e3sm.exe           000000000044601C  MAIN__                    153  cime_driver.F90
 129: e3sm.exe           000000000042583D  Unknown               Unknown  Unknown
 129: libc-2.31.so       000015299803329D  __libc_start_main     Unknown  Unknown
 129: e3sm.exe           000000000042576A  Unknown               Unknown  Unknown
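
For triage, here is a minimal Python sketch that pulls the `ma_convproc_tend` mass-balance lines out of a log and tabulates them per rank and species. The regular expression is an assumption based only on the four lines shown above, and `parse_massbal` is a hypothetical helper name, not part of E3SM or CIME. For these four lines, the printed `relerr` equals `sumflux / maxflux` to the printed precision, so the sketch also recomputes that ratio as a sanity check.

```python
import re

# Pattern inferred from the log lines above (an assumption, not an E3SM spec):
#  <rank>: *** ma_convproc_tend error, <check> <index> <species> -- maxflux, sumflux, relerr = <a> <b> <c>
MASSBAL_RE = re.compile(
    r"^\s*(?P<rank>\d+):\s+\*\*\*\s+ma_convproc_tend error,\s+(?P<check>\w+)\s+"
    r"(?P<index>\d+)\s+(?P<species>\S+)\s+--\s+maxflux, sumflux, relerr\s*=\s*"
    r"(?P<maxflux>\S+)\s+(?P<sumflux>\S+)\s+(?P<relerr>\S+)"
)


def parse_massbal(lines):
    """Return one dict per ma_convproc_tend mass-balance line; skip other lines."""
    records = []
    for line in lines:
        m = MASSBAL_RE.match(line)
        if m is None:
            continue
        maxflux = float(m.group("maxflux"))
        sumflux = float(m.group("sumflux"))
        records.append({
            "rank": int(m.group("rank")),
            "check": m.group("check"),
            "index": int(m.group("index")),
            "species": m.group("species"),
            "maxflux": maxflux,
            "sumflux": sumflux,
            "relerr": float(m.group("relerr")),
            # Recomputed ratio, for comparison with the printed relerr.
            "ratio": sumflux / maxflux if maxflux != 0.0 else float("nan"),
        })
    return records


# The four lines copied from the log excerpt above.
sample = """\
 129: *** ma_convproc_tend error, massbal2    59 so4_a5           -- maxflux, sumflux, relerr =  4.283824E-14 -3.235764E-20 -7.553448E-07
 129: *** ma_convproc_tend error, massbal2    63 soa_a1           -- maxflux, sumflux, relerr =  8.537513E-13 -7.612242E-19 -8.916229E-07
 129: *** ma_convproc_tend error, massbal2    67 bc_a3            -- maxflux, sumflux, relerr =  1.892707E-15  4.001221E-23  2.114020E-08
 129: *** ma_convproc_tend error, massbal4    69 dst_a1           -- maxflux, sumflux, relerr =  1.071177E-12 -1.206128E-20 -1.125985E-08
"""

records = parse_massbal(sample.splitlines())
for r in records:
    print(r["rank"], r["check"], r["species"], r["relerr"], r["ratio"])
```

To run it on a real log, replace `sample.splitlines()` with an open file handle. Comparing which ranks and species appear in failing versus passing runs might show whether these messages are related to the later abort at all.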

For at least some of the cases, resubmitting worked (I might not have tried all of them). Some of the failing cases use threads and some do not (128x1), but all are built OPT.

ndkeen avatar Aug 14 '23 16:08 ndkeen

I can confirm this happens using maint-2.0 on pm-cpu as well

mahf708 avatar Aug 14 '23 16:08 mahf708

I just hit this error again with something as simple as SMS_P128x1_Ld10.ne4_oQU240.WCYCL1850.pm-cpu_intel

Actually, resubmitting hits the same error. If it's repeatable, this might be a good case to debug further: /pscratch/sd/n/ndk/e3sm_scratch/pm-cpu/me24-aug15/SMS_P128x1_Ld10.ne4_oQU240.WCYCL1850.pm-cpu_intel.20230818_085333_xjedmo

ndkeen avatar Aug 18 '23 16:08 ndkeen

Noting that I have not seen this error in a while.

ndkeen avatar Oct 31 '23 01:10 ndkeen