E3SM
spurious error on pm-cpu
With master (Aug 7th), I've been running various short cases to look at PE-layout performance for several configurations. One error shows up frequently enough to warrant an issue. I don't yet see any pattern, except that it does not happen without ATM. These are runs on pm-cpu with Intel.
For this repo, I've hit the error 17 times out of 127 attempts. Some are F cases, others WC; some are ne30, others ne120.
The source files named in the error message may not be a good hint to the cause, so I don't want to assume it's an issue with Kokkos. Note that I also got this error when I tried an incorrect resolution.
129: *** ma_convproc_tend error, massbal2 59 so4_a5 -- maxflux, sumflux, relerr = 4.283824E-14 -3.235764E-20 -7.553448E-07
129: *** ma_convproc_tend error, massbal2 63 soa_a1 -- maxflux, sumflux, relerr = 8.537513E-13 -7.612242E-19 -8.916229E-07
129: *** ma_convproc_tend error, massbal2 67 bc_a3 -- maxflux, sumflux, relerr = 1.892707E-15 4.001221E-23 2.114020E-08
129: *** ma_convproc_tend error, massbal4 69 dst_a1 -- maxflux, sumflux, relerr = 1.071177E-12 -1.206128E-20 -1.125985E-08
...
129: MPICH Notice [Rank 129] [job id 13886964.0] [Mon Aug 14 08:34:57 2023] [nid004324] - Abort(0) (rank 129 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 0) - process 129
129:
129: aborting job:
129: application called MPI_Abort(MPI_COMM_WORLD, 0) - process 129
129: forrtl: severe (174): SIGSEGV, segmentation fault occurred
129: Image PC Routine Line Source
129: libpthread-2.31.s 00001529989738C0 Unknown Unknown Unknown
129: e3sm.exe 00000000047F0334 Unknown Unknown Unknown
129: e3sm.exe 00000000048089E5 Unknown Unknown Unknown
129: e3sm.exe 0000000004808BF1 Unknown Unknown Unknown
129: e3sm.exe 0000000001FB9AC1 ~SharedAllocation 309 Kokkos_SharedAlloc.hpp
129: e3sm.exe 00000000048118C1 Unknown Unknown Unknown
129: e3sm.exe 0000000001FDE1D6 ~IslMpi 447 compose_slmm_islmpi.hpp
129: e3sm.exe 0000000001FD9D9C ~__shared_ptr 154 shared_ptr_base.h
129: libc-2.31.so 000015299804BAE9 Unknown Unknown Unknown
129: libc-2.31.so 000015299804BC7A Unknown Unknown Unknown
129: libpmi2.so.0 00001529959EF9E4 PMI_CRAY_Get_base Unknown Unknown
129: libmpi_intel.so.1 000015299AF906E2 Unknown Unknown Unknown
129: libmpi_intel.so.1 000015299969C478 MPI_Abort Unknown Unknown
129: libmpifort_intel. 000015299BB3E69D MPI_ABORT Unknown Unknown
129: e3sm.exe 000000000394734A mpas_log_mp_mpas_ 844 mpas_log.f90
129: e3sm.exe 00000000036286D8 seaice_error_mp_s 124 mpas_seaice_error.f90
129: e3sm.exe 000000000358CBC6 seaice_column_mp_ 2092 mpas_seaice_column.f90
129: e3sm.exe 0000000003577EEA seaice_column_mp_ 1103 mpas_seaice_column.f90
129: e3sm.exe 00000000034EC899 seaice_time_integ 135 mpas_seaice_time_integration.f90
129: e3sm.exe 0000000003434976 ice_comp_mct_mp_i 1135 ice_comp_mct.f90
129: e3sm.exe 000000000044639E component_mod_mp_ 757 component_mod.F90
129: e3sm.exe 0000000000427C87 cime_comp_mod_mp_ 2899 cime_comp_mod.F90
129: e3sm.exe 000000000044601C MAIN__ 153 cime_driver.F90
129: e3sm.exe 000000000042583D Unknown Unknown Unknown
129: libc-2.31.so 000015299803329D __libc_start_main Unknown Unknown
129: e3sm.exe 000000000042576A Unknown Unknown Unknown
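One way to read the traceback above, assuming the usual Intel Fortran layout with the innermost frame listed first: the frames below `MPI_Abort` are the callers that requested the abort (here the sea ice error path through `mpas_log`), while the frames above it ran afterwards, during process teardown. That would be consistent with the Kokkos frame being a secondary failure rather than the origin. The sketch below splits the frames at `MPI_Abort`; the helper names `parse_frames` and `split_at_abort` are hypothetical, and the sample is a subset of lines copied from the log.

```python
import re

# A few frames copied from the traceback above (innermost frame first).
SAMPLE = """\
129: Image PC Routine Line Source
129: e3sm.exe 0000000001FB9AC1 ~SharedAllocation 309 Kokkos_SharedAlloc.hpp
129: e3sm.exe 0000000001FDE1D6 ~IslMpi 447 compose_slmm_islmpi.hpp
129: libmpi_intel.so.1 000015299969C478 MPI_Abort Unknown Unknown
129: e3sm.exe 000000000394734A mpas_log_mp_mpas_ 844 mpas_log.f90
129: e3sm.exe 00000000036286D8 seaice_error_mp_s 124 mpas_seaice_error.f90
129: e3sm.exe 000000000044601C MAIN__ 153 cime_driver.F90
"""

# "<rank>: <image> <16-hex-digit PC> <routine> <line> <source>"
FRAME = re.compile(
    r"^\d+:\s+(\S+)\s+([0-9A-Fa-f]{16})\s+(\S+)\s+(\S+)\s+(\S+)\s*$"
)


def parse_frames(text):
    """Return one dict per traceback frame; non-frame lines are skipped."""
    frames = []
    for line in text.splitlines():
        m = FRAME.match(line)
        if m:
            image, _pc, routine, lineno, source = m.groups()
            frames.append(
                {"image": image, "routine": routine,
                 "line": lineno, "source": source}
            )
    return frames


def split_at_abort(frames):
    """Split frames into (ran after the abort call, callers of the abort)."""
    for i, frame in enumerate(frames):
        if frame["routine"] == "MPI_Abort":
            return frames[:i], frames[i + 1:]
    return frames, []


if __name__ == "__main__":
    after_abort, callers = split_at_abort(parse_frames(SAMPLE))
    print("ran after MPI_Abort:", [f["routine"] for f in after_abort])
    print("called MPI_Abort:   ", [f["routine"] for f in callers])
```

Applied to the full log, the same split should place the `seaice_column` and `ice_comp_mct` frames on the caller side, which suggests looking at the sea ice log for the message that preceded the abort.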
For at least some of the cases, resubmitting worked (I might not have tried all of them). Some of the failing cases use threads and some do not (128x1), but all are built OPT.
I can confirm this happens using maint-2.0 on pm-cpu as well.
I just hit this error again with something as simple as SMS_P128x1_Ld10.ne4_oQU240.WCYCL1850.pm-cpu_intel
Actually, resubmitting hits the same error. If it is repeatable, this might be a good case to debug further.
/pscratch/sd/n/ndk/e3sm_scratch/pm-cpu/me24-aug15/SMS_P128x1_Ld10.ne4_oQU240.WCYCL1850.pm-cpu_intel.20230818_085333_xjedmo
Noting that I have not seen this error in a while.