hommexx (thetaxx): issues with diagnostics on Summit
A build of thetaxx crashes with
terminate called after throwing an instance of 'std::runtime_error'
what(): Kokkos::Impl::ParallelFor< Cuda > requested too large team size.
Traceback functionality not available
Program received signal SIGABRT: Process abort signal.
Backtrace for this error:
....
#9 0x1023003f in _ZN6Kokkos12parallel_forINS_10TeamPolicyIJNS_4CudaEN5Homme11Diagnostics18EnergyHalfTimesTagEEEES4_EEvRKT_RKT0_RKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEPNSt9enable_ifIXsrNS_19is_execution_policyIS7_EE5valueEvE4typeE
at /ccs/home/onguba/acme-master/externals/kokkos/core/src/Kokkos_Parallel.hpp:169
#10 0x1022aaab in _ZN5Homme11Diagnostics21prim_energy_halftimesEbi
at /ccs/home/onguba/acme-master/components/homme/src/theta-l_kokkos/cxx/Diagnostics.cpp:177
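For context: Kokkos raises this abort when the team size requested by a TeamPolicy exceeds what the kernel can launch with on the device, which depends on the functor's register and shared-memory usage as well as the hardware block-size limit. Below is a minimal standalone sketch of how to query that limit, using a dummy untagged functor for illustration (the real policy here is a tagged TeamPolicy<Cuda, EnergyHalfTimesTag>):

#include <Kokkos_Core.hpp>
#include <cstdio>

// Dummy stand-in; the real kernel's resource usage determines the limit.
struct DummyFunctor {
  KOKKOS_INLINE_FUNCTION
  void operator() (const Kokkos::TeamPolicy<>::member_type&) const {}
};

int main (int argc, char** argv) {
  Kokkos::initialize(argc, argv);
  {
    Kokkos::TeamPolicy<> policy(/*league_size=*/1, Kokkos::AUTO);
    // Requesting a team size larger than this value for DummyFunctor is
    // what triggers "requested too large team size".
    std::printf("team_size_max = %d\n",
                policy.team_size_max(DummyFunctor(), Kokkos::ParallelForTag()));
  }
  Kokkos::finalize();
  return 0;
}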
Details are below.
Master from today, on Summit. I tried two execs, built with and without the macro WITH_ENERGY: theta-l-nlev72-kokkos (with) and theta-nlev128-kokkos (without).
The namelist is below:
bash-4.4$ cat bb128.nl
&ctl_nl
NThreads=1
partmethod = 4
topology = "cube"
test_case = "jw_baroclinic"
u_perturb = 1
rotate_grid = 0
ne = 2
nmax =12
qsize = 1
statefreq=6
disable_diagnostics = .false.
restartfreq = 43200
restartfile = "./R0001"
runtype = 0
mesh_file='/dev/null'
tstep=1 ! ne30: 300 ne120: 75
rsplit=1 ! ne30: 3 ne120: 2
qsplit = 1
tstep_type = 5
integration = "explicit"
nu=1e16
nu_div = -1 !nu_div=1e16
!in e3sm there are no ne20 settings
!but the rule for nu_div is the same as for nu,
!C(dx)^3.2 with nu_div=2.5e15 for ne30
!so, approx (1.5)^3.2\simeq 4, 2.5e15*4=1e16
nu_p=1e16
nu_q=1e16
nu_s=1e16
nu_top = -1 !2.5e5
se_ftype = 0
limiter_option = 8
vert_remap_q_alg = 1
hypervis_scaling=0
hypervis_order = 2
hypervis_subcycle=3 ! ne30: 3 ne120: 4
theta_hydrostatic_mode=.true.
/
&solver_nl
precon_method = "identity"
maxits = 500
tol = 1.e-9
/
&vert_nl
vfile_mid = './sabm-128.ascii'
vfile_int = './sabi-128.ascii'
/
&prof_inparm
profile_outpe_num = 100
profile_single_file = .true.
/
&analysis_nl
! disabled
output_timeunits=1,1
output_frequency=-1,-1
output_start_time=0,0
output_end_time=30000,30000
output_varnames1='ps','zeta','T','geo'
output_varnames2='Q','Q2','Q3','Q4','Q5'
! output_prefix='xx-ne20-'
io_stride=8
output_type = 'netcdf'
/
On Crusher I got different results for the two execs, but on Summit they behaved the same.
bash-4.4$ pwd
/ccs/home/onguba/as/summit/july2022-master
bash-4.4$ jsrun -n 6 -r 6 -l gpu-gpu -b packed:1 -d plane:1 -a1 -c7 -g1 --smpiargs "-gpu" test_execs/theta-nlev128-kokkos/theta-nlev128-kokkos < bb128.nl
This build was done with the cache file summit-gpumpi-asserts.cmake, which is not in the repo. The only flags in that file are
set(OPT_CXXFLAGS "-O3" CACHE STRING "")
which leaves asserts enabled:
CXX_DEFINES = -DHAVE_CONFIG_H -DHOMMEXX_CONFIG_IS_CMAKE -DHOMME_WITHOUT_PIOLIBRARY -DINCLUDE_CMAKE_FCI -DKOKKOS_DEPENDENCE -DSPMD -D_NO_MPI_RSEND
...
CXX_FLAGS = -g -std=c++14 --expt-extended-lambda -O3 -g -expt-extended-lambda -Wext-lambda-captures-this -arch=sm_70
Since I had run the same 128-level exec successfully before, but with BFB flags, I repeated the BFB runs.
Are these the extended diagnostics? SCREAM has been running fine on Summit, and now on PM, with qv diagnostics, for example, coming out correct.
The BFB run used summit-bfb.cmake (not in the repo), in folder
/ccs/home/onguba/as/summit/july2022-master-bfb
set(CMAKE_C_FLAGS "-w" CACHE STRING "")
set(ADD_CXX_FLAGS "-Xcudafe --diag_suppress=esa_on_defaulted_function_ignored -Wno-unknown-pragmas --fmad=false -O0" CACHE STRING "")
set(ADD_Fortran_FLAGS " -ffp-contract=off -O0" CACHE STRING "")
set(OPT_FLAGS "-O0" CACHE STRING "")
set(DEBUG_FLAGS "-ffp-contract=off -g" CACHE STRING "")
and
CXX_DEFINES = -DCPRGNU -DHAVE_CONFIG_H -DHAVE_MPI -DHOMMEXX_CONFIG_IS_CMAKE -DINCLUDE_CMAKE_FCI -DKOKKOS_DEPENDENCE -DLOGGING -DMPICH_SKIP_MPICXX -DNETCDF_C_LOGGING_ENABLED -DNETCDF_C_NC__ENDDEF_EXISTS -DOMPI_SKIP_MPICXX -DSPMD -DTIMING -D_NETCDF -D_NOPNETCDF -D_NO_MPI_RSEND
...
CXX_FLAGS = -g -std=c++14 --expt-extended-lambda -O0 -ffp-contract=off -g -Xcudafe --diag_suppress=esa_on_defaulted_function_ignored -Wno-unknown-pragmas --fmad=false -O0 -expt-extended-lambda -Wext-lambda-captures-this -arch=sm_70
First I confirmed that HS test 1 still runs with its namelist; then I used the namelist from above, and it ran.
@ambrad what are extended diagnostics? Also, this is with asserts; not sure whether it matters (one fix for the team size crashed with another assert, nslots<=0).
I'm confused why SCREAM is fine on Summit, including debug builds, but you're encountering issues with standalone Hommexx.
I do not know why this does not work. However, with disable_diagnostics=.true., everything runs, except that the stdout report (the one controlled by statefreq) does not show up (I don't remember whether the Fortran code behaves the same).
Also, I found some nondeterministic behavior that may or may not be related; I will try to reproduce it.
I ran standalone Homme with -DNDEBUG and opt flags, and both execs worked with disable_diagnostics = .false. and a nontrivial statefreq (which also produced the stdout report).
@ambrad @bartgol This is a change that worked for me for the namelist I use (below). Note that I changed AMB's default pair (16,32) to (16,16) (otherwise I was still getting the "team too large" error). Also, now that it works, the code occasionally crashes with NaNs in repro_sum. Could it be that some barrier is missing in the diagnostics?
[[email protected] theta-l_kokkos]$ git diff cxx/Diagnostics.hpp
diff --git a/components/homme/src/theta-l_kokkos/cxx/Diagnostics.hpp b/components/homme/src/theta-l_kokkos/cxx/Diagnostics.hpp
index 9598a30d73..76fcced624 100644
--- a/components/homme/src/theta-l_kokkos/cxx/Diagnostics.hpp
+++ b/components/homme/src/theta-l_kokkos/cxx/Diagnostics.hpp
@@ -35,10 +35,34 @@ private:
ExecViewUnmanaged<Scalar *[NP][NP][NUM_LEV_P]> dpnh_dp_i;
};
+
+ template <typename FunctorTag>
+ typename std::enable_if<OnGpu<ExecSpace>::value == false,
+ Kokkos::TeamPolicy<ExecSpace, FunctorTag> >::type
+ d_team_policy(const int num_exec) {
+ return Homme::get_default_team_policy<ExecSpace, FunctorTag>(num_exec);
+ }
+
+ template <typename FunctorTag>
+ typename std::enable_if<OnGpu<ExecSpace>::value == true,
+ Kokkos::TeamPolicy<ExecSpace, FunctorTag> >::type
+ d_team_policy(const int num_exec) {
+ ThreadPreferences tp;
+ tp.max_threads_usable = 16; //16
+ tp.max_vectors_usable = 16; //32
+ tp.prefer_larger_team = true;
+ return Homme::get_default_team_policy<ExecSpace, FunctorTag>(num_exec, tp);
+ }
+
+
+
+
public:
+
Diagnostics (const int num_elems, const bool theta_hydrostatic_mode) :
- m_policy(Homme::get_default_team_policy<ExecSpace,EnergyHalfTimesTag>(num_elems)),
+ //m_policy(Homme::get_default_team_policy<ExecSpace,EnergyHalfTimesTag>(num_elems)),
+ m_policy(d_team_policy<EnergyHalfTimesTag>(num_elems)),
m_tu(m_policy),
m_num_elems(num_elems),
m_theta_hydrostatic_mode(theta_hydrostatic_mode)
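A note on the numbers, offered as a plausible explanation rather than a verified one: on CUDA, a Kokkos team runs as one thread block of team_size x vector_length threads, so the default (16,32) preference requests 512-thread blocks, while (16,16) requests 256. If the EnergyHalfTimesTag functor's team_size_max comes out below the requested size in the asserts-on opt build (register usage can differ between builds), Kokkos aborts with exactly the "requested too large team size" error above.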
Is this for the opt build with asserts? If so, do the standard debug and opt builds work without changes?
Re: the NaNs, is that again in an opt build with asserts? Does the stack trace starting at repro_sum lead to the Diagnostics class?
In terms of solving this problem, if the stack trace indeed relates to diagnostics, I suggest the following:
- Set statefreq = 1 so the diagnostics kernels run as much as possible.
- If you can then get reliable crashes, insert team_barrier calls after every thread-team loop in the diagnostics kernels and see if they fix the problem (a minimal sketch of the pattern follows this list).
- If that works, we can then think about which team_barrier is actually needed.
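To illustrate the pattern the second suggestion refers to, here is a minimal standalone sketch (not the actual Diagnostics kernel; names and sizes are made up): if one thread-team loop writes data that a following loop reads, a team_barrier between them is required for correctness.

#include <Kokkos_Core.hpp>

using Real = double;
using Team = Kokkos::TeamPolicy<>::member_type;

int main (int argc, char** argv) {
  Kokkos::initialize(argc, argv);
  {
    const int nteams = 4, n = 16;
    Kokkos::View<Real**> scratch("scratch", nteams, n);
    Kokkos::View<Real*>  out("out", nteams);
    Kokkos::parallel_for(Kokkos::TeamPolicy<>(nteams, Kokkos::AUTO),
                         KOKKOS_LAMBDA (const Team& team) {
      const int ie = team.league_rank();
      // First team loop: each team thread fills part of scratch.
      Kokkos::parallel_for(Kokkos::TeamThreadRange(team, n),
                           [&] (const int i) { scratch(ie, i) = i + 1.0; });
      // Without this barrier, threads in the reduction below may read
      // scratch entries that another team thread has not written yet.
      team.team_barrier();
      // Second team loop: reduce over what the first loop produced.
      Real sum = 0;
      Kokkos::parallel_reduce(Kokkos::TeamThreadRange(team, n),
                              [&] (const int i, Real& s) { s += scratch(ie, i); },
                              sum);
      Kokkos::single(Kokkos::PerTeam(team), [&] () { out(ie) = sum; });
    });
    Kokkos::fence();
  }
  Kokkos::finalize();
  return 0;
}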
The fix is for the opt build with asserts. The opt build without asserts and the debug build (the BFB build for Homme standalone) with asserts both worked.
The NaNs message does not give me a stack trace, even though the flags include "-g":
1 ABORTING WITH ERROR: NaNs detected in repro sum input
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
with errorcode 128.
NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
Isn't setting max_vectors to 16 in opt builds going to make us use only half warps? Or maybe use the whole warp, but with uncoalesced accesses? Perhaps we should set max_threads to 8 and leave max_vectors at 32?
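For reference, my mental model of the CUDA mapping, which may be off: Kokkos assigns vector lanes to threadIdx.x and team threads to threadIdx.y, so the warp stays fully populated as long as team_size x vector_length is a multiple of 32; what changes is how many lanes a single ThreadVectorRange spans, which can affect coalescing. In comment form:

// Kokkos/CUDA mapping (sketch, as I understand it):
//   blockDim.x = vector_length (lanes of one team thread)
//   blockDim.y = team_size     (team threads per team)
// (16,32): blockDim=(32,16), 512 threads/block; a ThreadVectorRange spans a full warp
// (16,16): blockDim=(16,16), 256 threads/block; two team threads share each warp,
//          so a ThreadVectorRange spans only half a warp (16 lanes)
// (8,32):  blockDim=(32,8),  256 threads/block; full-warp vectors again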
I need to check what I am doing w.r.t. asserts; I will update soon.
To clarify:
Master branch, with diagnostics and statefreq=1.
1) With opt flags, without -DNDEBUG (asserts on), the code crashes with "team too large" in diagnostics; the fix above works with either the (16,16) or the (8,32) pair.
2) With opt flags, with -DNDEBUG (asserts off), the code runs without the fix.
I will double-check, but the SCREAM performance runs had -DNDEBUG, so they are consistent with 2).
Case 2) with the fix now produces nondeterministic crashes (NaNs in repro_sum). I will have to try barriers in the diagnostics.
Thanks. Right, the SCREAM opt build (as is true of all CIME-based opt builds, I believe) runs without asserts.
Re: "Case 2) with the fix now produces nondeterministic crashes (NaNs in repro_sum)": OK, so you're saying that when you run with (16,16), you're getting failures. That does likely point to one or more missing team barriers in the kernels that use that team policy.
Team sizes and NaNs in repro sums are addressed in https://github.com/E3SM-Project/E3SM/pull/5039.