hommexx (thetaxx): issues with diagnostics on Summit
A build of thetaxx crashes with
terminate called after throwing an instance of 'std::runtime_error'
what(): Kokkos::Impl::ParallelFor< Cuda > requested too large team size.
Traceback functionality not available
Program received signal SIGABRT: Process abort signal.
Backtrace for this error:
....
#9 0x1023003f in _ZN6Kokkos12parallel_forINS_10TeamPolicyIJNS_4CudaEN5Homme11Diagnostics18EnergyHalfTimesTagEEEES4_EEvRKT_RKT0_RKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEPNSt9enable_ifIXsrNS_19is_execution_policyIS7_EE5valueEvE4typeE
at /ccs/home/onguba/acme-master/externals/kokkos/core/src/Kokkos_Parallel.hpp:169
#10 0x1022aaab in _ZN5Homme11Diagnostics21prim_energy_halftimesEbi
at /ccs/home/onguba/acme-master/components/homme/src/theta-l_kokkos/cxx/Diagnostics.cpp:177
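For context: Kokkos raises this abort when the team size requested by a TeamPolicy exceeds what the kernel can launch with on the device, which depends on the functor's register and shared-memory usage as well as the hardware block-size limit. Below is a minimal standalone sketch of how to query that limit, using a dummy untagged functor for illustration (the real policy here is a tagged TeamPolicy<Cuda, EnergyHalfTimesTag>):

#include <Kokkos_Core.hpp>
#include <cstdio>

// Dummy stand-in; the real kernel's resource usage determines the limit.
struct DummyFunctor {
  KOKKOS_INLINE_FUNCTION
  void operator() (const Kokkos::TeamPolicy<>::member_type&) const {}
};

int main (int argc, char** argv) {
  Kokkos::initialize(argc, argv);
  {
    Kokkos::TeamPolicy<> policy(/*league_size=*/1, Kokkos::AUTO);
    // Requesting a team size larger than this value for DummyFunctor is
    // what triggers "requested too large team size".
    std::printf("team_size_max = %d\n",
                policy.team_size_max(DummyFunctor(), Kokkos::ParallelForTag()));
  }
  Kokkos::finalize();
  return 0;
}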
Details are below.
Master from today, on Summit. I tried two execs, built with and without the macro WITH_ENERGY: theta-l-nlev72-kokkos (with) and theta-nlev128-kokkos (without).
The namelist is below:
bash-4.4$ cat bb128.nl
&ctl_nl
NThreads=1
partmethod = 4
topology = "cube"
test_case = "jw_baroclinic"
u_perturb = 1
rotate_grid = 0
ne = 2
nmax =12
qsize = 1
statefreq=6
disable_diagnostics = .false.
restartfreq = 43200
restartfile = "./R0001"
runtype = 0
mesh_file='/dev/null'
tstep=1 ! ne30: 300 ne120: 75
rsplit=1 ! ne30: 3 ne120: 2
qsplit = 1
tstep_type = 5
integration = "explicit"
nu=1e16
nu_div = -1 !nu_div=1e16
!in e3sm there are no ne20 settings
!but the rule for nu_div is the same as for nu,
!C(dx)^3.2 with nu_div=2.5e15 for ne30
!so, approx (1.5)^3.2\simeq 4, 2.5e15*4=1e16
nu_p=1e16
nu_q=1e16
nu_s=1e16
nu_top = -1 !2.5e5
se_ftype = 0
limiter_option = 8
vert_remap_q_alg = 1
hypervis_scaling=0
hypervis_order = 2
hypervis_subcycle=3 ! ne30: 3 ne120: 4
theta_hydrostatic_mode=.true.
/
&solver_nl
precon_method = "identity"
maxits = 500
tol = 1.e-9
/
&vert_nl
vfile_mid = './sabm-128.ascii'
vfile_int = './sabi-128.ascii'
/
&prof_inparm
profile_outpe_num = 100
profile_single_file = .true.
/
&analysis_nl
! disabled
output_timeunits=1,1
output_frequency=-1,-1
output_start_time=0,0
output_end_time=30000,30000
output_varnames1='ps','zeta','T','geo'
output_varnames2='Q','Q2','Q3','Q4','Q5'
! output_prefix='xx-ne20-'
io_stride=8
output_type = 'netcdf'
/
On Crusher I got different results for the two execs, but on Summit they behaved the same.
bash-4.4$ pwd
/ccs/home/onguba/as/summit/july2022-master
bash-4.4$ jsrun -n 6 -r 6 -l gpu-gpu -b packed:1 -d plane:1 -a1 -c7 -g1 --smpiargs "-gpu" test_execs/theta-nlev128-kokkos/theta-nlev128-kokkos < bb128.nl
This build was done with the cache file summit-gpumpi-asserts.cmake, which is not in the repo. The only flags in that file are
set(OPT_CXXFLAGS "-O3" CACHE STRING "")
which leaves asserts enabled:
CXX_DEFINES = -DHAVE_CONFIG_H -DHOMMEXX_CONFIG_IS_CMAKE -DHOMME_WITHOUT_PIOLIBRARY -DINCLUDE_CMAKE_FCI -DKOKKOS_DEPENDENCE -DSPMD -D_NO_MPI_RSEND
...
CXX_FLAGS = -g -std=c++14 --expt-extended-lambda -O3 -g -expt-extended-lambda -Wext-lambda-captures-this -arch=sm_70
Since I had run the same 128-level exec successfully before, but with BFB flags, I repeated the BFB runs.
Are these the extended diagnostics? SCREAM has been running fine on Summit, and now on PM, with qv diagnostics, for example, coming out correct.
The BFB run used summit-bfb.cmake (not in the repo), in folder
/ccs/home/onguba/as/summit/july2022-master-bfb
set(CMAKE_C_FLAGS "-w" CACHE STRING "")
set(ADD_CXX_FLAGS "-Xcudafe --diag_suppress=esa_on_defaulted_function_ignored -Wno-unknown-pragmas --fmad=false -O0" CACHE STRING "")
set(ADD_Fortran_FLAGS " -ffp-contract=off -O0" CACHE STRING "")
set(OPT_FLAGS "-O0" CACHE STRING "")
set(DEBUG_FLAGS "-ffp-contract=off -g" CACHE STRING "")
and
CXX_DEFINES = -DCPRGNU -DHAVE_CONFIG_H -DHAVE_MPI -DHOMMEXX_CONFIG_IS_CMAKE -DINCLUDE_CMAKE_FCI -DKOKKOS_DEPENDENCE -DLOGGING -DMPICH_SKIP_MPICXX -DNETCDF_C_LOGGING_ENABLED -DNETCDF_C_NC__ENDDEF_EXISTS -DOMPI_SKIP_MPICXX -DSPMD -DTIMING -D_NETCDF -D_NOPNETCDF -D_NO_MPI_RSEND
...
CXX_FLAGS = -g -std=c++14 --expt-extended-lambda -O0 -ffp-contract=off -g -Xcudafe --diag_suppress=esa_on_defaulted_function_ignored -Wno-unknown-pragmas --fmad=false -O0 -expt-extended-lambda -Wext-lambda-captures-this -arch=sm_70
First I confirmed that HS test 1 still runs with its namelist; then I used the namelist from above, and it ran.
@ambrad what are extended diagnostics? Also, this is with asserts; not sure whether it matters (one fix for the team size crashed with another assert, nslots<=0).
I'm confused why SCREAM is fine on Summit, including debug builds, but you're encountering issues with standalone Hommexx.
I do not know why this does not work. However, with disable_diagnostics=.true., everything runs, except that the stdout report (the one controlled by statefreq) does not show up (I don't remember whether the Fortran code behaves the same).
Also, I found some nondeterministic behavior that may or may not be related; I will try to reproduce it.
I ran standalone Homme with -DNDEBUG and opt flags, and both execs worked with disable_diagnostics = .false. and a nontrivial statefreq (which also produced the stdout report).
@ambrad @bartgol This is a change that worked for me for the namelist I use (below). Note that I changed AMB's default pair (16,32) to (16,16) (otherwise I was still getting the "team too large" error). Also, now that it works, the code occasionally crashes with NaNs in repro_sum. Could it be that some barrier is missing in the diagnostics?
[[email protected] theta-l_kokkos]$ git diff cxx/Diagnostics.hpp
diff --git a/components/homme/src/theta-l_kokkos/cxx/Diagnostics.hpp b/components/homme/src/theta-l_kokkos/cxx/Diagnostics.hpp
index 9598a30d73..76fcced624 100644
--- a/components/homme/src/theta-l_kokkos/cxx/Diagnostics.hpp
+++ b/components/homme/src/theta-l_kokkos/cxx/Diagnostics.hpp
@@ -35,10 +35,34 @@ private:
ExecViewUnmanaged<Scalar *[NP][NP][NUM_LEV_P]> dpnh_dp_i;
};
+
+ template <typename FunctorTag>
+ typename std::enable_if<OnGpu<ExecSpace>::value == false,
+ Kokkos::TeamPolicy<ExecSpace, FunctorTag> >::type
+ d_team_policy(const int num_exec) {
+ return Homme::get_default_team_policy<ExecSpace, FunctorTag>(num_exec);
+ }
+
+ template <typename FunctorTag>
+ typename std::enable_if<OnGpu<ExecSpace>::value == true,
+ Kokkos::TeamPolicy<ExecSpace, FunctorTag> >::type
+ d_team_policy(const int num_exec) {
+ ThreadPreferences tp;
+ tp.max_threads_usable = 16; //16
+ tp.max_vectors_usable = 16; //32
+ tp.prefer_larger_team = true;
+ return Homme::get_default_team_policy<ExecSpace, FunctorTag>(num_exec, tp);
+ }
+
+
+
+
public:
+
Diagnostics (const int num_elems, const bool theta_hydrostatic_mode) :
- m_policy(Homme::get_default_team_policy<ExecSpace,EnergyHalfTimesTag>(num_elems)),
+ //m_policy(Homme::get_default_team_policy<ExecSpace,EnergyHalfTimesTag>(num_elems)),
+ m_policy(d_team_policy<EnergyHalfTimesTag>(num_elems)),
m_tu(m_policy),
m_num_elems(num_elems),
m_theta_hydrostatic_mode(theta_hydrostatic_mode)
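A note on the numbers, offered as a plausible explanation rather than a verified one: on CUDA, a Kokkos team runs as one thread block of team_size x vector_length threads, so the default (16,32) preference requests 512-thread blocks, while (16,16) requests 256. If the EnergyHalfTimesTag functor's team_size_max comes out below the requested size in the asserts-on opt build (register usage can differ between builds), Kokkos aborts with exactly the "requested too large team size" error above.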
Is this for the opt build with asserts? If so, do the standard debug and opt builds work without changes?
Re: the NaNs, is that again in an opt build with asserts? Does the stack trace starting at repro_sum lead to the Diagnostics class?
In terms of solving this problem, if the stack trace indeed relates to diagnostics, I suggest the following:
- Set statefreq = 1 so the diagnostics kernels run as much as possible.
- If you can then get reliable crashes, insert team_barrier calls after every thread-team loop in the diagnostics kernels and see if they fix the problem (a minimal sketch of the pattern follows this list).
- If that works, we can then think about which team_barrier is actually needed.
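To illustrate the pattern the second suggestion refers to, here is a minimal standalone sketch (not the actual Diagnostics kernel; names and sizes are made up): if one thread-team loop writes data that a following loop reads, a team_barrier between them is required for correctness.

#include <Kokkos_Core.hpp>

using Real = double;
using Team = Kokkos::TeamPolicy<>::member_type;

int main (int argc, char** argv) {
  Kokkos::initialize(argc, argv);
  {
    const int nteams = 4, n = 16;
    Kokkos::View<Real**> scratch("scratch", nteams, n);
    Kokkos::View<Real*>  out("out", nteams);
    Kokkos::parallel_for(Kokkos::TeamPolicy<>(nteams, Kokkos::AUTO),
                         KOKKOS_LAMBDA (const Team& team) {
      const int ie = team.league_rank();
      // First team loop: each team thread fills part of scratch.
      Kokkos::parallel_for(Kokkos::TeamThreadRange(team, n),
                           [&] (const int i) { scratch(ie, i) = i + 1.0; });
      // Without this barrier, threads in the reduction below may read
      // scratch entries that another team thread has not written yet.
      team.team_barrier();
      // Second team loop: reduce over what the first loop produced.
      Real sum = 0;
      Kokkos::parallel_reduce(Kokkos::TeamThreadRange(team, n),
                              [&] (const int i, Real& s) { s += scratch(ie, i); },
                              sum);
      Kokkos::single(Kokkos::PerTeam(team), [&] () { out(ie) = sum; });
    });
    Kokkos::fence();
  }
  Kokkos::finalize();
  return 0;
}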
The fix is for the opt build with asserts. The opt build without asserts and the debug build (the BFB build for Homme standalone) with asserts both worked.
The NaNs message does not give me a stack trace, even though the flags include "-g":
1 ABORTING WITH ERROR: NaNs detected in repro sum input
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
with errorcode 128.
NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
Isn't setting max_vectors to 16 in opt builds going to make us use only half warps? Or maybe use the whole warp, but with uncoalesced accesses? Perhaps we should set max_threads to 8 and leave max_vectors at 32?
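For reference, my mental model of the CUDA mapping, which may be off: Kokkos assigns vector lanes to threadIdx.x and team threads to threadIdx.y, so the warp stays fully populated as long as team_size x vector_length is a multiple of 32; what changes is how many lanes a single ThreadVectorRange spans, which can affect coalescing. In comment form:

// Kokkos/CUDA mapping (sketch, as I understand it):
//   blockDim.x = vector_length (lanes of one team thread)
//   blockDim.y = team_size     (team threads per team)
// (16,32): blockDim=(32,16), 512 threads/block; a ThreadVectorRange spans a full warp
// (16,16): blockDim=(16,16), 256 threads/block; two team threads share each warp,
//          so a ThreadVectorRange spans only half a warp (16 lanes)
// (8,32):  blockDim=(32,8),  256 threads/block; full-warp vectors again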
I need to check what I am doing w.r.t. asserts; I will update soon.
To clarify:
Master branch, with diagnostics and statefreq=1.
1) With opt flags, without -DNDEBUG (asserts on), the code crashes with "team too large" in diagnostics; the fix above works with either the (16,16) or the (8,32) pair.
2) With opt flags, with -DNDEBUG (asserts off), the code runs without the fix.
I will double-check, but the SCREAM performance runs had -DNDEBUG, so they are consistent with 2).
Case 2) with the fix now produces nondeterministic crashes (NaNs in repro_sum). I will have to try barriers in the diagnostics.
Thanks. Right, the SCREAM opt build (as is true of all CIME-based opt builds, I believe) runs without asserts.
Re: "Case 2) with the fix now produces nondeterministic crashes (NaNs in repro_sum)": OK, so you're saying that when you run with (16,16), you're getting failures. That does likely point to one or more missing team barriers in the kernels that use that team policy.
Team sizes and NaNs in repro sums are addressed in https://github.com/E3SM-Project/E3SM/pull/5039.