Occasional hang in ATM init for ne256/ne1024 cases on Frontier
With scream master, I've been trying some cases on Frontier. About 10% of the cases (I can get a better estimate of the frequency) hang at what looks like the same location. So far, I've only seen the hang with ne1024 (with 320 to 2048 nodes) and ne256 (at 192 nodes). It also happens without threading.
Catching a hung job, logging into a compute node, and checking the process status yields this as the location:
(gdb) where
#0 cxip_ep_progress (fid=<optimized out>) at prov/cxi/src/cxip_ep.c:184
#1 0x00007f83b14e27f9 in cxip_util_cq_progress (util_cq=0x1130eff0) at prov/cxi/src/cxip_cq.c:112
#2 0x00007f83b14be191 in ofi_cq_readfrom (cq_fid=0x1130eff0, buf=<optimized out>, count=8, src_addr=0x0) at prov/util/src/util_cq.c:232
#3 0x00007f846637e587 in MPIR_Wait_impl.part.0 () from /opt/cray/pe/mpich/8.1.26/ofi/crayclang/14.0/lib/libmpi_cray.so.12
#4 0x00007f84671751f6 in MPIC_Wait () from /opt/cray/pe/mpich/8.1.26/ofi/crayclang/14.0/lib/libmpi_cray.so.12
#5 0x00007f8467188bbc in MPIC_Sendrecv () from /opt/cray/pe/mpich/8.1.26/ofi/crayclang/14.0/lib/libmpi_cray.so.12
#6 0x00007f846708a9bc in MPIR_Allreduce_intra_recursive_doubling () from /opt/cray/pe/mpich/8.1.26/ofi/crayclang/14.0/lib/libmpi_cray.so.12
#7 0x00007f84655fc601 in MPIR_Allreduce_intra_auto () from /opt/cray/pe/mpich/8.1.26/ofi/crayclang/14.0/lib/libmpi_cray.so.12
#8 0x00007f84655fc7e5 in MPIR_Allreduce_impl () from /opt/cray/pe/mpich/8.1.26/ofi/crayclang/14.0/lib/libmpi_cray.so.12
#9 0x00007f8467380287 in MPIR_CRAY_Allreduce () from /opt/cray/pe/mpich/8.1.26/ofi/crayclang/14.0/lib/libmpi_cray.so.12
#10 0x00007f846565e59c in PMPI_Allreduce () from /opt/cray/pe/mpich/8.1.26/ofi/crayclang/14.0/lib/libmpi_cray.so.12
#11 0x00000000026e0448 in int scream::impl::field_min<int>(scream::Field const&, ekat::Comm const*) ()
#12 0x00000000026d9744 in scream::AbstractGrid::get_global_min_dof_gid() const ()
#13 0x00000000025bd63b in scream::SPA::set_grids(std::shared_ptr<scream::GridsManager const>) ()
#14 0x00000000026980ef in scream::AtmosphereProcessGroup::set_grids(std::shared_ptr<scream::GridsManager const>) ()
#15 0x00000000026980ef in scream::AtmosphereProcessGroup::set_grids(std::shared_ptr<scream::GridsManager const>) ()
#16 0x00000000026980ef in scream::AtmosphereProcessGroup::set_grids(std::shared_ptr<scream::GridsManager const>) ()
#17 0x0000000001977221 in scream::control::AtmosphereDriver::create_grids() ()
#18 0x0000000000645590 in scream_create_atm_instance ()
#19 0x00000000006404ef in atm_init_mct$atm_comp_mct_ ()
#20 0x000000000046110b in component_init_cc$component_mod_ ()
#21 0x000000000042dd2e in cime_init$cime_comp_mod_ ()
#22 0x000000000045ec23 in main ()
We don't know whether the root cause is at the location pointed to here, but we can at least verify it.
auto AbstractGrid::
get_global_min_dof_gid () const -> gid_type
{
  // Lazy calculation: compute (and cache) the global min only on the first call.
  // field_min reduces over the grid's comm (see the MPI_Allreduce frames in the
  // stack trace above), so the call is collective.
  if (m_global_min_dof_gid==std::numeric_limits<gid_type>::max()) {
    m_global_min_dof_gid = field_min<gid_type>(m_dofs_gids,&get_comm());
  }
  return m_global_min_dof_gid;
}
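For reference, the stack above shows field_min&lt;int&gt; bottoming out in PMPI_Allreduce. The sketch below is only a mental model of that call (the scalar signature is invented for illustration; only the allreduce itself is confirmed by the trace), to make explicit why a single rank that never reaches the call would leave all the others waiting:

```cpp
#include <mpi.h>

// Simplified sketch, NOT the actual scream::impl::field_min implementation.
// The point is that the reduction is collective: every rank on 'comm' must
// enter the MPI_Allreduce, otherwise the ranks that did enter it keep polling
// the network (cxip_ep_progress in the trace above) and the job hangs.
int global_min_sketch (int local_min, MPI_Comm comm) {
  int global_min;
  MPI_Allreduce(&local_min, &global_min, 1, MPI_INT, MPI_MIN, comm);
  return global_min;
}
```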
I did try one case with our previous Cess branch and did not see a hang, but I may need many more cases to say for sure that it is not happening there as well.
It was suggested that I add some prints around here to get more info, but I have not done that yet.
It is possible that this is NOT the bad point. I looked at the code, and it seems clean (there's a separate issue, as I pointed out on Slack, but that should NOT cause a hang, just redo the same work twice).
I am wondering if another rank has somehow gotten stuck somewhere else. We should add a comm barrier to rule that out. E.g., we could add the barrier right before the get_global_min_dof_gid call in SPA::set_grids, with a print statement right after it. If all ranks are currently there, we should see the print; otherwise we would not. We may have to replace the print with a write to file (maybe just on rank 0), so that we don't risk missing it because of how the job captures stdout and redirects it to e3sm.log. A sketch of this check follows below.
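A minimal sketch of that check, assuming it is dropped into SPA::set_grids right before the get_global_min_dof_gid() call; the grid pointer and file name are placeholders, and m_comm is the atm process's ekat::Comm (the same member used in the logging snippets further down):

```cpp
#include <fstream>   // at the top of the file

  // ...inside SPA::set_grids, right before the call that appears in the stack...
  m_comm.barrier ();                      // hangs here if any rank is stuck elsewhere
  if (m_comm.rank()==0) {
    // Write to a file rather than stdout, so the message is not lost in the
    // job's stdout capture/redirection into e3sm.log.
    std::ofstream ofs ("spa_barrier_check.txt");
    ofs << "All ranks reached SPA::set_grids before get_global_min_dof_gid\n";
  }
  const auto min_gid = grid->get_global_min_dof_gid ();  // 'grid' is a placeholder name
```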
Actually, the SPA PR that just got merged changes the init sequence of SPA. We no longer call get_global_min_dof_gid in SPA, so we should re-run with current master to see where the hang moved (assuming it still hangs).
Yes, it is certainly possible that the root cause is before this point. I did capture the stack of another MPI process and it showed it was also in the same place, but there could still be just one rank stuck somewhere else.
Note that my attempts were with a scream master from a few days ago, so obviously I have not tried the new SPA fixes.
To be clear, I don't expect the SPA PR to fix any bug that was in the code (at least, that's not very likely), but we may as well continue the debugging effort with an up-to-date master.
I quickly went through the code (before the SPA PR merge), and I think the last place where MPI happens before the SPA::set_grids call is when the AD builds the grids. It may be a good idea to add some logging. E.g., we could add these (additions are marked with a +).
In atmosphere_driver.cpp:
+ m_atm_comm.barrier();
+ m_atm_logger->info(" [EAMxx] Setting grids in atm procs...");
+ m_atm_logger->flush();
m_atm_process_group->set_grids(m_grids_manager);
+ m_atm_comm.barrier();
+ m_atm_logger->info(" [EAMxx] Setting grids in atm procs...DONE!");
+ m_atm_logger->flush();
In atmosphere_process_group.cpp:
for (auto& atm_proc : m_atm_processes) {
+   m_comm.barrier();
+   m_atm_logger->info(" [EAMxx] Setting grids in atm proc '" + atm_proc->name() + "'...");
+   m_atm_logger->flush();
    atm_proc->set_grids(grids_manager);
+   m_comm.barrier();
+   m_atm_logger->info(" [EAMxx] Setting grids in atm proc '" + atm_proc->name() + "'...DONE!");
+   m_atm_logger->flush();
}
This should help us understand where the processes get disconnected, by looking at the last message in the atm log file...
I'm a big fan of adding output that gives us more info about what's happening, especially in places that typically take a long time.
I'm trying some cases with the above logging changes you suggested.
I've been running cases using checkouts that contain
4c1ce5cf31 2024-01-23 10:33:13 -0700 Merge pull request #2646 from E3SM-Project/bartgol/eamxx/spa-use-horiz-interp-remapper
which Luca suggested might affect the hanging. So far, I have run 12 cases with no hangs, but it may still be too early to know.
There was also a system maintenance recently...
Just updating this issue: I've been doing some autotune testing with a Jan 25th checkout (and a few others) on Frontier and have yet to see the flavor of hang noted above.
Since we can't prove that the problem is gone simply by not seeing it, we need to take a statistical approach. @ndkeen, roughly how often did you notice the hang before? We could use that to come up with a number of runs after which we can agree that the problem has been solved.
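To put a rough number on it (purely a back-of-the-envelope sketch, assuming independent runs and the ~10% hang rate mentioned in the original report): if the bug were still present at rate p, the probability of seeing no hang in N runs would be (1 - p)^N ≈ 0.9^N. Since 0.9^29 ≈ 0.047, on the order of 30 consecutive hang-free runs at the affected resolutions/node counts would push the "still hanging at ~10%" hypothesis below a 5% threshold. The actual rate may well depend on node count and resolution, so treat this as a ballpark only.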