
Segfaults on initial iteration/pressure solve on Frontier

[Open] lawrenceccheung opened this issue · 3 comments

I'm seeing some strange segfaults with AMR-Wind during the initial iteration/pressure solve on Frontier. It appears to depend on the number of CPUs I use -- one run with 16 CPUs worked fine, but other cases with 8 or 64 CPUs crash. Even more interestingly, there are runs with 36 CPUs that complete fine, and then resubmitting the exact same case crashes.

The case is a sample ABL case with 4 refinement levels using this input file. The point of failure is always the same: inside the initial pressure projection step, right before it reports "L-inf norm MAC vels: after MAC projection":

L-inf norm MAC vels: before MAC projection
..............................................................................
Max u:          7.385139484 |  Location (x,y,z):        512,     512.25,      50.25
Min u:          -7.30398297 |  Location (x,y,z):          0,          0,          0
Max v:          7.385132626 |  Location (x,y,z):     512.25,        512,      50.25
Min v:         -7.054704724 |  Location (x,y,z):          0,          0,          0
Max w:       0.006505619572 |  Location (x,y,z):        257,        257,          2
Min w:      -0.006414281405 |  Location (x,y,z):        767,        767,          2
..............................................................................

  MAC_projection                12        0.009603343577        2.66402833e-09
srun: error: frontier10357: tasks 5,7: Segmentation fault
srun: Terminating StepId=1530799.0
slurmstepd: error: *** STEP 1530799.0 ON frontier10357 CANCELLED AT 2023-12-11T19:56:19 ***
srun: error: frontier10357: tasks 0,2-4,6: Terminated
srun: error: frontier10357: task 1: Segmentation fault (core dumped)
srun: Force Terminated StepId=1530799.0

Note the following things:

  • This setup works fine on Frontier GPUs with any number of ranks that I've tried.
  • Reducing the number of refinement levels seems to help, although I haven't tried every possible combination of max_level and number of CPUs yet.

Lawrence

lawrenceccheung avatar Dec 13 '23 17:12 lawrenceccheung

Not sure if this helps any, I compiled RelWithDebInfo and then ran it again. Here's the gdb core information:

#0  0x0000000000370621 in amr_wind::diagnostics::get_macvel_max(amrex::MultiFab const&, amrex::iMultiFab const&, int, double)::$_2::operator()(amrex::Box const&, amrex::Array4<double const> const&, amrex::Array4<int const> const&) const::{lambda(int, int, int)#1}::operator()(int, int, int) const (i=<optimized out>, j=364, k=192, this=<optimized out>) at /lustre/orion/cfd162/proj-shared/lcheung/spackbuilds/spack-manager.1/environments/amrwinddebug/amr-wind/amr-wind/utilities/diagnostics.cpp:94
94	                    (mask_arr(i, j, k) + mask_arr(ii, jj, kk) > 0 ? 1.0 : -1.0);

lawrenceccheung avatar Dec 21 '23 02:12 lawrenceccheung

If this is repeatable we should be able to track it down - can you find out what (i,j,k) and (ii,jj,kk) are when it dies?

asalmgren avatar Dec 21 '23 03:12 asalmgren

This issue is stale because it has been open 30 days with no activity.

github-actions[bot] avatar Jan 21 '24 02:01 github-actions[bot]

UPDATE: the function that Lawrence isolated above is only called when incflo.verbose is greater than 2. It does have a bug that should be fixed, but the immediate workaround (requiring no code change) is simply to set incflo.verbose to a lower value. My recommendation for most users is incflo.verbose = 0, because many of the verbose outputs exist to help developers diagnose issues and otherwise just make the log files hard to read.

The bug fix should be coming soon, but you don't need to wait for it if you change that input.
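For reference, the workaround is a one-line change in the AMR-Wind input file (parameter name taken from the comment above; the rest of the input file stays as-is):

```
incflo.verbose = 0
```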

mbkuhn avatar May 09 '24 22:05 mbkuhn