Segfaults on initial iteration/pressure solve on Frontier
I'm seeing some strange segfaults with AMR-Wind on the initial iteration/pressure solve on Frontier. It appears to depend on the number of CPUs I use: one run with 16 CPUs worked fine, but other cases with 8 or 64 CPUs crash. Even more interestingly, there are some runs with 36 CPUs that finish fine, and then when I resubmit the exact same case, it crashes.
The case is a sample ABL case with 4 refinement levels using this input file. The point of failure is always the same: inside the initial pressure projection step, right before it reports the "L-inf norm MAC vels: after MAC projection" line:
L-inf norm MAC vels: before MAC projection
..............................................................................
Max u: 7.385139484 | Location (x,y,z): 512, 512.25, 50.25
Min u: -7.30398297 | Location (x,y,z): 0, 0, 0
Max v: 7.385132626 | Location (x,y,z): 512.25, 512, 50.25
Min v: -7.054704724 | Location (x,y,z): 0, 0, 0
Max w: 0.006505619572 | Location (x,y,z): 257, 257, 2
Min w: -0.006414281405 | Location (x,y,z): 767, 767, 2
..............................................................................
MAC_projection 12 0.009603343577 2.66402833e-09
srun: error: frontier10357: tasks 5,7: Segmentation fault
srun: Terminating StepId=1530799.0
slurmstepd: error: *** STEP 1530799.0 ON frontier10357 CANCELLED AT 2023-12-11T19:56:19 ***
srun: error: frontier10357: tasks 0,2-4,6: Terminated
srun: error: frontier10357: task 1: Segmentation fault (core dumped)
srun: Force Terminated StepId=1530799.0
Note the following:
- This setup works fine on Frontier GPUs with any number of ranks that I've tried.
- Reducing the number of refinement levels seems to help, although I haven't tried every combination of max_level and number of CPUs yet (see the input sketch below).
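For anyone reproducing this, the knob in question is the usual AMReX-style max_level entry; the value below is only an illustration of the experiment, not the actual setting from the linked input file:

    amr.max_level = 2    # try fewer refinement levels than the 4 used in the failing case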
Lawrence
Not sure if this helps any, but I compiled with RelWithDebInfo and ran it again. Here's the gdb core information:
#0 0x0000000000370621 in amr_wind::diagnostics::get_macvel_max(amrex::MultiFab const&, amrex::iMultiFab const&, int, double)::$_2::operator()(amrex::Box const&, amrex::Array4<double const> const&, amrex::Array4<int const> const&) const::{lambda(int, int, int)#1}::operator()(int, int, int) const (i=<optimized out>, j=364, k=192, this=<optimized out>) at /lustre/orion/cfd162/proj-shared/lcheung/spackbuilds/spack-manager.1/environments/amrwinddebug/amr-wind/amr-wind/utilities/diagnostics.cpp:94
94 (mask_arr(i, j, k) + mask_arr(ii, jj, kk) > 0 ? 1.0 : -1.0);
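The flagged line appears to read the cell-centered mask at a cell (i, j, k) and at a neighboring cell (ii, jj, kk) on the other side of a MAC-velocity face. As a purely illustrative sketch (plain standard C++, not the actual AMR-Wind/AMReX code, and only an assumption about the failure mode), here is a 1-D analogue of why that neighbor read can step outside the mask array at a box edge unless it is guarded:

    // Hypothetical 1-D analogue of taking a masked max of a face-centered
    // (MAC) velocity: n cells have n+1 faces, while the mask is cell-centered
    // with only n entries, so the cell on the "other side" of the lowest or
    // highest face is out of bounds unless the read is guarded.
    #include <algorithm>
    #include <cstdio>
    #include <vector>

    int main()
    {
        const int n = 8;                      // number of cells
        std::vector<int> mask(n, 1);          // cell-centered level mask
        std::vector<double> umac(n + 1, 1.0); // face-centered MAC velocity

        double umax = -1.0e30;
        for (int i = 0; i <= n; ++i) {        // loop over faces
            const int ii = i - 1;             // cell on the low side of face i
            // Unguarded, mask[ii] at i = 0 (and mask[i] at i = n) reads past
            // the end of the array, the kind of access that can segfault.
            const int m_hi = (i < n) ? mask[i] : 0;
            const int m_lo = (ii >= 0) ? mask[ii] : 0;
            const double marker = (m_hi + m_lo > 0) ? 1.0 : -1.0;
            umax = std::max(umax, umac[i] * marker);
        }
        std::printf("masked max MAC u = %g\n", umax);
        return 0;
    }

The real routine may fail for a different reason; the sketch only shows why the index values asked about below are the useful thing to capture.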
If this is repeatable we should be able to track it down - can you find out what (i,j,k) and (ii,jj,kk) are when it dies?
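One way to pull those out of the core dump with gdb (the executable and core-file names here are placeholders, and some values may show as optimized out in a RelWithDebInfo build):

    gdb ./amr_wind core        # load the executable and the core dump
    (gdb) frame 0              # select the crashing frame in diagnostics.cpp
    (gdb) info args            # i, j, k (some may be <optimized out>)
    (gdb) info locals          # ii, jj, kk, if they have not been optimized away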
UPDATE: the function that Lawrence isolated above is only called when incflo.verbose is greater than 2. It does have a bug that should be fixed, but the immediate workaround (requiring no code change) is simply to set incflo.verbose to a lower value. My recommendation for most users is incflo.verbose = 0, because many of the verbose outputs exist to help developers diagnose issues and otherwise just make the log files hard to read.
The bug fix should be coming soon, but you don't need to wait for it if you change that input.
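For concreteness, the workaround is a single line in the input file (the parameter name comes straight from the comment above; everything else in your input stays the same):

    incflo.verbose = 0    # skips the extra MAC-velocity diagnostics that hit the buggy code path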