Cori: Bus Error instead of OOM
I encountered a case on Cori KNL nodes where a WarpX setup that likely ran out of memory reported a bus error instead of an OOM error.
The binary was compiled as `warpx.2d.MPI.OMP.DP.PDP.OPMD.PSATD.QED`.

Input deck: input_sets.zip. It contains 3 setups running the same 2D box size on one, two, or four KNL nodes.

The submit file `debug_2nodes.sbatch` fails on 2 KNL nodes with the errors shown below.
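For orientation, here is a minimal sketch of what such a submit script could look like for this case (2 nodes, 16 MPI ranks, 8 OpenMP threads per rank, consistent with the log output below). This is an assumption for illustration, not the verbatim contents of `debug_2nodes.sbatch`; the queue, walltime, and input file name are placeholders:

```bash
#!/bin/bash
#SBATCH -N 2                  # 2 Cori KNL nodes
#SBATCH -C knl
#SBATCH -q debug              # placeholder queue
#SBATCH -t 00:30:00           # placeholder walltime
#SBATCH -J WarpX
#SBATCH -o output.txt         # inferred from the report
#SBATCH -e WarpX.e%j          # inferred from the WarpX.e<JobID> file name

# 8 OpenMP threads per MPI rank, as reported in output.txt
export OMP_NUM_THREADS=8

# 16 MPI ranks total (8 per node), pinned to cores
srun -n 16 -c 16 --cpu-bind=cores \
    ./warpx.2d.MPI.OMP.DP.PDP.OPMD.PSATD.QED inputs_2d  # inputs_2d: placeholder name
```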
The only output in `output.txt` reads:
```
MPI initialized with 16 MPI processes
MPI initialized with thread support level 3
OMP initialized with 8 OMP threads
AMReX (22.06-39-g2d931f63cb4d) initialized
WarpX (22.06-22-g6be401a3c732)
PICSAR (2becfe066559)
Level 0: dt = 4.338939207e-18 ; dx = 1.302083333e-09 ; dz = 1.302083333e-09
```
The `WarpX.e<JobID>` file shows the bus error:
```
srun: error: nid09797: task 7: Bus error
srun: launch/slurm: _step_signal: Terminating StepId=61191894.0
slurmstepd: error: *** STEP 61191894.0 ON nid09797 CANCELLED AT 2022-07-19T15:34:18 ***
srun: error: nid09797: tasks 3-4: Terminated
srun: error: nid09797: tasks 0,5-6: Terminated
srun: error: nid09797: task 1: Terminated
srun: error: nid09798: tasks 8-9,11,13,15: Terminated
srun: error: nid09797: task 2: Terminated
srun: error: nid09798: tasks 10,12,14: Terminated
srun: Force Terminated StepId=61191894.0
```
| Fails w/ OOM Error | Fails w/ Bus Error | Runs |
|---|---|---|
| 1 Node (8 MPI ranks) | 2 Nodes (16 MPI ranks) | 4 Nodes (32 MPI ranks) |
| `amr.blocking_factor = 128`<br>`amr.max_grid_size_x = 5760`<br>`amr.max_grid_size_y = 5760` | `amr.blocking_factor = 64`<br>`amr.max_grid_size_x = 2880`<br>`amr.max_grid_size_y = 5760` | `amr.blocking_factor = 32`<br>`amr.max_grid_size_x = 1440`<br>`amr.max_grid_size_y = 2880` |
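These `amr.*` settings steer how AMReX decomposes the 2D domain into boxes, and therefore how the memory footprint is spread across MPI ranks; smaller boxes on more nodes leave more headroom per rank. For readers unfamiliar with the format, such parameters appear as plain key-value lines in the input deck, e.g. (a hypothetical fragment for the 2-node case; the real decks are in input_sets.zip):

```
# hypothetical fragment for the 2-node (bus error) case;
# the real decks are in input_sets.zip
amr.blocking_factor = 64
amr.max_grid_size_x = 2880
amr.max_grid_size_y = 5760
```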
Only the setup asking for a single node with 8 MPI ranks fails with the (expected) OOM error in the `WarpX.e<JobID>` file:
```
slurmstepd: error: Detected 1 oom-kill event(s) in StepId=61191824.0. Some of your processes may have been killed by the cgroup out-of-memory handler.
srun: error: nid02519: task 5: Out Of Memory
srun: launch/slurm: _step_signal: Terminating StepId=61191824.0
slurmstepd: error: *** STEP 61191824.0 ON nid02519 CANCELLED AT 2022-07-19T15:24:02 ***
slurmstepd: error: Detected 1 oom-kill event(s) in StepId=61191824.batch. Some of your processes may have been killed by the cgroup out-of-memory handler.
```
---

Thanks for the report!
The problem here is that the allocation did not fail inside a `new` (which would throw a catchable `std::bad_alloc`), but the process instead got killed by the system (SIGKILL) on Cori, which we cannot handle at that level.

Maybe @kngott or @WeiqunZhang have more thoughts on this; I cannot see an obvious way to handle this in a more user-friendly manner right away.
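To illustrate the distinction, here is a minimal stand-alone C++ sketch (not WarpX/AMReX code; the allocation size is just an arbitrary impossible request): an in-process allocation failure is catchable, while a kill by the cgroup OOM handler is not:

```cpp
#include <cstddef>
#include <iostream>
#include <new>

int main()
{
    try {
        // An impossible request: operator new fails inside the process
        // and throws std::bad_alloc, which application code can catch
        // and turn into a readable error message.
        std::size_t n = std::size_t(1) << 58;  // ~2 EiB of doubles, illustrative
        double* p = new double[n];
        delete[] p;
    } catch (std::bad_alloc const& e) {
        std::cerr << "Allocation failed: " << e.what() << '\n';
        return 1;
    }

    // The failing runs here are the other case: with Linux memory
    // overcommit, an allocation can succeed and the process dies only
    // later, when touching the pages exhausts the node/cgroup. The
    // OOM killer then sends SIGKILL, which by design cannot be caught,
    // blocked, or ignored, so no try/catch or signal handler in the
    // application can produce a friendlier message.
    return 0;
}
```

Consistent with that picture, the clean OOM message in the 1-node run above also comes from `slurmstepd`/cgroups rather than from WarpX itself.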