Out-of-memory crash with large sampling

Open rthedin opened this issue 6 months ago • 14 comments

AMR-Wind crashes on Kestrel CPU nodes with an out-of-memory error when a large sampling is requested. The error is below:

 slurmstepd: error: Detected 1 oom_kill event in StepId=4893832.0. Some of the step tasks have been OOM Killed.
 srun: error: x1008c5s2b0n0: task 108: Out Of Memory

Here is the sampling portion that is creating the out-of-memory error:

incflo.post_processing                =  box_lr 

box_lr.output_format    = netcdf
box_lr.output_frequency = 4
box_lr.fields           = velocity
box_lr.labels           = Low1 

box_lr.Low1.type         = PlaneSampler
box_lr.Low1.num_points   = 1451 929
box_lr.Low1.origin       = -985.0000 -10045.0000 5.0000
box_lr.Low1.axis1        = 29000.0000 0.0 0.0
box_lr.Low1.axis2        = 0.0 18560.0000 0.0
box_lr.Low1.normal       = 0.0 0.0 1.0
box_lr.Low1.offsets      = 0.0 20.0 40.0 60.0 80.0 100.0 120.0 140.0 160.0 180.0 200.0 220.0 240.0 260.0 280.0 300.0 320.0 340.0 360.0 380.0 400.0 420.0 440.0 460.0 480.0 500.0 520.0 540.0 560.0 580.0

I understand I'm asking for ~40M points (1451 × 929 points per plane × 30 offsets), which will come with performance slowdowns. I can deal with the slowdown, but I find it odd that it crashes: the memory footprint of u, v, and w at all these points should be about 1 GB.
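For reference, the estimate can be checked with a quick calculation. This is only a sketch: it assumes each velocity component is stored as an 8-byte double, and the function name `sampling_footprint` is made up for illustration, not part of AMR-Wind.

```python
# Back-of-the-envelope size of the sampled velocity data for box_lr.Low1.
# Assumption: each value is an 8-byte double (not confirmed from the build).

num_points = (1451, 929)   # box_lr.Low1.num_points
num_offsets = 30           # entries in box_lr.Low1.offsets (0.0 to 580.0, step 20.0)
num_components = 3         # u, v, w
bytes_per_value = 8        # assumed double precision


def sampling_footprint(num_points, num_offsets, num_components, bytes_per_value):
    """Return (total sample points, bytes for one copy of the sampled field)."""
    points = num_points[0] * num_points[1] * num_offsets
    return points, points * num_components * bytes_per_value


points, nbytes = sampling_footprint(
    num_points, num_offsets, num_components, bytes_per_value
)
print(f"{points:,} points, {nbytes / 1e9:.2f} GB per copy")
# prints "40,439,370 points, 0.97 GB per copy"
```

This counts a single copy of the sampled values only. Sample coordinates, communication buffers, and output buffers would come on top of it, and the calculation says nothing about how the data are distributed across ranks.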

Things I tried (none worked):

  • AMR-Wind compiled with GNU
  • AMR-Wind compiled with Intel (oneapi classic)
  • AMR-Wind at different commits (not the very latest, though; note that my offset keyword hasn't changed yet)
  • native output format instead of netcdf
  • Splitting the large sampling into smaller ones (keeping the total sampling the same)
  • Runs with different numbers of nodes (up to 50)

Some observations:

  • If I comment out about half of the offsets and request >=4 nodes, it works.
    • If I request very few nodes (2 or 3), it crashes in the first time step (around the temperature_solve), before any actual sampling takes place.
    • If I request more nodes (>=4), the main time loop starts, and it crashes on the very first time step where such sampling happens (in the example above, the 4th).
  • If I leave all the offsets there, it crashes right after the Creating SamplerBase instance: PlaneSampler message.

Edit: Adding the input files for reproducibility: static_box.txt, setup_seagreen_prec_neutral.startAt0.i.txt

rthedin, Aug 06 '24 21:08