Out-of-memory crash with large sampling
AMR-Wind crashes on Kestrel CPU nodes with an out-of-memory error when large sampling is requested (input file attached: static_box.txt). The error is below:
slurmstepd: error: Detected 1 oom_kill event in StepId=4893832.0. Some of the step tasks have been OOM Killed.
srun: error: x1008c5s2b0n0: task 108: Out Of Memory
Here is the sampling portion that is creating the out-of-memory error:
incflo.post_processing = box_lr
box_lr.output_format = netcdf
box_lr.output_frequency = 4
box_lr.fields = velocity
box_lr.labels = Low1
box_lr.Low1.type = PlaneSampler
box_lr.Low1.num_points = 1451 929
box_lr.Low1.origin = -985.0000 -10045.0000 5.0000
box_lr.Low1.axis1 = 29000.0000 0.0 0.0
box_lr.Low1.axis2 = 0.0 18560.0000 0.0
box_lr.Low1.normal = 0.0 0.0 1.0
box_lr.Low1.offsets = 0.0 20.0 40.0 60.0 80.0 100.0 120.0 140.0 160.0 180.0 200.0 220.0 240.0 260.0 280.0 300.0 320.0 340.0 360.0 380.0 400.0 420.0 440.0 460.0 480.0 500.0 520.0 540.0 560.0 580.0
I understand I'm asking for ~43M points, which will come with performance slowdowns. I can deal with the slowdown, but I find it odd that it crashes. The memory footprint of u, v, and w for all these points should be about 1 GB.
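A quick back-of-the-envelope check of those numbers (a rough estimate, assuming 8-byte doubles and counting only the three velocity components, no particle metadata or I/O buffers):

```python
# Rough size of the sampling request above: 1451 x 929 points per plane,
# 30 offsets, u/v/w stored as 8-byte doubles.
nx, ny = 1451, 929
noffsets = 30
npoints = nx * ny * noffsets

velocity_bytes = npoints * 3 * 8
print(f"{npoints / 1e6:.1f} M points")                    # ~40.4 M points
print(f"{velocity_bytes / 1e9:.2f} GB of velocity data")  # ~0.97 GB
```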
Things I tried (none worked):
- AMR-Wind compiled with GNU
- AMR-Wind compiled with Intel (oneapi classic)
- AMR-Wind at different commit points (not the very latest, though -- note my offsets keyword hasn't changed yet)
- native instead of netcdf (output format)
- Split the large sampling into smaller ones (keeping the total sampling the same)
- Runs with different number of nodes (all the way to 50)
Some observations:
- If I comment out about half of the offsets and request >= 4 nodes, it works.
- If I request very few nodes (2 or 3), it crashes in the first time step (around the temperature_solve), before any actual sampling is about to take place.
- If I request more nodes (>= 4), the main time loop starts, and it crashes on the very first time step where such sampling happens (in the example above, the 4th).
- If I leave all the offsets there, it crashes right after the "Creating SamplerBase instance: PlaneSampler" message.
Edit: Adding the input files for reproducibility: static_box.txt, setup_seagreen_prec_neutral.startAt0.i.txt
For a run that completes, what's the output of a build with this flag switched on? https://github.com/Exawind/amr-wind/blob/main/cmake/set_amrex_options.cmake#L19
Here it is. This run uses about half of the offsets in the list above. I also enabled the tiny profiler and can share the results if useful.
Pinned Memory Usage:
------------------------------------------------------------------------------------------------------------------------------------------------
Name Nalloc Nfree AvgMem min AvgMem avg AvgMem max MaxMem min MaxMem avg MaxMem max
------------------------------------------------------------------------------------------------------------------------------------------------
The_Pinned_Arena::Initialize() 312 312 60 B 119 B 153 B 8192 KiB 8192 KiB 8192 KiB
amr-wind::PlaneAveragingFine::compute_averages 6864 6864 7 B 7 B 7 B 3072 B 3072 B 3072 B
amr-wind::VelPlaneAveragingFine::compute_hvelmag_averages 10296 10296 4 B 4 B 4 B 3072 B 3072 B 3072 B
amr-wind::PlaneAveraging::compute_averages 10296 10296 0 B 1 B 2 B 768 B 768 B 768 B
amr-wind::VelPlaneAveraging::compute_hvelmag_averages 3432 3432 0 B 0 B 0 B 256 B 256 B 256 B
------------------------------------------------------------------------------------------------------------------------------------------------
Just realized that a memlog was created
Final Memory Profile Report Across Processes:
| Name | Current | High Water Mark |
|-----------------+--------------------+--------------------|
| Fab | 0 ... 0 B | 1496 ... 1869 MB |
| MemPool | 8192 ... 8192 KB | 8192 ... 8192 KB |
| BoxArrayHash | 0 ... 0 B | 4560 ... 4569 KB |
| BoxArray | 0 ... 0 B | 3896 ... 3896 KB |
|-----------------+--------------------+--------------------|
| Total | 8192 ... 8192 KB | |
| Name | Current # | High Water Mark # |
|-----------------+--------------------+--------------------|
| BoxArray Innard | 0 ... 0 | 40 ... 40 |
| MultiFab | 0 ... 0 | 994 ... 999 |
* Proc VmPeak VmSize VmHWM VmRSS
[ 5236 ... 8173 MB] [ 2175 ... 3136 MB] [ 4553 ... 7116 MB] [ 1677 ... 2639 MB]
* Node total free free+buffers+cached shared
[ 250 ... 250 GB] [ 122 ... 135 GB] [ 142 ... 145 GB] [ 534 ... 641 MB]
I got the same error message for a test case with 360 million grid points on 6 Kestrel nodes (using 96 cores on each). The case is similar to the one Regis submitted.
The failure seems to be happening only on CPU. I tried running the same simulations on GPU and the simulations ran without any issues.
I did not have success on the GPU. Time per time step increased five-fold on 2 GPU nodes, and the case still crashed OOM on a single GPU node.
I tried a larger case and OOM happened on both CPU and GPU.
Ganesh has tried it as well with his exawind-manager build that includes some of the extra HDF5 flags. He tried with 8 and 100 nodes. No luck, same error.
I will be looking into this with the case files @rthedin gave me. Hopefully this week.
I got it to work using 8 GPUs (2 nodes on Kestrel). If I use fewer, I can see the memory creeping up, followed by a crash. Each GPU has about 80 GB of memory.
Some preliminary data that Jon and I were looking at:
Case:
No AMR, ABL case, not very big:
Level 0 375 grids 12288000 cells 100 % of domain
smallest grid: 32 x 32 x 32 biggest grid: 32 x 32 x 32
Running on 1 Kestrel node, 104 ranks, 250GB of RAM, Intel build.
Sampling section of the input file is the interesting part:
incflo.post_processing = box_lr
# ---- Low-res sampling parameters ----
# box_lr.output_format = netcdf
box_lr.output_format = native
box_lr.output_frequency = 2
box_lr.fields = velocity
box_lr.labels = Low
# Low sampling grid spacing = 20 m
box_lr.Low.type = PlaneSampler
box_lr.Low.num_points = 1451 929
box_lr.Low.origin = -985.0000 -10045.0000 5.0000
box_lr.Low.axis1 = 29000.0000 0.0 0.0
box_lr.Low.axis2 = 0.0 18560.0000 0.0
box_lr.Low.normal = 0.0 0.0 1.0
box_lr.Low.offsets = 0.0 20.0 40.0 60.0 80.0 100.0 120.0 140.0 160.0 180.0 200.0 220.0 240.0 260.0 280.0 300.0 320.0 340.0 360.0 380.0 400.0 420.0 440.0 460.0 480.0 500.0 520.0 540.0 560.0 580.0
I ran with 0, 1, 2, and 4 sampling planes so I could get a simulation that completes. 4 time steps total, with a sampling frequency of 2.
Results
[memory usage plots (per rank, over time) for: no sampling, 1 plane, 2 planes, and 4 planes]
Conclusions
- with all the sampling planes, this causes an OOM
- with no sampling, AMR-Wind is using about 200 MB * 104 ranks = 20 GB, which works out to ~1.7 kB/cell. Is that reasonable? idk, maybe if I thought about it enough.
- Each additional sampling plane adds about 100 MB of RAM usage per rank. Naively I would expect (1451*929) particles * ((3+3) double fields * 8 bytes + 2 int fields * 4 bytes) = ~75 MB needed per plane, total -- but not per rank. Unclear why all the ranks need that much extra memory (see the quick check after this list).
- Rank 0 doing IO (? I think, need to confirm) is clearly visible. Or it is creating the particles and then calling redistribute to the other ranks (which would explain the time delay of the spike on rank 0 and then the other ranks' memory increasing).
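The quick check referenced in the list above: the naive per-plane estimate versus what the per-rank traces suggest (rough numbers; the 3+3 double + 2 int particle layout is the assumption from the conclusion above, not the actual AMReX layout, and the ~100 MB/rank figure is the observation quoted there):

```python
# Naive memory needed for one sampling plane vs. the observed growth
# across a 104-rank node. Particle layout (3+3 doubles + 2 ints) is an
# assumption, not the actual AMReX AoS layout.
particles_per_plane = 1451 * 929

bytes_per_particle = (3 + 3) * 8 + 2 * 4                # 56 bytes
naive_per_plane = particles_per_plane * bytes_per_particle
print(f"naive: {naive_per_plane / 1e6:.0f} MB per plane, total")              # ~75 MB

observed_per_rank = 100e6                               # ~100 MB extra per rank
ranks = 104
print(f"observed: {observed_per_rank * ranks / 1e9:.1f} GB per plane, node")  # ~10.4 GB
```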
Next steps
- Validate/invalidate hypothesis: particles created on rank 0 and then redistributed should instead be created on all ranks at the same time, in parallel (I've done this in other projects)
- Validate/invalidate hypothesis: the mismatch between my naive estimate of how much memory a particle needs and how much it actually uses is due to my lack of understanding of how much data a particle carries. Maybe we are adding more data fields to the particles than they actually need.
- Run through a "real" profiler to get finer-grained metrics.
Ok, I think I understand why "native" is not behaving the way I would expect, and I think I know why netcdf IO is using so much memory. Each rank is carrying m_output_buf and m_sample_buf of size nparticles * nvars. This is totally unnecessary for the native IO, and probably way too big for the netcdf IO. My first step is going to be making the native IO behave as I expect it to; then deal with netcdf.
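To put rough numbers on those buffers for the original 30-plane case (a sketch only; I'm assuming nvars here is just the three velocity components, stored as doubles):

```python
# Per-rank cost of carrying two full-size buffers (m_output_buf and
# m_sample_buf) of nparticles * nvars doubles, as described above.
# nvars = 3 (velocity only) is an assumption.
nparticles = 1451 * 929 * 30      # the full 30-offset request
nvars = 3

one_buffer = nparticles * nvars * 8
per_rank = 2 * one_buffer
print(f"{per_rank / 1e9:.1f} GB per rank")                # ~1.9 GB
print(f"{104 * per_rank / 1e9:.0f} GB across 104 ranks")  # ~202 GB on a 250 GB node
```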
Got some good news, at least on the native side:
[memory usage plots for: 4 planes with current amr-wind; 4 planes with https://github.com/Exawind/amr-wind/pull/1207; 30 planes with my branch (not possible with current amr-wind)]
Conclusions
- for 4 planes, most ranks just use around 200MB, which is probably just slightly more than with no planes
- there is a rank using much more memory in spikes. That has to do with how we do particle init (I think, though it doesn't explain everything). I will be working on this next.
- netcdf is still going to be an issue but I have thoughts on how to fix that too
#1209 has another round of improvements. Repeating the conclusion of that PR here:
- This PR removes 3 of the 4 huge memory spikes on rank 0 (over 2 time steps).
- Instead of 40GB of memory spike, it is now only a single 10GB memory spike, or a 4x improvement
- We also got a speedup: 4.25X per time step (2X over the total run time, 1.7 for init)
This should help the native and netcdf samplers.
This issue is stale because it has been open 30 days with no activity.
I keep seeing quite high memory consumption for large data sampling with netcdf output. From what I can tell, there is no significant improvement compared to the older versions.
Has the fix for this issue already been validated with netcdf sampling? Or am I just sampling too much data?
Hi, thank you for reaching out. Improving the memory consumption of the samplers is an ongoing effort. Right now our focus is on improving the memory consumption of the native pathway for the samplers, since it is more performant to begin with (and improvements there impact the netcdf samplers at the same time). So my recommendation has been to encourage users to use that pathway. There are example scripts in the tools directory for manipulating the resulting data with Python.
After #1235 I think I am going to spend time on other things and close this for now. The native format for samplers is now fully scalable (no bottlenecks on rank 0, no extra memory consumption on rank 0). I would encourage all users to use the native samplers. We have example scripts (see e.g. discussion here: https://github.com/Exawind/amr-wind/discussions/1305#discussioncomment-11009986) for reading the amrex particles.
The netcdf pathway has the following limitations that would need to be lifted:
- Each rank holds the full vector of points (lots of memory); the particles on that rank send their data to the entries of the vector they correspond to, and then there is a sum reduction of that vector of points onto the io rank (the mostly empty entries on each rank simply drop out of the sum -- hence the memory bloat; see the sketch below). Lifting this would mean each rank only holding the vector of points it owns (fine), but then somehow sending that data to the io rank so that it gets indexed into the full vector of points properly. We've done that with something like an MPI ialltoallv, but it gets gnarly fast. And it still doesn't make the process scalable because...
- ...we are still using a single io rank to gather all the data and do the writing to file, which is a bottleneck.
- The long-term solution would be to implement a parallel netcdf write. It's unclear to me what the benefit of that would be if we already have a fast, scalable way of sampling (via the native format).
I am willing to entertain counter arguments here so please feel free to voice your thoughts. My thinking right now would be to write a simple conversion tool as a post-simulation step if netcdf format is absolutely necessary. I don't know when I will get to such a thing but I am happy to help people with the existing python tools we have for reading these amrex particles.
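For readers less familiar with the gather pattern described in the first bullet above, here is a minimal mpi4py sketch of it. This is illustrative only (names and sizes are made up, and AMR-Wind's actual implementation is C++), but it shows both problems: the full-size buffer on every rank and the single io rank doing the write.

```python
# Minimal sketch of the netcdf gather pattern described above: every rank
# allocates the FULL vector of points, fills only the entries it owns, and a
# sum reduction collapses everything onto the io rank, which then writes the
# file serially. Illustrative only -- this is not AMR-Wind code.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, nranks = comm.Get_rank(), comm.Get_size()

npoints, nvars = 1_000_000, 3
full = np.zeros((npoints, nvars))   # full-size buffer on EVERY rank -> memory bloat

# Pretend each rank owns a contiguous slice of the points and samples them.
lo = rank * npoints // nranks
hi = (rank + 1) * npoints // nranks
full[lo:hi, :] = 1.0                # stand-in for sampled velocity values

# Sum reduction to the io rank: entries a rank does not own are zero, so
# they simply drop out of the sum.
gathered = np.zeros_like(full) if rank == 0 else None
comm.Reduce(full, gathered, op=MPI.SUM, root=0)

if rank == 0:
    # Single io rank would write the netcdf file here -> serial bottleneck.
    pass
```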
Thanks for the detailed clarification. For us the problem with the native format is that it produces lots of small files, which raises a cluster-specific issue: our cluster has a predefined limit of 10^6 files for each user account. That limit would already be exceeded by one of the larger cases when using the native sampling format, for example for precursor data. So in fact native is not usable on our cluster. The only way to use AMR-Wind with larger samplings on our cluster is with netcdf output, where we are currently struggling a lot with the extra memory consumption.
Huh, that's a new one for me ;) Each user can have a max of 10^6 files total? Anyway, what is the value of amr.max_grid_size in your input file, and are you using CPUs or GPUs?
By chunkfile do you mean inode?
I am still curious about the value of that parameter. But you can control the number of files for particles (the samplers) with the input file command: particles.particles_nfiles = 256, where 256 is the default. You can change it to a smaller value, say 64 or 32 or 16.
here's what that looks like:
1 rank, default
np1/post_processing/volume_sampling00000
└── particles
├── Header
├── Level_0
│ ├── DATA_00000
│ └── Particle_H
└── Level_1
├── DATA_00000
└── Particle_H
4 directories, 5 files
10 ranks, default
❯ tree np10/post_processing/volume_sampling00000
np10/post_processing/volume_sampling00000
└── particles
├── Header
├── Level_0
│ ├── DATA_00000
│ ├── DATA_00006
│ ├── DATA_00007
│ └── Particle_H
└── Level_1
├── DATA_00003
├── DATA_00006
├── DATA_00007
└── Particle_H
4 directories, 9 files
10 ranks, particles.particles_nfiles=1
❯ tree post_processing/volume_sampling00000
post_processing/volume_sampling00000
└── particles
├── Header
├── Level_0
│ ├── DATA_00000
│ └── Particle_H
└── Level_1
├── DATA_00000
└── Particle_H
4 directories, 5 files
I am adding a new user option to do the same type of control for the plot and checkpoint files: #1320. Once that is merged you could add io.nfiles = 32 (default 256) to reduce the number of files in the plot and checkpoint directories.
Yes, exactly: each user has a maximum of 10^6 files in total, which is quite an annoying limitation. We have already been in touch with HPC support about it. According to them, an unsuitable filesystem imposes that limit; they are aware of it, but there is no quick workaround. So for now we have to live with that limitation and find ways to work around it.
Thanks for informing me about the possibility of reducing the number of files, and for adding the option for the checkpoints as well. That could indeed be one of the workarounds for the file-count limit. I will try it in the next few weeks and keep you posted about our experience.