amr-wind
Out-of-memory crash with large sampling
AMR-Wind crashes on Kestrel CPU nodes with an out-of-memory error when a large sampling is requested. The error is below:
slurmstepd: error: Detected 1 oom_kill event in StepId=4893832.0. Some of the step tasks have been OOM Killed.
srun: error: x1008c5s2b0n0: task 108: Out Of Memory
Here is the sampling portion that is creating the out-of-memory error:
incflo.post_processing = box_lr
box_lr.output_format = netcdf
box_lr.output_frequency = 4
box_lr.fields = velocity
box_lr.labels = Low1
box_lr.Low1.type = PlaneSampler
box_lr.Low1.num_points = 1451 929
box_lr.Low1.origin = -985.0000 -10045.0000 5.0000
box_lr.Low1.axis1 = 29000.0000 0.0 0.0
box_lr.Low1.axis2 = 0.0 18560.0000 0.0
box_lr.Low1.normal = 0.0 0.0 1.0
box_lr.Low1.offsets = 0.0 20.0 40.0 60.0 80.0 100.0 120.0 140.0 160.0 180.0 200.0 220.0 240.0 260.0 280.0 300.0 320.0 340.0 360.0 380.0 400.0 420.0 440.0 460.0 480.0 500.0 520.0 540.0 560.0 580.0
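To make the geometry of this request concrete, here is a small Python sketch that works out the in-plane spacing and the number of planes from the values above. It assumes the sampler spaces points evenly along each axis with both endpoints included; that is an assumption about how `num_points` is interpreted, not something confirmed in this post. The helper name `spacing` is only for illustration.

```python
# Values copied from the box_lr.Low1 block above.
num_points = (1451, 929)      # points along axis1 and axis2
axis1_length = 29000.0        # m, length of axis1 (x direction)
axis2_length = 18560.0        # m, length of axis2 (y direction)
offsets = [20.0 * k for k in range(30)]  # 0.0, 20.0, ..., 580.0


def spacing(length, n):
    """Spacing between n evenly spaced points spanning `length`.

    Assumes both endpoints are included, hence the (n - 1) intervals.
    """
    return length / (n - 1)


dx = spacing(axis1_length, num_points[0])
dy = spacing(axis2_length, num_points[1])

print(f"dx = {dx} m, dy = {dy} m")
print(f"planes = {len(offsets)}, highest offset = {offsets[-1]} m")
```

Under that assumption the request is a uniform 20 m grid in x and y, repeated on 30 planes stacked every 20 m along the normal.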
I understand I'm asking for ~40M points (1451 × 929 × 30), which will come with a performance slowdown. I can deal with the slowdown, but I find it odd that it crashes. The memory footprint of u, v, and w for all these points should be about 1 GB.
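As a rough check of that estimate, the arithmetic can be sketched in a few lines of Python. It assumes each velocity component is stored as an 8-byte double and counts only the sampled values themselves. Point coordinates, per-rank copies, and I/O buffers are not included, so the real footprint inside the code may be larger.

```python
# Sampling dimensions taken from the input above.
nx, ny = 1451, 929        # points per plane along axis1 and axis2
n_planes = 30             # number of entries in `offsets`
n_components = 3          # u, v, w
bytes_per_value = 8       # assumption: double precision

points_per_plane = nx * ny
total_points = points_per_plane * n_planes
total_bytes = total_points * n_components * bytes_per_value

print(f"points per plane: {points_per_plane:,}")
print(f"total points:     {total_points:,}")
print(f"velocity data:    {total_bytes / 1e9:.2f} GB")
```

This gives about 40.4 million points and just under 1 GB of velocity data per output step.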
Things I tried (none worked):
- AMR-Wind compiled with GNU
- AMR-Wind compiled with Intel (oneapi classic)
- AMR-Wind at different commit points (not the very latest, though; note my `offset` keyword hasn't changed yet.)
- `native` instead of `netcdf`
- Split the large sampling into smaller ones (keeping the total sampling the same)
- Runs with different number of nodes (all the way to 50)
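For reference, one possible way to do the split mentioned above is to divide the offsets among several labels of the same sampler. The Python sketch below generates such an input fragment using only the keys that appear in the original block. The label names (`Low1_0`, `Low1_1`, ...) and the choice of three chunks are hypothetical, and this is not necessarily how the split was actually done in the runs described here.

```python
def split_offsets(offsets, n_chunks):
    """Split a list into n_chunks contiguous pieces of near-equal size."""
    size, extra = divmod(len(offsets), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        end = start + size + (1 if i < extra else 0)
        chunks.append(offsets[start:end])
        start = end
    return chunks


def sampler_block(sampler, label, offsets):
    """Render one PlaneSampler label, reusing the geometry from the input above."""
    p = f"{sampler}.{label}"
    return "\n".join([
        f"{p}.type = PlaneSampler",
        f"{p}.num_points = 1451 929",
        f"{p}.origin = -985.0000 -10045.0000 5.0000",
        f"{p}.axis1 = 29000.0000 0.0 0.0",
        f"{p}.axis2 = 0.0 18560.0000 0.0",
        f"{p}.normal = 0.0 0.0 1.0",
        f"{p}.offsets = " + " ".join(f"{o:.1f}" for o in offsets),
    ])


offsets = [20.0 * k for k in range(30)]              # 0.0 ... 580.0
chunks = split_offsets(offsets, 3)                   # hypothetical: 3 chunks
labels = [f"Low1_{i}" for i in range(len(chunks))]   # hypothetical label names

print(f"box_lr.labels = {' '.join(labels)}")
for label, chunk in zip(labels, chunks):
    print(sampler_block("box_lr", label, chunk))
```

The total number of sampled points is unchanged by this split; only the grouping into labels differs.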
Some observations:
- If I comment out about half of the `offsets` and request >=4 nodes, it works.
- If I request very few nodes (2 or 3), it crashes in the first time step (around the temperature_solve), before any actual sampling is about to take place.
- If I request more nodes (>=4), the main time loop starts, and it crashes on the very first time step where such sampling is happening (in the example above, the 4th).
- If I leave all the offsets there, it crashes right after the `Creating SamplerBase instance: PlaneSampler` message.
Edit: Adding the input files for reproducibility: static_box.txt, setup_seagreen_prec_neutral.startAt0.i.txt