Out-of-memory crash with large sampling
AMR-Wind crashes on Kestrel CPU nodes with an out-of-memory error when large sampling is requested (input file attached: static_box.txt). The error is below:
slurmstepd: error: Detected 1 oom_kill event in StepId=4893832.0. Some of the step tasks have been OOM Killed.
srun: error: x1008c5s2b0n0: task 108: Out Of Memory
Here is the sampling portion that is creating the out-of-memory error:
incflo.post_processing = box_lr
box_lr.output_format = netcdf
box_lr.output_frequency = 4
box_lr.fields = velocity
box_lr.labels = Low1
box_lr.Low1.type = PlaneSampler
box_lr.Low1.num_points = 1451 929
box_lr.Low1.origin = -985.0000 -10045.0000 5.0000
box_lr.Low1.axis1 = 29000.0000 0.0 0.0
box_lr.Low1.axis2 = 0.0 18560.0000 0.0
box_lr.Low1.normal = 0.0 0.0 1.0
box_lr.Low1.offsets = 0.0 20.0 40.0 60.0 80.0 100.0 120.0 140.0 160.0 180.0 200.0 220.0 240.0 260.0 280.0 300.0 320.0 340.0 360.0 380.0 400.0 420.0 440.0 460.0 480.0 500.0 520.0 540.0 560.0 580.0
I understand I'm asking for ~43M points, which will come with performance slowdowns. I can deal with the slowdown, but I find it odd that it crashes. The memory footprint of u, v, and w for all these points should be about 1 GB.
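A quick back-of-the-envelope check of those numbers (a rough estimate, assuming 8-byte doubles and counting only the three velocity components, no particle metadata or I/O buffers):

```python
# Rough size of the sampling request above: 1451 x 929 points per plane,
# 30 offsets, u/v/w stored as 8-byte doubles.
nx, ny = 1451, 929
noffsets = 30
npoints = nx * ny * noffsets

velocity_bytes = npoints * 3 * 8
print(f"{npoints / 1e6:.1f} M points")                    # ~40.4 M points
print(f"{velocity_bytes / 1e9:.2f} GB of velocity data")  # ~0.97 GB
```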
Things I tried (none worked):
- AMR-Wind compiled with GNU
- AMR-Wind compiled with Intel (oneapi classic)
- AMR-Wind at different commit points (not the very latest, though -- note my offsets keyword hasn't changed yet)
- native instead of netcdf (output format)
- Split the large sampling into smaller ones (keeping the total sampling the same)
- Runs with different number of nodes (all the way to 50)
Some observations:
- If I comment out about half of the offsets and request >= 4 nodes, it works.
- If I request very few nodes (2 or 3), it crashes in the first time step (around the temperature_solve), before any actual sampling is about to take place.
- If I request more nodes (>= 4), the main time loop starts, and it crashes on the very first time step where such sampling happens (in the example above, the 4th).
- If I leave all the offsets there, it crashes right after the "Creating SamplerBase instance: PlaneSampler" message.
Edit: Adding the input files for reproducibility: static_box.txt, setup_seagreen_prec_neutral.startAt0.i.txt
For a run that completes, what's the output of a build with this flag switched on? https://github.com/Exawind/amr-wind/blob/main/cmake/set_amrex_options.cmake#L19
Here it is. This run uses about half of the offsets in the list above. I also enabled the tiny profiler and can share the results if useful.
Pinned Memory Usage:
------------------------------------------------------------------------------------------------------------------------------------------------
Name Nalloc Nfree AvgMem min AvgMem avg AvgMem max MaxMem min MaxMem avg MaxMem max
------------------------------------------------------------------------------------------------------------------------------------------------
The_Pinned_Arena::Initialize() 312 312 60 B 119 B 153 B 8192 KiB 8192 KiB 8192 KiB
amr-wind::PlaneAveragingFine::compute_averages 6864 6864 7 B 7 B 7 B 3072 B 3072 B 3072 B
amr-wind::VelPlaneAveragingFine::compute_hvelmag_averages 10296 10296 4 B 4 B 4 B 3072 B 3072 B 3072 B
amr-wind::PlaneAveraging::compute_averages 10296 10296 0 B 1 B 2 B 768 B 768 B 768 B
amr-wind::VelPlaneAveraging::compute_hvelmag_averages 3432 3432 0 B 0 B 0 B 256 B 256 B 256 B
------------------------------------------------------------------------------------------------------------------------------------------------
Just realized that a memlog was created
Final Memory Profile Report Across Processes:
| Name | Current | High Water Mark |
|-----------------+--------------------+--------------------|
| Fab | 0 ... 0 B | 1496 ... 1869 MB |
| MemPool | 8192 ... 8192 KB | 8192 ... 8192 KB |
| BoxArrayHash | 0 ... 0 B | 4560 ... 4569 KB |
| BoxArray | 0 ... 0 B | 3896 ... 3896 KB |
|-----------------+--------------------+--------------------|
| Total | 8192 ... 8192 KB | |
| Name | Current # | High Water Mark # |
|-----------------+--------------------+--------------------|
| BoxArray Innard | 0 ... 0 | 40 ... 40 |
| MultiFab | 0 ... 0 | 994 ... 999 |
* Proc VmPeak VmSize VmHWM VmRSS
[ 5236 ... 8173 MB] [ 2175 ... 3136 MB] [ 4553 ... 7116 MB] [ 1677 ... 2639 MB]
* Node total free free+buffers+cached shared
[ 250 ... 250 GB] [ 122 ... 135 GB] [ 142 ... 145 GB] [ 534 ... 641 MB]
I got the same error message for a test case with 360 million grid points on 6 Kestrel nodes (using 96 cores on each). The case is similar to the one Regis submitted.
The failure seems to be happening only on CPU. I tried running the same simulations on GPU and the simulations ran without any issues.
I did not have success on the GPU. Time per time step increased five-fold on 2 GPU nodes, and the case still crashed OOM on a single GPU node.
I tried a larger case and OOM happened on both CPU and GPU.
Ganesh has tried it as well with his exawind-manager build that includes some of the extra HDF5 flags. He tried with 8 and 100 nodes. No luck, same error.
I will be looking into this with the case files @rthedin gave me. Hopefully this week.
I got it to work using 8 GPUs (2 nodes on Kestrel). If I use fewer, I can see the memory creeping up, followed by a crash. Each GPU has about 80 GB of memory.
Some preliminary data that Jon and I were looking at:
Case:
No AMR, ABL case, not very big:
Level 0 375 grids 12288000 cells 100 % of domain
smallest grid: 32 x 32 x 32 biggest grid: 32 x 32 x 32
Running on 1 Kestrel node, 104 ranks, 250GB of RAM, Intel build.
Sampling section of the input file is the interesting part:
incflo.post_processing = box_lr
# ---- Low-res sampling parameters ----
# box_lr.output_format = netcdf
box_lr.output_format = native
box_lr.output_frequency = 2
box_lr.fields = velocity
box_lr.labels = Low
# Low sampling grid spacing = 20 m
box_lr.Low.type = PlaneSampler
box_lr.Low.num_points = 1451 929
box_lr.Low.origin = -985.0000 -10045.0000 5.0000
box_lr.Low.axis1 = 29000.0000 0.0 0.0
box_lr.Low.axis2 = 0.0 18560.0000 0.0
box_lr.Low.normal = 0.0 0.0 1.0
box_lr.Low.offsets = 0.0 20.0 40.0 60.0 80.0 100.0 120.0 140.0 160.0 180.0 200.0 220.0 240.0 260.0 280.0 300.0 320.0 340.0 360.0 380.0 400.0 420.0 440.0 460.0 480.0 500.0 520.0 540.0 560.0 580.0
I ran with 0, 1, 2, and 4 sampling planes so I could get a simulation that completes. 4 time steps total, with a sampling frequency of 2.
Results
[memory usage plots (per rank, over time) for: no sampling, 1 plane, 2 planes, and 4 planes]
Conclusions
- with all the sampling planes, this causes an OOM
- with no sampling, AMR-Wind is using about 200 MB * 104 ranks = 20 GB, which works out to ~1.7 kB/cell. Is that reasonable? idk, maybe if I thought about it enough.
- Each additional sampling plane adds about 100 MB of RAM usage per rank. Naively I would expect (1451*929) particles * ((3+3) double fields * 8 bytes + 2 int fields * 4 bytes) = ~75 MB needed per plane, total -- but not per rank. Unclear why all the ranks need that much extra memory (see the quick check after this list).
- Rank 0 doing IO (? I think, need to confirm) is clearly visible. Or it is creating the particles and then calling redistribute to the other ranks (which would explain the time delay of the spike on rank 0 and then the other ranks' memory increasing).
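The quick check referenced in the list above: the naive per-plane estimate versus what the per-rank traces suggest (rough numbers; the 3+3 double + 2 int particle layout is the assumption from the conclusion above, not the actual AMReX layout, and the ~100 MB/rank figure is the observation quoted there):

```python
# Naive memory needed for one sampling plane vs. the observed growth
# across a 104-rank node. Particle layout (3+3 doubles + 2 ints) is an
# assumption, not the actual AMReX AoS layout.
particles_per_plane = 1451 * 929

bytes_per_particle = (3 + 3) * 8 + 2 * 4                # 56 bytes
naive_per_plane = particles_per_plane * bytes_per_particle
print(f"naive: {naive_per_plane / 1e6:.0f} MB per plane, total")              # ~75 MB

observed_per_rank = 100e6                               # ~100 MB extra per rank
ranks = 104
print(f"observed: {observed_per_rank * ranks / 1e9:.1f} GB per plane, node")  # ~10.4 GB
```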
Next steps
- Validate/invalidate hypothesis: particles created on rank 0 and then redistributed should instead be created on all ranks at the same time, in parallel (I've done this in other projects)
- Validate/invalidate hypothesis: the mismatch between my naive estimate of how much memory a particle needs and how much it actually uses is due to my lack of understanding of how much data a particle carries. Maybe we are adding more data fields to the particles than they actually need.
- Run through a "real" profiler to get finer-grained metrics.
Ok, I think I understand why "native" is not behaving the way I would expect, and I think I know why netcdf IO is using so much memory. Each rank is carrying m_output_buf and m_sample_buf of size nparticles * nvars. This is totally unnecessary for the native IO, and probably way too big for the netcdf IO. My first step is going to be making the native IO behave as I expect it to; then deal with netcdf.
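To put rough numbers on those buffers for the original 30-plane case (a sketch only; I'm assuming nvars here is just the three velocity components, stored as doubles):

```python
# Per-rank cost of carrying two full-size buffers (m_output_buf and
# m_sample_buf) of nparticles * nvars doubles, as described above.
# nvars = 3 (velocity only) is an assumption.
nparticles = 1451 * 929 * 30      # the full 30-offset request
nvars = 3

one_buffer = nparticles * nvars * 8
per_rank = 2 * one_buffer
print(f"{per_rank / 1e9:.1f} GB per rank")                # ~1.9 GB
print(f"{104 * per_rank / 1e9:.0f} GB across 104 ranks")  # ~202 GB on a 250 GB node
```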
Got some good news, at least on the native side:
[memory usage plots for: 4 planes with current amr-wind; 4 planes with https://github.com/Exawind/amr-wind/pull/1207; 30 planes with my branch (not possible with current amr-wind)]
Conclusions
- for 4 planes, most ranks just use around 200MB, which is probably just slightly more than with no planes
- there is a rank using much more memory in spikes. That has to do with how we do particle init (I think, though it doesn't explain everything). I will be working on this next.
- netcdf is still going to be an issue but I have thoughts on how to fix that too
#1209 has another round of improvements. Repeating the conclusion of that PR here:
- This PR removes 3 of the 4 huge memory spikes on rank 0 (over 2 time steps).
- Instead of 40GB of memory spike, it is now only a single 10GB memory spike, or a 4x improvement
- We also got a speedup: 4.25X per time step (2X over the total run time, 1.7 for init)
This should help the native and netcdf samplers.
This issue is stale because it has been open 30 days with no activity.
I keep seeing quite high memory consumption for large data sampling with netcdf output. From what I can tell, there is no significant improvement compared to the older versions.
Has the fix for this issue already been validated with netcdf sampling? Or am I just sampling too much data?
Hi, thank you for reaching out. Improving the memory consumption of the samplers is an ongoing effort. Right now our focus is on improving the memory consumption of the native pathway for the samplers, since it is more performant to begin with (and improvements there impact the netcdf samplers at the same time). So my recommendation has been to encourage users to use that pathway. There are example scripts in the tools directory for manipulating the resulting data with Python.
After #1235 I think I am going to spend time on other things and close this for now. The native format for samplers is now fully scalable (no bottlenecks on rank 0, no extra memory consumption on rank 0). I would encourage all users to use the native samplers. We have example scripts (see e.g. discussion here: https://github.com/Exawind/amr-wind/discussions/1305#discussioncomment-11009986) for reading the amrex particles.
The netcdf pathway has the following limitations that would need to be lifted:
- Each rank holds the full vector of points (lots of memory); the particles on that rank send their data to the entries of the vector they correspond to, and then there is a sum reduction of that vector of points onto the io rank (the mostly empty entries on each rank simply drop out of the sum -- hence the memory bloat; see the sketch below). Lifting this would mean each rank only holding the vector of points it owns (fine), but then somehow sending that data to the io rank so that it gets indexed into the full vector of points properly. We've done that with something like an MPI ialltoallv, but it gets gnarly fast. And it still doesn't make the process scalable because...
- ...we are still using a single io rank to gather all the data and do the writing to file, which is a bottleneck.
- The long-term solution would be to implement a parallel netcdf write. It's unclear to me what the benefit of that would be if we already have a fast, scalable way of sampling (via the native format).
I am willing to entertain counter arguments here so please feel free to voice your thoughts. My thinking right now would be to write a simple conversion tool as a post-simulation step if netcdf format is absolutely necessary. I don't know when I will get to such a thing but I am happy to help people with the existing python tools we have for reading these amrex particles.
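For readers less familiar with the gather pattern described in the first bullet above, here is a minimal mpi4py sketch of it. This is illustrative only (names and sizes are made up, and AMR-Wind's actual implementation is C++), but it shows both problems: the full-size buffer on every rank and the single io rank doing the write.

```python
# Minimal sketch of the netcdf gather pattern described above: every rank
# allocates the FULL vector of points, fills only the entries it owns, and a
# sum reduction collapses everything onto the io rank, which then writes the
# file serially. Illustrative only -- this is not AMR-Wind code.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, nranks = comm.Get_rank(), comm.Get_size()

npoints, nvars = 1_000_000, 3
full = np.zeros((npoints, nvars))   # full-size buffer on EVERY rank -> memory bloat

# Pretend each rank owns a contiguous slice of the points and samples them.
lo = rank * npoints // nranks
hi = (rank + 1) * npoints // nranks
full[lo:hi, :] = 1.0                # stand-in for sampled velocity values

# Sum reduction to the io rank: entries a rank does not own are zero, so
# they simply drop out of the sum.
gathered = np.zeros_like(full) if rank == 0 else None
comm.Reduce(full, gathered, op=MPI.SUM, root=0)

if rank == 0:
    # Single io rank would write the netcdf file here -> serial bottleneck.
    pass
```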
Thanks for the detailed clarification. For us the problem with the native format is that it produces lots of small files, which raises a cluster-specific issue: our cluster has a predefined limit of 10^6 files for each user account. That limit would already be exceeded by one of the larger cases when using the native sampling format, for example for precursor data. So in fact native is not usable on our cluster. The only way to use AMR-Wind with larger samplings on our cluster is with netcdf output, where we are currently struggling a lot with the extra memory consumption.
Huh, that's a new one for me ;) Each user can have a max of 10^6 files total? Anyway, what is the value of amr.max_grid_size in your input file, and are you using CPUs or GPUs?
By chunkfile do you mean inode?
I am still curious about the value of that parameter. But you can control the number of files for particles (the samplers) with the input file command: particles.particles_nfiles = 256, where 256 is the default. You can change it to a smaller value, say 64 or 32 or 16.
here's what that looks like:
1 rank, default
np1/post_processing/volume_sampling00000
└── particles
├── Header
├── Level_0
│ ├── DATA_00000
│ └── Particle_H
└── Level_1
├── DATA_00000
└── Particle_H
4 directories, 5 files
10 ranks, default
❯ tree np10/post_processing/volume_sampling00000
np10/post_processing/volume_sampling00000
└── particles
├── Header
├── Level_0
│ ├── DATA_00000
│ ├── DATA_00006
│ ├── DATA_00007
│ └── Particle_H
└── Level_1
├── DATA_00003
├── DATA_00006
├── DATA_00007
└── Particle_H
4 directories, 9 files
10 ranks, particles.particles_nfiles=1
❯ tree post_processing/volume_sampling00000
post_processing/volume_sampling00000
└── particles
├── Header
├── Level_0
│ ├── DATA_00000
│ └── Particle_H
└── Level_1
├── DATA_00000
└── Particle_H
4 directories, 5 files
I am adding a new user option to do the same type of control for the plot and checkpoint files: #1320. Once that is merged you could add io.nfiles = 32 (default 256) to reduce the number of files in the plot and checkpoint directories.
Yes, exactly: each user has a maximum of 10^6 files in total, which is quite an annoying limitation. We have already been in touch with HPC support about it. According to them, an unsuitable filesystem imposes that limit; they are aware of it, but there is no quick workaround. So for now we have to live with that limitation and find ways to work around it.
Thanks for informing me about the possibility of reducing the number of files, and for adding the option for the checkpoints as well. That could indeed be one of the workarounds for the file-count limit. I will try it in the next few weeks and keep you posted about our experience.