amr-wind
Remove memory spike in Sampling
Summary
This removes the memory spike on the ioproc when doing sampling by initializing the sampling particles in parallel.
Pull request type
Please check the type of change introduced:
- [ ] Bugfix
- [ ] Feature
- [ ] Code style update (formatting, renaming)
- [x] Refactoring (no functional changes, no api changes)
- [ ] Build related changes
- [ ] Documentation content changes
- [ ] Other (please describe):
Checklist
The following is included:
- [ ] new unit-test(s)
- [x] new regression test(s)
- [ ] documentation for new capability
This PR was tested by running:
- the unit tests
  - [ ] on GPU
  - [ ] on CPU
- the regression tests
  - [ ] on GPU
  - [ ] on CPU
Additional background
Issue Number: #1186
This works, except that now all ranks need to know about the probe locations. So it ends up needing more memory in total, even though there is no spike. The next step I want to take before this PR is merged is to have each rank only know about the probe locations relevant to it. Basically, I want to pass a box to sampling_locations and get back only the locations inside that box. But I need to switch to something else for a bit.
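The box-filtering idea above could be sketched roughly like this. This is a minimal sketch: `RealVect` and `RealBox` here are simplified stand-ins for the AMReX types, and this `sampling_locations` signature is hypothetical, not the actual amr-wind API.

```cpp
#include <array>
#include <cassert>
#include <vector>

// Simplified stand-ins for amrex::RealVect and amrex::RealBox
using RealVect = std::array<double, 3>;

struct RealBox {
    RealVect lo;
    RealVect hi;
    // Half-open containment test: lo <= p < hi in each direction
    bool contains(const RealVect& p) const {
        for (int d = 0; d < 3; ++d) {
            if (p[d] < lo[d] || p[d] >= hi[d]) {
                return false;
            }
        }
        return true;
    }
};

// Hypothetical signature: each rank asks only for the probe locations
// inside a box it owns, instead of every rank holding all of them.
std::vector<RealVect> sampling_locations(
    const std::vector<RealVect>& all_locs, const RealBox& box)
{
    std::vector<RealVect> local;
    for (const auto& p : all_locs) {
        if (box.contains(p)) {
            local.push_back(p);
        }
    }
    return local;
}
```

With something like 40M probes, every rank holding all locations is the total-memory cost noted above; filtering per box keeps each rank's copy proportional to the probes it actually owns.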
Status update before I leave this aside for a bit:
- There is currently a hard check that the number of particles added matches what was expected to be added. This is a good check, but if a sampling location is not in the domain, it will fail. Do I silently filter out sampling locations outside the domain, or abort? This shows up in the dam_break_godunov test case.
- Now that I can dynamically add sampling locations, I am going to have each rank only handle the sampling locations that are in the boxes it owns. That should further lower the memory consumption.
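One way to reconcile the hard count check with out-of-domain probes, under the "silently filter" option, would be to count only the in-domain locations up front and compare the particles actually added against that. This is a sketch, not the code in this PR; `in_domain` and `expected_particle_count` are hypothetical helpers.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Simplified stand-in for amrex::RealVect
using RealVect = std::array<double, 3>;

// Hypothetical helper: true if a probe lies inside the problem domain
// (half-open on the high side).
bool in_domain(const RealVect& p, const RealVect& dom_lo, const RealVect& dom_hi)
{
    for (int d = 0; d < 3; ++d) {
        if (p[d] < dom_lo[d] || p[d] >= dom_hi[d]) {
            return false;
        }
    }
    return true;
}

// Count the probes that survive the domain filter; the hard check then
// compares the particles actually added against this filtered count
// rather than the raw input count, so out-of-domain probes no longer
// trip the check.
std::size_t expected_particle_count(
    const std::vector<RealVect>& locs,
    const RealVect& dom_lo, const RealVect& dom_hi)
{
    std::size_t n = 0;
    for (const auto& p : locs) {
        if (in_domain(p, dom_lo, dom_hi)) {
            ++n;
        }
    }
    return n;
}
```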
Progress is being made... I need to fix the m_fill_val in the radar sampler to be the min of the domain (checking to see if that's allowed). Then the netcdf builds need to be fixed for the realvect change. And then, finally, I can call sampling_locations with the locally owned boxes. Phew.
We are close. I added the box check, which is the thing that should help the memory pressure. I need to check it on a large case and see whether it actually works and causes no diffs. The only other thing I am "worried" about is the radar sampler, since I don't have an input file for that.
Ran the new reg test with MPI and everything on, and I don't see any diffs in the particle files. I would like to double-check netcdf. But here's the upshot:
- successfully removes the memory spike: the 10 GB spike is gone and the max we see is about 250 MB on each rank.
- however, initialization of the particles is much slower with this PR. This is a bummer. Basically, we are trading memory usage for computation. TANSTAAFL. One thing I want to think about is whether there's a way to give it a bounding box...
This PR:

```
Time spent in InitData(): 106.9760745
Time spent in Evolve(): 4.634056547
```

main branch:

```
Time spent in InitData(): 10.57266587
Time spent in Evolve(): 4.56230204
```
This is the diff with the big case I am using to benchmark (40M probes):

| field | absolute_error | relative_error |
| --- | --- | --- |
| uid | 0.000000e+00 | 0.000000e+00 |
| set_id | 0.000000e+00 | 0.000000e+00 |
| probe_id | 0.000000e+00 | 0.000000e+00 |
| xco | 1.104635e-08 | 1.092440e-16 |
| yco | 0.000000e+00 | 0.000000e+00 |
| zco | 0.000000e+00 | 0.000000e+00 |
| velocityx | 4.813321e-13 | 9.463192e-18 |
| velocityy | 7.431559e-13 | 1.052051e-15 |
| velocityz | 0.000000e+00 | 0.000000e+00 |
OK, here are the updated plots with the improvements I made:
And now we are back to something reasonable:

```
Time spent in InitData(): 16.04659739
Time spent in Evolve(): 4.62255488
```
Checklist:
- [x] check netcdf
- [ ] rerun all the tests
- [x] check abl_sampling test and large test files from a GPU run
- [x] check free surface
- [x] I think I can actually simplify now that I did all that extra work
Because I can't leave well enough alone... that last simplification means we are now even faster than the main branch in the init (2X speedup). Turns out TANSTAAFL is not a thing.
```
Time spent in InitData(): 5.056825252
Time spent in Evolve(): 5.084482466
```
not sure if you've incorporated my additions to probe sampler, but hopefully those won't be too bad to address.
Yup I merged those in when you made the changes. You had good tests in there that made it easy.
That last one made the init 10% faster than the one before.
```
Time spent in InitData(): 4.615414819
Time spent in Evolve(): 5.122051878
```
Not enough to claim that the 80s were right about greed. I will keep it though.
abl_sampling_netcdf with 10 ranks on Kestrel:

This PR:

```
Time spent in InitData(): 0.38666686
Time spent in Evolve(): 1.077089947
```

main branch:

```
Time spent in InitData(): 0.483688814
Time spent in Evolve(): 1.074861178
```
Things "look" right:
```
❯ cd post_processing_dev
❯ du -shc *
28K abl_statistics00000.nc
32K line_sampling00000.nc
136K plane_sampling00000.nc
32K probe_sampling00000.nc
5.8M volume_sampling00000.nc
88K volume_sampling200000.nc
6.1M total
❯ ..
❯ cd post_processing_main
❯ du -shc *
28K abl_statistics00000.nc
32K line_sampling00000.nc
136K plane_sampling00000.nc
32K probe_sampling00000.nc
5.8M volume_sampling00000.nc
88K volume_sampling200000.nc
6.1M total
```
But I would like to have an nccmp output... which isn't on any machine I have access to yet.
Got nccmp added to Kestrel:

```
❯ nccmp -d post_processing_dev/volume_sampling00000.nc post_processing_main/volume_sampling00000.nc
❯ nccmp -d post_processing_dev/volume_sampling200000.nc post_processing_main/volume_sampling200000.nc
```

Good to go for netcdf.
I checked one of our cases, which previously crashed on CPU in the first step. I changed the output format to native, and it has already run fine for 100 seconds. Will keep you updated.
With some final tweaks, I think we are good to go. Running the GPU tests now.