amr-wind
Remove memory spike in Sampling
Summary
This removes the memory spike on the ioproc when doing sampling by initializing the sampling particles in parallel.
Pull request type
Please check the type of change introduced:
- [ ] Bugfix
- [ ] Feature
- [ ] Code style update (formatting, renaming)
- [x] Refactoring (no functional changes, no api changes)
- [ ] Build related changes
- [ ] Documentation content changes
- [ ] Other (please describe):
Checklist
The following is included:
- [ ] new unit-test(s)
- [x] new regression test(s)
- [ ] documentation for new capability
This PR was tested by running:
- the unit tests
  - [ ] on GPU
  - [ ] on CPU
- the regression tests
  - [ ] on GPU
  - [ ] on CPU
Additional background
Issue Number: #1186
This works, except that now all ranks need to know about the probe locations. So it ends up needing more memory in total, even though there is no spike. The next step I want to take before this PR is merged is to have each rank only know about the probe locations relevant to it. Basically, I want to pass a box to sampling_locations and get back only the locations inside that box. But I need to switch to something else for a bit.
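The box-filtering idea above could be sketched roughly like this. This is a minimal sketch: `RealVect` and `RealBox` here are simplified stand-ins for the AMReX types, and this `sampling_locations` signature is hypothetical, not the actual amr-wind API.

```cpp
#include <array>
#include <cassert>
#include <vector>

// Simplified stand-ins for amrex::RealVect and amrex::RealBox
using RealVect = std::array<double, 3>;

struct RealBox {
    RealVect lo;
    RealVect hi;
    // Half-open containment test: lo <= p < hi in each direction
    bool contains(const RealVect& p) const {
        for (int d = 0; d < 3; ++d) {
            if (p[d] < lo[d] || p[d] >= hi[d]) {
                return false;
            }
        }
        return true;
    }
};

// Hypothetical signature: each rank asks only for the probe locations
// inside a box it owns, instead of every rank holding all of them.
std::vector<RealVect> sampling_locations(
    const std::vector<RealVect>& all_locs, const RealBox& box)
{
    std::vector<RealVect> local;
    for (const auto& p : all_locs) {
        if (box.contains(p)) {
            local.push_back(p);
        }
    }
    return local;
}
```

With something like 40M probes, every rank holding all locations is the total-memory cost noted above; filtering per box keeps each rank's copy proportional to the probes it actually owns.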
Status update before I leave this aside for a bit:
- There is currently a hard check that the number of particles added matches what was expected to be added. This is a good check, but if a sampling location is not in the domain, it will fail. Do I silently filter out sampling locations outside the domain, or abort? This shows up in the dam_break_godunov test case.
- Now that I can dynamically add sampling locations, I am going to have each rank only handle the sampling locations that are in the boxes it owns. That should further lower the memory consumption.
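One way to reconcile the hard count check with out-of-domain probes, under the "silently filter" option, would be to count only the in-domain locations up front and compare the particles actually added against that. This is a sketch, not the code in this PR; `in_domain` and `expected_particle_count` are hypothetical helpers.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Simplified stand-in for amrex::RealVect
using RealVect = std::array<double, 3>;

// Hypothetical helper: true if a probe lies inside the problem domain
// (half-open on the high side).
bool in_domain(const RealVect& p, const RealVect& dom_lo, const RealVect& dom_hi)
{
    for (int d = 0; d < 3; ++d) {
        if (p[d] < dom_lo[d] || p[d] >= dom_hi[d]) {
            return false;
        }
    }
    return true;
}

// Count the probes that survive the domain filter; the hard check then
// compares the particles actually added against this filtered count
// rather than the raw input count, so out-of-domain probes no longer
// trip the check.
std::size_t expected_particle_count(
    const std::vector<RealVect>& locs,
    const RealVect& dom_lo, const RealVect& dom_hi)
{
    std::size_t n = 0;
    for (const auto& p : locs) {
        if (in_domain(p, dom_lo, dom_hi)) {
            ++n;
        }
    }
    return n;
}
```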
Progress is being made... I need to fix the m_fill_val in the radar sampler to be the min of the domain (checking to see if that's allowed). Then the netcdf builds need to be fixed for the realvect change. And then, finally, I can call sampling_locations with the locally owned boxes. Phew.
We are close. I added the box check, which is the thing that should help the memory pressure. I need to check it on a large case and see whether it actually works and causes no diffs. The only other thing I am "worried" about is the radar sampler, since I don't have an input file for that.
Ran the new reg test with MPI and everything on, and I don't see any diffs in the particle files. I would like to double-check netcdf. But here's the upshot:
- successfully removes the memory spike: the 10 GB spike is gone and the max we see is about 250 MB on each rank.
- however, initialization of the particles is much slower with this PR. This is a bummer. Basically, we are trading memory usage for computation. TANSTAAFL. One thing I want to think about is whether there's a way to give it a bounding box...
This PR:

```
Time spent in InitData(): 106.9760745
Time spent in Evolve(): 4.634056547
```

main branch:

```
Time spent in InitData(): 10.57266587
Time spent in Evolve(): 4.56230204
```
This is the diff with the big case I am using to benchmark (40M probes):

| field | absolute_error | relative_error |
| --- | --- | --- |
| uid | 0.000000e+00 | 0.000000e+00 |
| set_id | 0.000000e+00 | 0.000000e+00 |
| probe_id | 0.000000e+00 | 0.000000e+00 |
| xco | 1.104635e-08 | 1.092440e-16 |
| yco | 0.000000e+00 | 0.000000e+00 |
| zco | 0.000000e+00 | 0.000000e+00 |
| velocityx | 4.813321e-13 | 9.463192e-18 |
| velocityy | 7.431559e-13 | 1.052051e-15 |
| velocityz | 0.000000e+00 | 0.000000e+00 |
OK, here are the updated plots with the improvements I made:
And now we are back to something reasonable:

```
Time spent in InitData(): 16.04659739
Time spent in Evolve(): 4.62255488
```
Checklist:
- [x] check netcdf
- [ ] rerun all the tests
- [x] check abl_sampling test and large test files from a GPU run
- [x] check free surface
- [x] I think I can actually simplify now that I did all that extra work
Because I can't leave well enough alone... that last simplification means we are now even faster than the main branch in the init (2X speedup). Turns out TANSTAAFL is not a thing.
```
Time spent in InitData(): 5.056825252
Time spent in Evolve(): 5.084482466
```
not sure if you've incorporated my additions to probe sampler, but hopefully those won't be too bad to address.
Yup I merged those in when you made the changes. You had good tests in there that made it easy.
That last one made the init 10% faster than the one before.
```
Time spent in InitData(): 4.615414819
Time spent in Evolve(): 5.122051878
```
Not enough to claim that the 80s were right about greed. I will keep it though.
abl_sampling_netcdf with 10 ranks on Kestrel:

This PR:

```
Time spent in InitData(): 0.38666686
Time spent in Evolve(): 1.077089947
```

main branch:

```
Time spent in InitData(): 0.483688814
Time spent in Evolve(): 1.074861178
```
Things "look" right:
```
❯ cd post_processing_dev
❯ du -shc *
28K abl_statistics00000.nc
32K line_sampling00000.nc
136K plane_sampling00000.nc
32K probe_sampling00000.nc
5.8M volume_sampling00000.nc
88K volume_sampling200000.nc
6.1M total
❯ ..
❯ cd post_processing_main
❯ du -shc *
28K abl_statistics00000.nc
32K line_sampling00000.nc
136K plane_sampling00000.nc
32K probe_sampling00000.nc
5.8M volume_sampling00000.nc
88K volume_sampling200000.nc
6.1M total
```
But I would like to have an nccmp output... which isn't on any machine I have access to yet.
Got nccmp added to Kestrel:

```
❯ nccmp -d post_processing_dev/volume_sampling00000.nc post_processing_main/volume_sampling00000.nc
❯ nccmp -d post_processing_dev/volume_sampling200000.nc post_processing_main/volume_sampling200000.nc
```

Good to go for netcdf.
I checked one of our cases, which previously crashed on CPU in the first step. I changed the output format to native, and it has already run fine for 100 seconds. Will keep you updated.
With some final tweaks, I think we are good to go. Running the GPU tests now.