yt
[WIP] Adds function for creating particle-based dataset from grid-based dataset
PR Summary
This adds the `monte_carlo_sample()` function for converting grid-based datasets to particle-based datasets. This can be used in a number of ways, including providing a means of direct comparison between grid-based and particle-based codes starting from the same data. In particular, it will let yt be used with the Firefly code for real-time visualization in the web browser for both particle-based and grid-based datasets (Firefly was built for the particle-based FIRE simulations).
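For readers skimming the thread, here is a rough sketch of what mass-weighted Monte Carlo sampling of grid cells can look like. This is my own illustration with made-up names, not the implementation in this PR:

```python
import numpy as np

# Illustrative sketch only -- not the code in this PR. Given flat arrays of
# cell centers, cell widths, and cell masses from a grid dataset, draw cells
# with probability proportional to their mass and jitter each particle
# uniformly within its parent cell.
def sample_particles_from_cells(centers, widths, masses, n_samples, seed=None):
    rng = np.random.default_rng(seed)
    prob = masses / masses.sum()
    idx = rng.choice(len(masses), size=n_samples, p=prob)
    jitter = (rng.random((n_samples, 3)) - 0.5) * widths[idx]
    positions = centers[idx] + jitter
    # Equal-mass particles that conserve the total sampled mass.
    particle_mass = np.full(n_samples, masses.sum() / n_samples)
    return positions, particle_mass
```

Presumably the PR then wraps such samples into a particle dataset and attaches a smoothing length, which is where the discussion below picks up.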
This is a WIP, so I'm open to making significant changes to this code, like its location and method. I couldn't find an obvious place to put it, which is why it is here.
This function is reasonably fast at the 1e5-1e6 sampling level, operating in just a few seconds when I use `('gas', 'dx')` as an analog for `('gas', 'smoothing_length')`. However, when I use the kdtree to determine the smoothing length field, it introduces some problems. One, it is a bit slower. Two, the smoothing lengths come out very large, causing the image generation step to take a lot longer and the image to get washed out. I could just arbitrarily shrink the `smoothing_length` values calculated by the kdtree, but I figure either I'm doing something wrong or there is a bug in the `generate_smoothing_length()` code. Images below. Also, almost all of the kdtree code was taken directly from @qobilidop's open PR #2186 (thanks, @qobilidop!), so it may make more sense to modularize some of these functions and use them in both pieces of functionality.
Script to test and generate images:
import yt
from yt.data_objects.data_containers import monte_carlo_sample
fn = '~/src/yt-data/IsolatedGalaxy/galaxy0030/galaxy0030'
ds = yt.load(fn)
ds_particle = monte_carlo_sample(ds)
p = yt.ProjectionPlot(ds, 'z', ('gas', 'density'), width=(0.04, 'Mpc'))
p.set_zlim(('gas', 'density'), 1e-4, 1e-2)
p.save('grid.png')
p = yt.ProjectionPlot(ds_particle, 'z', ('gas', 'density'), width=(0.04, 'Mpc'))
p.set_zlim(('gas', 'density'), 1e-4, 1e-2)
p.save('particles.png')
original grid dataset -- grid.png:
generated particle dataset -- particles.png (incorrectly using `dx` as the `smoothing_length`):
generated particle dataset -- particles.png (using the kdtree to generate the `smoothing_length`):
PR Checklist
- [ ] Code passes flake8 checker
- [ ] New features are documented, with docstrings and narrative docs
- [ ] Adds tests.
Hi there, @chummels! Thanks for updating this PR.
There are currently no PEP 8 issues detected in this Pull Request. Hooray! :fireworks:
Comment last updated at 2019-03-20 05:38:41 UTC
@chummels It's reporting far more PEP8 errors than our flake8 usually catches, but I'm not sure why; the latest commit on master excludes these checks. So for now, I would suggest ignoring its reply...
(The PEP-8 bot should be fixed by #2188.)
That's pretty sweet that it gives all this feedback! Much better than the PEP8 tests failing and then having to look through the test outputs to figure out what went wrong.
No, not linearly increasing. In this simulation, the grids are for the most part nested around the galaxy with discrete transitions, whereas the smoothing length doesn't follow those transitions nearly as sharply. But it does illustrate that the smoothing lengths are ~10x larger than the dx values, hence the washed-out structure in the final image.
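For reference, here's a quick way one might eyeball that ratio (assuming the `ds`/`ds_particle` objects from the script at the top of the thread, and that both fields are accessible as written):

```python
import numpy as np

# Compare the grid cell widths against the kdtree-generated smoothing lengths.
dx = ds.all_data()[('gas', 'dx')].to('kpc')
hsml = ds_particle.all_data()[('gas', 'smoothing_length')].to('kpc')

dx_med = float(np.median(dx))
hsml_med = float(np.median(hsml))
print("median dx: %.3g kpc, median smoothing_length: %.3g kpc, ratio: %.1f"
      % (dx_med, hsml_med, hsml_med / dx_med))
```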
If you use more particles to generate the sampled particle dataset does it converge better and look less washed out?
Yes, a bit, but the processing time grows quickly. For reference, the number of grid elements in the original dataset is 3.6e6. Here are the results for various values of `n_samples` (i.e., particles) approaching the total number of grid cells, along with the processing time.
1e3 particles, 9 seconds on a Macbook Pro:
1e4 particles, 14 seconds on a Macbook Pro:
1e5 particles, 44 seconds on a Macbook Pro:
1e6 particles, 483 seconds on a Macbook Pro:
Things sharpen a bit as I drop the `n_neighbors` value, which is why my default is down at 8 instead of 64. This is 1e6 particles with `n_neighbors` of 2, 1206 seconds on a Macbook Pro:
In contrast, 1e6 particles using the `dx` field for the smoothing length takes 15 seconds on a Macbook Pro:
For further reference, using `n_neighbors = 1` and `n_samples = 3e6` (now the same number of particles as cells in the original dataset), the result still doesn't entirely match the original dataset. I don't know of a way to make it "sharper", and the processing took over an hour (3884 seconds) for what is nominally not that large an original dataset.
It seems to me that because we're producing the particle-based dataset from the original grid-based dataset, there is additional information present that can be used for the `smoothing_length` instead of just calculating it from scratch, which is essentially what we're doing by invoking the cykdtree calculation step. But it seems that just using the `dx` value is inappropriate and arguably too small, which leads to the noise seen with that method. It also leads to grid artifacts across discrete grid boundaries in the original dataset. But perhaps some variant on that would work?
I've tried smoothing the `dx` field with a Gaussian kernel before using it as the `smoothing_length`, which works OK, but it ends up smearing out structure between the galactic disk and its halo, unsurprisingly. It does get rid of the grid artifacts, though, and it's fast. Definitely needs some more thought.
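In case it's useful, here is one concrete way the "smoothed dx" idea could be prototyped (entirely my own sketch, not the PR code): average the per-particle dx values onto a uniform grid, blur that grid with `scipy.ndimage.gaussian_filter`, and read the blurred values back at the particle positions.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

# Assumes `positions` (N, 3) and `dx` (N,) arrays in the same (code) units,
# carried over from the sampling step; names and bin count are arbitrary.
nbins, sigma = 128, 2.0  # sigma is in units of bins and is a free knob
edges = [np.linspace(lo, hi, nbins + 1)
         for lo, hi in zip(positions.min(axis=0), positions.max(axis=0))]

counts, _ = np.histogramdd(positions, bins=edges)
dx_sum, _ = np.histogramdd(positions, bins=edges, weights=dx)
dx_mean = np.where(counts > 0, dx_sum / np.maximum(counts, 1.0), np.nan)

# Fill empty bins so the blur doesn't spread NaNs, then smooth.
dx_mean = np.where(np.isnan(dx_mean), np.nanmean(dx_mean), dx_mean)
dx_smooth = gaussian_filter(dx_mean, sigma=sigma)

# Look up each particle's bin to get its smoothed dx as a smoothing length.
ijk = np.stack([np.clip(np.digitize(positions[:, d], edges[d]) - 1, 0, nbins - 1)
                for d in range(3)], axis=1)
hsml = dx_smooth[ijk[:, 0], ijk[:, 1], ijk[:, 2]]
```

The bin count and `sigma` control the trade-off between killing the grid artifacts and the disk/halo smearing mentioned above.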
It occurs to me that it should be possible to convert an AMR dataset into an SPH dataset and then back into an AMR dataset using @AshKelly's cyoctree :)
Yeah, a bit like Google-translating things back and forth.
What happens if you crank up the number of particles to, say, 1e7? It might take forever, but does it get the right answer? What about self-consistent particle merging afterwards to follow the density gradient?
@chummels do you know what the rate determining step is? There might also be straightforward optimizations that could be applied...
@MatthewTurk The bottleneck in this process is never the Monte Carlo step. When the number of particles (`n_samples`) is large (>=1e6), the bottleneck is the creation of the kdtree. The other slowdown can occur when the smoothing length is relatively large, in which case the actual `ProjectionPlot` rendering step takes forever, because each pixel is intrinsically a line integral and has to include more particles that overlap its trajectory.
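For anyone who wants to reproduce this breakdown with the script from the top of the thread, one way (my suggestion, not part of the PR) is to profile the two stages separately:

```python
import cProfile
import pstats

# Profile the grid->particle conversion and the projection separately.
cProfile.run("ds_particle = monte_carlo_sample(ds)", "sample.prof")
cProfile.run(
    "yt.ProjectionPlot(ds_particle, 'z', ('gas', 'density'), "
    "width=(0.04, 'Mpc')).save('particles.png')",
    "plot.prof",
)

for prof in ("sample.prof", "plot.prof"):
    pstats.Stats(prof).sort_stats("cumulative").print_stats(10)
```

That at least separates the kdtree cost from the SPH projection cost when deciding what to optimize.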
@agurvich I can try with 1e7 particles, but it's going to take something like 8-12 hours to run, and I don't think this is a viable solution for most datasets, given that this is a reasonably small, simple dataset.
Yes, I agree. Just curious if it converges. Claude-André and I were discussing it at group meeting Friday morning.
I addressed a bunch of the feedback from the first iteration. It now works with a specified data object; it can either calculate the smoothing length from the kdtree, or just use the `dx` values of the original dataset for speed or when the smoothing length doesn't need to be 100% accurate (e.g., Firefly); and it allows a non-mass weighting. I'm still open to moving this to another location or changing it if there are other suggestions. When PR #2186 merges, I will use the functions @qobilidop included to avoid too much code repetition.
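To make that description concrete, usage might look roughly like the following; note that every keyword name here (`data_source`, `n_samples`, `use_kdtree`, `weight_field`) is my guess at the interface described above, not the actual signature in the PR:

```python
import yt
from yt.data_objects.data_containers import monte_carlo_sample

ds = yt.load('~/src/yt-data/IsolatedGalaxy/galaxy0030/galaxy0030')
sp = ds.sphere('c', (30, 'kpc'))  # sample only a sub-region

# Fast path: reuse the grid's dx as the smoothing length
# (good enough for Firefly-style visualization).
ds_fast = monte_carlo_sample(ds, data_source=sp, n_samples=int(1e5),
                             use_kdtree=False)

# Slower, more self-consistent path: let the kdtree compute smoothing
# lengths, with an alternative (non-mass) weighting.
ds_kd = monte_carlo_sample(ds, data_source=sp, n_samples=int(1e5),
                           use_kdtree=True,
                           weight_field=('gas', 'cell_volume'))
```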
@chummels Do you still want this to go in, and if so, do you need help with the merge conflicts?
Yeah, I think this would be useful to go in, since all the basic functionality is there. We just never settled on what to use as the smoothing length: whether to grab a smoothed version of dx, or to let the kd_tree generate it, which is either really slow but reasonably accurate (for lots of particles) or fast and really washed out (for few particles). Perhaps we could have both options available for the end user and describe the caveats of each method?
I haven't looked at all the conflicts, but I'd be up for giving this the final push to get it into the code. I think it could be useful for code comparison projects, or at the very least for cramming grid-based datasets into tools that only work with particle-based datasets (like Firefly or potentially cosmovis).
I'd love to play around with some of this in Firefly, so I'd definitely like to see it merged. The only thing I can add is that I've been using this code to compute smoothing lengths for my own purposes (I think either Phil or Volker wrote it originally).
I imagine it's pretty similar to what y'all are using for the KD tree ¯\_(ツ)_/¯
> We just never settled on what to use as the smoothing length: whether to grab a smoothed version of dx, or to let the kd_tree generate it, which is either really slow but reasonably accurate (for lots of particles) or fast and really washed out (for few particles). Perhaps we could have both options available for the end user and describe the caveats of each method?
I'm very late to the party, but this sounds reasonable to me; if there is no sane default, it should be documented why, and the decision left to the user.
@chummels if you're still up for it, I think we can push this forward. For now I will refresh CI.
Ah, the close/reopen trick doesn't re-trigger most jobs because the branch is older than our migration to GitHub Actions. Instead, I will merge from main by resolving the only conflict here.
Since the conflict is only in import statements, I can resolve it without breaking anything by just accepting every line. Pre-commit + isort will remove duplicates and flag any unused import.
pre-commit.ci autofix