
[WIP] Adds function for creating particle-based dataset from grid-based dataset

Open · chummels opened this pull request 5 years ago · 24 comments

PR Summary

This adds the monte_carlo_sample() function for converting grid-based datasets to particle-based datasets. This can be used in a number of ways, including providing a means of direct comparison between grid-based and particle-based codes using the same starting point. In particular, it will help yt to be used with the FIREFly code for real-time visualization in the web browser for both particle-based and grid-based datasets (FIREFly was built for the particle-based FIRE simulations).

This is a WIP so I'm open to making significant changes to this code, like its location and method. Couldn't find an obvious place to put it which is why it is here.

This function is reasonably fast at the 1e5-1e6 sampling level, operating in just a few seconds when I use ('gas', 'dx') as an analog for the ('gas', 'smoothing_length'). However, when I use the kdtree to determine the smoothing length field, it introduces some problems. One, it is a bit slower. Two, the smoothing lengths appear to be really, really large, causing the image generation step to take a lot longer and the image to get all washed out. I guess I could just arbitrarily shrink the smoothing_length value from the one calculated by the kdtree, but I figure either I'm doing something wrong or there is a bug in the generate_smoothing_length() code. Images below. Also, almost all of the code for the kdtree stuff was taken directly from @qobilidop's open PR #2186 (thanks, @qobilidop!), so it may make more sense to modularize some of these functions and use them in both pieces of functionality.
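
For orientation, here is a minimal sketch of the sampling idea (mass-weighted cell selection, a uniform jitter inside each chosen cell, and yt.load_particles to wrap the result), using the same ('gas', 'dx') analog for the smoothing length mentioned above. This is only an illustration under those assumptions, not the PR's actual implementation, whose field and unit handling is more involved:

import numpy as np
import yt

ds = yt.load('~/src/yt-data/IsolatedGalaxy/galaxy0030/galaxy0030')
ad = ds.all_data()
n_samples = 100000

# Draw cells with probability proportional to their gas mass.
mass = ad['gas', 'cell_mass'].to_value('g')
prob = mass / mass.sum()
idx = np.random.choice(mass.size, size=n_samples, p=prob)

# Place one particle at a random position inside each drawn cell.
dx = ad['index', 'dx'].to_value('cm')[idx]
pos = np.column_stack([ad['index', ax].to_value('cm')[idx] for ax in 'xyz'])
pos += (np.random.random(pos.shape) - 0.5) * dx[:, None]

# Each particle carries an equal share of the total gas mass; dx stands in
# for the smoothing length (the fast option discussed above).
data = {
    'particle_position_x': pos[:, 0],
    'particle_position_y': pos[:, 1],
    'particle_position_z': pos[:, 2],
    'particle_mass': np.full(n_samples, mass.sum() / n_samples),
    'smoothing_length': dx,
}
bbox = np.array([ds.domain_left_edge.to_value('cm'),
                 ds.domain_right_edge.to_value('cm')]).T
ds_particle = yt.load_particles(data, length_unit='cm', mass_unit='g', bbox=bbox)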

Script to test and generate images:

import yt
from yt.data_objects.data_containers import \
    monte_carlo_sample

fn = '~/src/yt-data/IsolatedGalaxy/galaxy0030/galaxy0030'
ds = yt.load(fn)

# Convert the grid-based dataset into a particle-based one.
ds_particle = monte_carlo_sample(ds)

# Project the original grid dataset for reference.
p = yt.ProjectionPlot(ds, 'z', ('gas', 'density'), width=(0.04, 'Mpc'))
p.set_zlim(('gas', 'density'), 1e-4, 1e-2)
p.save('grid.png')

# Project the sampled particle dataset with the same field, width, and limits.
p = yt.ProjectionPlot(ds_particle, 'z', ('gas', 'density'), width=(0.04, 'Mpc'))
p.set_zlim(('gas', 'density'), 1e-4, 1e-2)
p.save('particles.png')

original grid dataset -- grid.png:

[image: grid]

generated particle dataset -- particles.png (incorrectly using dx as the smoothing_length):

[image: particle_fixed]

generated particle dataset -- particles.png (using the kdtree to generate the smoothing_length):

[image: particles]

PR Checklist

  • [ ] Code passes flake8 checker
  • [ ] New features are documented, with docstrings and narrative docs
  • [ ] Adds tests.

chummels · Mar 06 '19 03:03

Hi there, @chummels! Thanks for updating this PR.

There are currently no PEP 8 issues detected in this Pull Request. Hooray! :fireworks:

Comment last updated at 2019-03-20 05:38:41 UTC

pep8speaks · Mar 06 '19 03:03

@chummels It's reporting far more PEP8 errors than our flake8 usually catches, but I'm not sure why. In the latest commit on master it excludes these. So for now, I would suggest ignoring its reply...

matthewturk · Mar 06 '19 03:03

(The PEP-8 bot should be fixed by #2188.)

matthewturk · Mar 06 '19 03:03

That's pretty sweet that it gives all this feedback! Much better than the PEP8 tests failing and then having to look through the test outputs to figure out what went wrong.

chummels · Mar 06 '19 04:03

[image: dx_vs_smoothing_len]

No, not linearly increasing. In this simulation, the grids are for the most part nested around the galaxy with discrete transitions, whereas the smoothing length does not show such discrete jumps.

chummels · Mar 06 '19 18:03

But it does illustrate that the smoothing lengths are ~10x larger than the dx values, hence the really washed out structure in the final image.

chummels · Mar 06 '19 18:03

If you use more particles to generate the sampled particle dataset does it converge better and look less washed out?

ngoldbaum · Mar 06 '19 18:03

Yes, a bit, but the processing time becomes increasingly long. For reference, the number of grid cells in the original dataset is 3.6e6. Here are the results for various values of n_samples (i.e., particles) approaching the total number of cells, along with the processing time.

  • 1e3 particles, 9 seconds on a MacBook Pro: [image: particles_1e3]
  • 1e4 particles, 14 seconds on a MacBook Pro: [image: particles_1e4]
  • 1e5 particles, 44 seconds on a MacBook Pro: [image: particles_1e5]
  • 1e6 particles, 483 seconds on a MacBook Pro: [image: particles_1e6]

Things sharpen a bit as I drop the n_neighbors value, which is why my default is 8 instead of 64. This is 1e6 particles with n_neighbors of 2, 1206 seconds on a MacBook Pro: [image: particles]

In contrast, 1e6 particles using the dx field for the smoothing length takes 15 seconds on a MacBook Pro: [image: particles]

chummels · Mar 06 '19 23:03

For further reference, using n_neighbors = 1 and n_samples = 3e6 (roughly the same number of particles as cells in the original dataset), the result still doesn't entirely look like the original dataset. I don't know of a way to make it "sharper", and the processing time was over an hour (3884 seconds) for what is nominally not that large an original dataset.

[image: particles]

chummels · Mar 07 '19 01:03

It seems to me that because we're producing the particle-based dataset from the original grid-based dataset, there is additional information present that can be used for the smoothing_length instead of just calculating it from scratch, which is effectively what we're doing by invoking the cykdtree calculation step. But it seems that just using the dx value is inappropriate and arguably too small, which leads to the noise seen when we use that method. It also leads to grid artifacts across discrete grid boundaries in the original dataset. But perhaps some variant on dx could work?

I've tried smoothing the dx field with a Gaussian kernel before using it as the smoothing_length, which works OK, but unsurprisingly it ends up smearing out structure between the galactic disk and its halo. It does get rid of the grid artifacts, though, and it's fast. This definitely needs some more thought.

[image: particles]
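
For reference, one way such a smoothed dx could be computed is a Gaussian-weighted average of dx over the cells nearest each particle, via scipy's cKDTree. This is a sketch of the idea rather than the code behind the image above, and the kernel width sigma, neighbor count, and stand-in particle positions are placeholder choices:

import numpy as np
import yt
from scipy.spatial import cKDTree

ds = yt.load('~/src/yt-data/IsolatedGalaxy/galaxy0030/galaxy0030')
ad = ds.all_data()

# Cell centers and widths as flat arrays, in code units.
cell_pos = np.column_stack([ad['index', ax].to_value('code_length') for ax in 'xyz'])
cell_dx = ad['index', 'dx'].to_value('code_length')

# Stand-in particle positions (in practice these come from the sampler).
particle_pos = cell_pos[np.random.choice(cell_pos.shape[0], size=10000)]

# Gaussian-weighted average of dx over the nearest cells to each particle;
# this softens the discrete jumps in dx across grid boundaries.
tree = cKDTree(cell_pos)
sigma = np.median(cell_dx)  # kernel width: an arbitrary choice
d, nbr = tree.query(particle_pos, k=32)
w = np.exp(-0.5 * (d / sigma) ** 2)
smoothing_length = (w * cell_dx[nbr]).sum(axis=1) / w.sum(axis=1)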

chummels · Mar 07 '19 01:03

It occurs to me that it should be possible to convert an AMR dataset into an SPH dataset and then back into an AMR dataset using @AshKelly's cyoctree :)

ngoldbaum · Mar 07 '19 19:03

Yeah, a bit like Google Translating things back and forth between languages. I think it would introduce some noise, but it should in principle work.

chummels · Mar 07 '19 19:03

What happens if you crank up the number of particles to, say, 1e7? It might take forever, but does it get the right answer? What about self-consistent particle merging afterwards to follow the density gradient?

agurvich · Mar 08 '19 15:03

@chummels do you know what the rate determining step is? There might also be straightforward optimizations that could be applied...

matthewturk · Mar 08 '19 16:03

@matthewturk The bottleneck in this process is never the Monte Carlo step. When the number of particles (n_samples) is large (>= 1e6), the bottleneck is the creation of the kdtree. The other slowdown can occur when the smoothing length is relatively large, in which case the actual ProjectionPlot rendering step takes forever, because each pixel, which is intrinsically a line integral, has to include more particles that overlap its trajectory.
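
One quick way to confirm this kind of breakdown is Python's built-in cProfile; the following is a generic sketch, not part of the PR, and the import path follows the WIP branch, so it may change:

import cProfile
import pstats

import yt
from yt.data_objects.data_containers import monte_carlo_sample

ds = yt.load('~/src/yt-data/IsolatedGalaxy/galaxy0030/galaxy0030')

prof = cProfile.Profile()
prof.enable()
ds_particle = monte_carlo_sample(ds)
prof.disable()

# Show the most expensive calls by cumulative time; for large n_samples the
# kdtree construction should dominate, per the discussion above.
pstats.Stats(prof).sort_stats('cumulative').print_stats(15)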

chummels · Mar 11 '19 16:03

@agurvich I can try with 1e7 particles, but it's going to take like 8-12 hours to run, and I don't think this is a viable solution for most datasets, given the fact that this is a reasonably small, simple dataset.

chummels · Mar 11 '19 16:03

yes, I agree. Just curious if it converges. Claude-André and I were discussing it at group meeting Friday morning.

agurvich · Mar 11 '19 17:03

I addressed a bunch of the feedback from the first iteration. It now works with a specified data object; it can either calculate the smoothing length from the kdtree, or just use the dx values of the original dataset for speed or when the smoothing length doesn't need to be 100% accurate (e.g., for Firefly); and it allows a non-mass weighting. I'm still open to moving this to another location or changing it if there are other suggestions. When PR #2186 merges, I will use the functions @qobilidop included to avoid too much code repetition.

chummels · Mar 20 '19 05:03

@chummels Do you still want this to go in, and if so, do you need help with the merge conflicts?

matthewturk · Sep 16 '20 14:09

Yeah, I think this would be useful to go in, since all the basic functionality is there. I guess we just never settled on what to use as the smoothing length, whether to grab a smoothed version of dx or to let the kd_tree generate it, which is either really slow but reasonably accurate (for lots of particles) or fast and really washed out (for few particles). I guess we could have both options available for the end user and describe the caveats of both methods?
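
To make the "both options" idea concrete, a hypothetical helper along these lines (names and defaults are made up here, not the PR's API; it uses raw dx for brevity, though the smoothed-dx variant discussed earlier could slot in the same way) could expose the choice and document the trade-offs:

import numpy as np
from scipy.spatial import cKDTree

def estimate_smoothing_length(positions, dx, method='dx', n_neighbors=8):
    """Return one smoothing length per sampled particle.

    positions   : (N, 3) array of particle positions
    dx          : (N,) array of widths of the cells the particles were drawn from
    method      : 'dx'     -> fast; reuses the grid resolution (can show grid artifacts)
                  'kdtree' -> slower; distance to the n_neighbors-th nearest neighbor
    """
    if method == 'dx':
        return dx
    tree = cKDTree(positions)
    # k = n_neighbors + 1 because the closest "neighbor" of a point is itself
    d, _ = tree.query(positions, k=n_neighbors + 1)
    return d[:, -1]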

I haven't looked at all the conflicts, but I'd be up for giving this the final push to get it into the code base. I think it could be useful for code comparison projects, or at the very least for cramming grid-based datasets into tools that only work with particle-based datasets (like Firefly or potentially cosmovis).

chummels · Dec 12 '20 04:12

I'd love to play around with some of this in Firefly, so I'd definitely like to see it merged. The only thing I can add is that I've been using this code to compute smoothing lengths for my own purposes (I think either Phil or Volker wrote it originally).

I imagine it's pretty similar to what y'all are using for the KD tree ¯\_(ツ)_/¯

agurvich · Dec 12 '20 12:12

we just never settled on what to use as the smoothing length, whether to grab a smoothed version of dx or to let the kd_tree generate it, which is either really slow but reasonably accurate (for lots of particles) or fast and really washed out (for few particles). I guess we could have both options available for the end user and describe the caveats of both methods?

I'm very late to the party, but this sounds reasonable to me; if there is no sane default, it should be documented why, and the decision left to the user.

@chummels if you're still up for it, I think we can push this forward. For now I will refresh CI.

neutrinoceros · Aug 24 '22 13:08

Ah, the close/reopen trick doesn't re-trigger most jobs because the branch is older than our migration to GitHub Actions. Instead, I will merge from main by resolving the only conflict here.

Since the conflict is only in import statements, I can resolve it without breaking anything by just accepting every line. Pre-commit + isort will remove duplicates and flag any unused import.

neutrinoceros · Aug 24 '22 13:08

pre-commit.ci autofix

neutrinoceros · Aug 24 '22 13:08