
Explore loading very large unigrid data using dask

Open · ngoldbaum opened this issue 6 years ago · 6 comments

It should be possible to provide an interface for loading arbitrarily large unigrid data in an out-of-core fashion, as mediated by dask.

This might be an interesting way to support a code like Cholla.
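For a rough sense of what dask-mediated, out-of-core access could look like, here is a minimal sketch that wraps one field of an HDF5 file in a chunked dask array; the file name, field name, and chunk shape are placeholder assumptions, not anything Cholla- or yt-specific:

import h5py
import dask.array as da

# Nothing is read here; the h5py dataset is only wrapped in a chunked dask array.
f = h5py.File("data.h5", "r")
density = da.from_array(f["density"], chunks=(64, 64, 64))

# Operations build a task graph lazily; IO happens chunk-by-chunk at compute time.
print(density.mean().compute())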

ngoldbaum avatar Jul 13 '18 15:07 ngoldbaum

The array design doc describes how to supply chunks, etc.

Something that may require a refactor in yt is that we have chunks, but we don't necessarily know their sizes in advance, and we don't necessarily have a method that can (easily) do random access at the chunk level.

matthewturk avatar Jul 13 '18 15:07 matthewturk

https://github.com/dask/dask/pull/1838#issue-97248379 addresses unknown chunk sizes
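To make the unknown-size case concrete, here is a minimal sketch of building a dask array whose chunk lengths are only discovered at compute time; read_chunk is a hypothetical stand-in for a frontend's per-chunk IO:

import numpy as np
import dask
import dask.array as da

@dask.delayed
def read_chunk(i):
    # pretend the chunk length depends on what is actually on disk
    return np.arange((i + 1) * 1000, dtype="float64")

# Declaring the shape as nan tells dask the chunk sizes are unknown up front.
pieces = [da.from_delayed(read_chunk(i), shape=(np.nan,), dtype="float64")
          for i in range(4)]
arr = da.concatenate(pieces)

print(arr.chunks)            # ((nan, nan, nan, nan),)
print(arr.sum().compute())   # reductions work without knowing the chunk sizes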

matthewturk avatar Jul 13 '18 15:07 matthewturk

The iteration-first model of chunked IO in most of the frontends (for io-style chunking) makes this slightly harder to do, but it is fixable with a bit of refactoring. I believe that some of the grid_visitors work in #1346 refactors this for grid objects in a way that would make it work directly, but as it stands the _read_fluid_selection function in the IOHandler is a bit clunky to make work with this.
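As an illustration of the mismatch (everything below is a hypothetical stand-in, not yt's actual IO handler): dask wants to be handed chunk i on demand, while an iteration-first reader only knows how to loop over all chunks, so a naive adapter ends up re-running the whole iteration for every chunk:

import numpy as np
import dask
import dask.array as da

def iterate_chunks():
    # hypothetical iteration-first reader: it can only yield chunks in order
    for i in range(4):
        yield np.full(1000, i, dtype="float64")

@dask.delayed
def nth_chunk(n):
    # wasteful random-access shim over an iteration-only interface
    for i, chunk in enumerate(iterate_chunks()):
        if i == n:
            return chunk

pieces = [da.from_delayed(nth_chunk(n), shape=(1000,), dtype="float64")
          for n in range(4)]
arr = da.concatenate(pieces)
print(arr.mean().compute())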

matthewturk avatar Jul 13 '18 16:07 matthewturk

Here's an example Cholla dataset from Evan: https://www.dropbox.com/s/xoba7ibdqgt87c0/60.h5?dl=0

It should look like this:

[two embedded images showing the expected visualization of the dataset]
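For anyone poking at the file locally, a quick way to see what it contains (assuming it has been downloaded as 60.h5; no particular field names are assumed):

import h5py

with h5py.File("60.h5", "r") as f:
    print(dict(f.attrs))
    for name, dset in f.items():
        print(name, dset.shape, dset.dtype)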

ngoldbaum avatar Jul 13 '18 16:07 ngoldbaum

One major blocker might be dask unit support, i.e. the dask-unyt integration in our case. I came across some relevant discussions which might be worth noting down here for future reference:

qobilidop avatar Apr 11 '19 17:04 qobilidop

We spent some time looking at dask in a group today (cc @jcoughlin11 @cassidymwagner @munkm) and came up with some things that we aren't sure how to address. But! We did get a really simple proof of concept going. I have to note that it does not lazy-load data, but it does show how one might proceed toward lazy loading.

Units are passed through this, too, which seems neat. (EDIT: turns out, not quite on the units thing.)

import operator
import numpy as np
import dask.array as da

# dd is a yt data container (e.g. dd = ds.all_data()); note that
# dd["particle_position"] is evaluated eagerly here, which is why this is not lazy.
keys = {
    ('simple', 0, 0): (operator.getitem, dd["particle_position"], slice(0, 1000)),
    ('simple', 1, 0): (operator.getitem, dd["particle_position"], slice(1000, 11907080)),
}
chunks = ((1000, 11907080 - 1000), (3,))
arr = da.Array(keys, "simple", chunks, np.float64)
arr2 = arr**2

print(arr2.compute(), arr.compute(), arr2.shape)

It seems to me that this means that once we have a simple key/value mapping from "chunks" in a data object to their sizes (which may need to be on a per-field basis), we could issue dask arrays from data objects.
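A hedged generalization of the snippet above, assuming such a mapping existed: dask_array_from_chunks and chunk_sizes are hypothetical names, and like the snippet above this still holds the field in memory rather than reading it lazily.

import operator
import numpy as np
import dask.array as da

def dask_array_from_chunks(field, chunk_sizes, name="yt-field"):
    # field: a 2-D array-like, e.g. dd["particle_position"]
    # chunk_sizes: per-chunk lengths along axis 0, e.g. (1000, 11906080)
    offsets = np.concatenate([[0], np.cumsum(chunk_sizes)])
    dsk = {
        (name, i, 0): (operator.getitem, field, slice(int(lo), int(hi)))
        for i, (lo, hi) in enumerate(zip(offsets[:-1], offsets[1:]))
    }
    chunks = (tuple(int(n) for n in chunk_sizes), (field.shape[1],))
    return da.Array(dsk, name, chunks, field.dtype)

arr = dask_array_from_chunks(dd["particle_position"], (1000, 11907080 - 1000))
print((arr**2).compute())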

matthewturk avatar May 31 '19 18:05 matthewturk