Explore loading very large unigrid data using dask
It should be possible to provide an interface for loading arbitrarily large unigrid data in an out-of-core fashion, mediated by dask.
This might be an interesting way to support a code like Cholla.
The array design doc describes how to supply chunks, etc.
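For orientation, here is a minimal sketch of the basic dask form of chunk specification (plain dask.array usage, nothing yt-specific):

import numpy as np
import dask.array as da

x = np.empty((4000, 3))
# chunks can be given as a uniform per-axis block size; dask expands it to
# explicit per-axis block sizes, which is the form used later in this issue
a = da.from_array(x, chunks=(1000, 3))
print(a.chunks)  # ((1000, 1000, 1000, 1000), (3,))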
One thing that may require a refactor in yt: we have chunks, but we don't necessarily know their sizes in advance, and we don't necessarily have a method that can (easily) do random access at the chunk level.
https://github.com/dask/dask/pull/1838#issue-97248379 addresses unknown chunk sizes
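As a point of reference, dask can already represent arrays whose chunk sizes are not known until compute time. A minimal sketch, where read_chunk is a hypothetical stand-in for a frontend's per-chunk IO rather than any existing yt call:

import numpy as np
import dask
import dask.array as da

@dask.delayed
def read_chunk(i):
    # hypothetical per-chunk read whose size is unknown up front
    return np.random.random(1000 + i)

# np.nan chunk extents tell dask the sizes are unknown until compute time
pieces = [da.from_delayed(read_chunk(i), shape=(np.nan,), dtype=np.float64)
          for i in range(4)]
arr = da.concatenate(pieces)
print(arr.chunks)           # ((nan, nan, nan, nan),)
print(arr.compute().shape)  # concrete sizes only appear after compute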
The iteration-first model of chunked IO in most of the frontends (for io-style chunking) makes it slightly harder to do this, but it is fixable with a bit of refactoring. I believe some of the grid_visitors work in #1346 refactors this for grid objects in a way that would make it work directly, but as it stands the _read_fluid_selection function in the IOHandler is a bit clunky to make work with this.
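To make the direction concrete, here is a hedged sketch of wrapping an iteration-first chunk reader into a dask array; chunk_iter and read_field are hypothetical placeholders standing in for whatever the refactored IO path would expose, not existing yt API:

import numpy as np
import dask
import dask.array as da

def chunked_field_to_dask(chunk_iter, field, dtype=np.float64):
    # chunk_iter: hypothetical io-style chunk iterator from a frontend;
    # each chunk is assumed to know how to read `field` for its own region,
    # so its size can stay unknown (np.nan) until compute time.
    pieces = [
        da.from_delayed(dask.delayed(chunk.read_field)(field),
                        shape=(np.nan,), dtype=dtype)
        for chunk in chunk_iter
    ]
    return da.concatenate(pieces)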
Here's an example Cholla dataset from Evan: https://www.dropbox.com/s/xoba7ibdqgt87c0/60.h5?dl=0
It should look like this:
One major blocker might be dask unit support, i.e. dask-unyt integration in our case. I came across some relevant discussions which might be worth noting down here for future reference.
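As a rough illustration of why units are a sticking point (a sketch only, assuming unyt is available): dask itself is unit-unaware, so the naive fallback is to compute first and attach units afterwards.

import numpy as np
import dask.array as da
import unyt

raw = da.ones((6,), chunks=3, dtype=np.float64)     # unitless dask array
result = unyt.unyt_array(raw.compute(), "kpc")      # units attached only after compute
print(result)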
We spent some time looking at dask in a group today (cc @jcoughlin11 @cassidymwagner @munkm) and came up with some things that we aren't sure how to address. But! We did get a really simple proof of concept going, although I should note that it does not lazy-load data; it does show how one might proceed to lazy loading, though.
Units are passed through this, too, which seems neat. (EDIT: Turns out, not quite on the units thing.)
import operator
import numpy as np
import dask.array as da

# dd is a yt data container, e.g. dd = ds.all_data() for an already-loaded dataset ds.
# Each task in the graph slices one chunk out of the particle_position field.
keys = {
    ('simple', 0, 0): (operator.getitem, dd["particle_position"], slice(0, 1000)),
    ('simple', 1, 0): (operator.getitem, dd["particle_position"], slice(1000, 11907080)),
}
chunks = ((1000, 11907080 - 1000), (3,))
arr = da.Array(keys, "simple", chunks, np.float64)
arr2 = arr**2
print(arr2.compute(), arr.compute(), arr2.shape)
It seems to me that this means that once we have a simple key/value mapping in a data object from "chunks" to their sizes (which may need to be on a per-field basis), we could issue dask arrays from data objects.
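A hedged sketch of that idea, with to_dask and the chunk-size map as hypothetical names rather than existing yt API:

import operator
import numpy as np
import dask.array as da

def to_dask(dobj, field, chunk_sizes, dtype=np.float64, name="yt-field"):
    # chunk_sizes: the hypothetical per-field list of chunk lengths for dobj
    offsets = np.concatenate([[0], np.cumsum(chunk_sizes)])
    graph = {
        (name, i, 0): (operator.getitem, dobj[field],
                       slice(int(offsets[i]), int(offsets[i + 1])))
        for i in range(len(chunk_sizes))
    }
    chunks = (tuple(chunk_sizes), (3,))  # assumes an (N, 3) field such as particle_position
    return da.Array(graph, name, chunks, dtype)

Calling to_dask(dd, "particle_position", [1000, 11907080 - 1000]) would reproduce the hand-built array from the proof of concept above.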