Refactor MultiZarrToZarr into multiple functions
Problem
MultiZarrToZarr is extremely powerful but rather hard to use.
This is important - kerchunk has been transformative, so we increasingly recommend it as the best way to ingest large amounts of data into the pangeo ecosystem's tools. However that means we should make sure the kerchunk user experience is smooth, so that new users don't get stuck early on.
Part of the problem is that this one MultiZarrToZarr function can do many different things. Contrast with xarray - when combining multiple datasets into one, xarray takes some care to distinguish between a few common cases/concepts (we even have a glossary); a minimal code sketch of these cases follows the list:
- Concatenation along a single existing dimension. Achieved by xr.concat where dim is a str.
- Concatenation along a single new dimension (optionally providing new coordinates to use along that new dimension). Achieved by xr.concat where dim is a set of values.
- Merging of multiple variables which already share dimensions, first aligned according to their coordinates. Achieved by xr.merge.
- "Combining" by order given, which means some ordered combination of concatenation along one or more dimensions and/or merging. Achieved by xr.combine_nested.
- "Combining" by coordinate order, which again means some ordered combination of concatenation along one or more dimensions and/or merging, but the order is specified by information in the datasets' coordinates. Achieved by xr.combine_by_coords.
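A sketch of those five cases with toy in-memory datasets (the datasets and values here are purely illustrative):
import pandas as pd
import xarray as xr

# two toy datasets that tile the "time" dimension
ds1 = xr.Dataset({"temp": ("time", [1.0, 2.0])}, coords={"time": [0, 1]})
ds2 = xr.Dataset({"temp": ("time", [3.0, 4.0])}, coords={"time": [2, 3]})

# 1. concatenation along an existing dimension (dim is a str)
xr.concat([ds1, ds2], dim="time")

# 2. concatenation along a new dimension (dim supplies the new coordinate values)
xr.concat([ds1, ds1 + 10], dim=pd.Index(["run_a", "run_b"], name="run"))

# 3. merging variables that already share dimensions, aligned on their coordinates
xr.merge([ds1, ds1.rename({"temp": "salinity"})])

# 4. "combining" by the order the datasets are given
xr.combine_nested([ds1, ds2], concat_dim="time")

# 5. "combining" by coordinate order read from the datasets themselves
xr.combine_by_coords([ds2, ds1])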
In kerchunk it seems that the recommended way to handle operations resembling all 5 of these cases is through MultiZarrToZarr. It also cannot currently easily handle certain types of multi-dimensional concatenation.
Suggestion
Break up MultiZarrToZarr by defining a set of functions similar to xarray's merge/concat/combine/unify_chunks that consume and produce VirtualZarrStore objects (EDIT: see https://github.com/fsspec/kerchunk/issues/375).
Advantages
- We can replace/deprecate the heavily overloaded and unintuitive coo_map kwarg (it has 10 possible input types!). Perhaps simply giving an ordered list of coordinate values would be sufficient, and we could just make it easier for the user to extract the values they want from the VirtualZarrStore objects they want to concatenate.
- If users need to do something really unusual they can more easily break their problem up into concatenating each array separately (e.g. for concatenating on staggered grids).
- Might generalise to later ZEPs more easily (e.g. understanding variable-length chunks, cc @ivirshup, see https://github.com/fsspec/kerchunk/pull/374).
- Can be thought of as a refactoring that moves some pangeo-forge functionality upstream, reducing redundancy. We shouldn't have 3 completely different designs for multidimensional concatenation in adjacent libraries in the stack.
- These new functions would be more useful as basic primitives for parallelization frameworks to call (e.g. doing tree reduction via dask, beam, or cubed, as sketched below), rather than trying to wrap calls to those frameworks within kerchunk (like kerchunk.combine.auto_dask does).
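A rough illustration of the tree-reduction idea, with pairwise_combine / concat_refs standing in for whichever hypothetical pairwise combine primitive kerchunk would expose:
def tree_combine(items, pairwise_combine):
    """Combine a list of reference sets by repeated pairwise combination.

    Each round halves the number of items, so the reduction depth is O(log n).
    This serial loop only illustrates the shape of the reduction: within a
    round every pairwise_combine call is independent, so a framework like
    dask, beam, or cubed could execute them in parallel.
    """
    while len(items) > 1:
        pairs = [items[i:i + 2] for i in range(0, len(items), 2)]
        items = [
            pair[0] if len(pair) == 1 else pairwise_combine(pair[0], pair[1])
            for pair in pairs
        ]
    return items[0]

# usage (hypothetical): combined = tree_combine(reference_sets, concat_refs)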
Questions
- How close can these functions be to xarray's version of merge/concat/combine? And what can we learn from the design decisions in pangeo-forge-recipes' FilePattern? (@cisaacstern @rabernat)
- How close are kerchunk's existing combine.merge_vars and combine.concatenate_arrays functions to providing this functionality? If the answer is "pretty close", then how much of this issue could be solved via documentation?
I've been thinking of something similar. I'm not sure we can get all the flexibility we want out of just two functions, but I think a couple of classes may do the trick. I've been thinking of an API like:
g = kerchunk.KerchunkGroup()
# issubclass(KerchunkGroup, MutableMapping[str, KerchunkGroup | KerchunkArray])
# Concatenation + assignment, following array-api
g["x"]: kerchunk.KerchunkArray = kerchunk.concat([...], axis=0)
# Write a manifest
g.to_json(...)
# Getting a zarr group:
g.to_zarr()
# Reading from a manifest
g2 = kerchunk.KerchunkGroup.from_json(...)
# Assignment
g2["new_group"] = g
# Single level merging
g.update(g2)
# Add an array from a different store:
g.create_group("my").create_group("new")
g["my"]["new"]["array"] = zarr.open("matrix.zarr"): zarr.Array
Then leave anything more complicated to something like xarray, which could re-use its combination logic on a collection of KerchunkArrays.
This suggestion has a lot of the functionality already in concatenate_arrays and merge_vars, but:
- Makes manipulation of the hierarchy much easier (with the other two you have to manually prepend/append to the keys)
- A user-friendly and easy-to-introspect interface
- Possible performance benefits by avoiding repeated JSON string serialization
This idea may be more of a zarr thing than a kerchunk thing though.
Also, I think you may be missing a reference to https://github.com/fsspec/kerchunk/issues/375 under the "Suggestion" heading.
Right now we are using Kerchunk as a workaround for things that Zarr itself does not support, by implementing various types of indirection on chunk keys at the Store level. I think this is a great way to prototype new possibilities. The downside is that only kerchunk-aware clients can ever hope to read these datasets. Also, from a design point of view, stores are technically ignorant of the Zarr data model (arrays, groups, etc.); they are formally just low-level key value stores.
So when it comes time to refactor those prototypes into more stable interfaces, I think we should seriously consider upstreaming some of these features into Zarr itself, possibly via new Zarr extensions. The advantage of that approach is that
- We can implement these features via a specification, with an aim for multi-language support
- We can leverage Zarr's understanding of the data model (arrays, groups, chunks, dimensions, etc.) to make the implementation cleaner
I believe that all 5 of the features above could be implemented in this way. In particular, the concept of a "chunk manifest" (a document which stores information about where to find the chunks for an array, rather than assuming they follow the Zarr chunk encoding scheme) would get us very far. The outcome of this would be a Zarr array with much more powerful virtualization capabilities, something between Xarray's duck array types and the current Kerchunk situation.
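As a purely illustrative example (the actual schema would be up to the specification), a chunk manifest might be a simple mapping from chunk index to the location and byte range of that chunk's data:
# Hypothetical chunk manifest for a 2D array with a 2x2 chunk grid, stored
# alongside the usual Zarr array metadata. Each entry says where the bytes for
# that chunk live, rather than assuming the standard "<array>/<i>.<j>" key scheme.
manifest = {
    "0.0": {"path": "s3://bucket/file1.nc", "offset": 4000, "length": 32000},
    "0.1": {"path": "s3://bucket/file1.nc", "offset": 36000, "length": 32000},
    "1.0": {"path": "s3://bucket/file2.nc", "offset": 4000, "length": 32000},
    "1.1": {"path": "s3://bucket/file2.nc", "offset": 36000, "length": 32000},
}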
We would be eager to collaborate on this and have some work in this direction planned for this quarter.
To be more concrete about what I mean, we could imagine writing code like this
import zarr
a1 = zarr.open("array1.zarr") # zarr.Array
a2 = zarr.open("array2.zarr") # zarr.Array
# this would create a Zarr array with the correct extensions to support indirection to a1 and a2
# (need to figure out the right API)
a3 = zarr.concat([a1, a2], axis=0, store="array3.zarr")
a3[0] = 1 # write some data...automatically goes into the correct underlying array
We can implement these features via a specification, with an aim for multi-language support
We definitely want to see multi-language support (cc: @manzt)
Do you have thoughts on what extensions to zarr would need to be made? Definitely some sort of specification for being able to pass around manifests. But maybe this means needing to define some set of stores as well?
Does mapping an HDF5 file to zarr fit as a zarr extension? As a high-priority one?
we could imagine writing code like this
a3 = zarr.concat([a1, a2], axis=0, store="array3.zarr")
My idea currently would be that you don't need to assign to a store on concatenation. I think this could be a purely in-memory object, where you resolve everything at access or storage time.
I'd also even say I'm not sure a "reference"-based array should be an instance or subclass of zarr.Array. Maybe a sibling class, or just a shared interface, would work. I'm thinking of use cases like creating virtual arrays which don't use the exact same codecs (like HDF5 virtual datasets allow).
a3[0] = 1 # write some data...automatically goes into the correct underlying array
I've mostly been thinking about read access at the moment. Not sure how this complicates things.
We would be eager to collaborate on this and have some work in this direction planned for this quarter.
Me too!
I've mostly been thinking about read access at the moment. Not sure how this complicates things.
Read-only is of course easier, but it is also just a special case of read-write. Our arraylake friends have a well-defined model of where and how writes should go regardless of where the original mapper/chunks were defined.
We definitely want to see multi language support (cc: @manzt)
Awesome. Yah, something that should be very possible with zarrita.js.
Repeating https://observablehq.com/@manzt/ome-tiff-as-filesystemreference using some of these newer ideas would be awesome :) (and note that that original example was not, of course, a standard xarray-compatible dataset, but a subscale pyramid)
Sorry I've been out of the kerchunk loop for a bit but I've been trying to stay abreast of changes. zarrita.js has an HTTP-based reference store (and will try to map references with s3/gcs protocols to http endpoints). This higher-level virtualization of zarr is very exciting.
(sorry I haven't yet had the chance to review this)
Thanks @rabernat - I think the proposal of defining how arrays should be concatenated at the Zarr level is really cool, and I would be interested to help. I also think that working this out consistently at the Zarr level would help expose the ways in which the current kerchunk API brushes subtleties under the rug.
One of these subtleties actually just came up for me in practice in #386 - I got confused by whether concatenation in Kerchunk was supposed to follow xarray rules or lower-level array-like rules instead.
We therefore have to be really clear about our type systems, as we now have at least 3 different data structures for which concatenation would be defined differently:
- numpy-like arrays - no dimension names, no index-backed coordinates, so concatenation just follows axis number,
- zarr (v3) arrays - have dimension_names, but no index-backed coordinates, so how should concatenation work?
- xarray objects - have dimension names and index-backed coordinates, so concatenation follows dimension name but has to make exceptions for index-backed coordinates (see https://github.com/fsspec/kerchunk/issues/386#issuecomment-1792746248).
If we could get this straight it would be really powerful though. Also, ultimately kerchunk might not need to deal with combining at all; it would become exclusively a collection of zarr readers for different legacy file formats.
zarr (v3) arrays - has dimension_names
Aren't dimension names optional? In which case these should just behave like numpy arrays?
From https://zarr-specs.readthedocs.io/en/latest/v3/core/v3.0.html#dimension-names
If specified, must be an array of strings or null objects with the same length as shape. An unnamed dimension is indicated by the null object. If dimension_names is not specified, all dimensions are unnamed.
Aren't dimension names optional?
Oh I guess so!
In which case these should just behave like numpy arrays?
But obviously propagating the dimension_names in a transparent and consistent way. Doesn't that potentially raise problems already? Like:
z1 = zarr.Array(data_1d, dim_names=['t'])
z2 = zarr.Array(data_1d, dim_names=['x'])
zarr.concat([z1, z2], axis=0)  # well-defined numpy-like operation, should return a 1D array, but what dim_name should the result have?
I'm just saying that having Zarr array type(s) that explicitly encode these propagation rules would be far better than implicitly doing this stuff in the depths of kerchunk.
EDIT: It reminds me of this suggestion https://github.com/data-apis/array-api/issues/698#issuecomment-1791959983
I am weakly against extracting the logic and putting it in zarr. Zarr-python has proven to be very slow moving compared to the needs of kerchunk, and I feel like we probably will have a wider set of cases we want to cover than zarr wants. I am aware that to some extent we end up replicating stuff in xarray (like the encoding stuff), but I don't mind that for the sake of being explicit. What we really need is a good set of test cases to describe expected behaviour.
I would also point out that kerchunk aims to be useful beyond zarr, and that it will be a place to test out cutting edge zarr features (like var-chunks) too.
I wonder how we can get together kerchunk-interested people? Do we need a dedicated slack or discord or something?
But obviously propagating the dimension_names in a transparent and consistent way. Doesn't that potentially raise problems already?
I think it's reasonable to let people do concatenation like numpy arrays regardless of what the dimension names are. Following the array API as much as possible would be a good idea, just to reduce the amount of new things one needs to learn.
Could something like the merge logic in xarray be used for handling dimension names and other metadata (i.e. anything in attrs)? E.g. rules like "unique", "first"?
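As a sketch of what such rules could mean in practice (combine_metadata here is a hypothetical helper, not an existing xarray or kerchunk function):
def combine_metadata(values, rule="unique"):
    """Combine one metadata field (e.g. a dimension name or an attr) from several arrays.

    rule="first": take the first non-None value.
    rule="unique": require all non-None values to agree, else drop (return None).
    """
    present = [v for v in values if v is not None]
    if not present:
        return None
    if rule == "first":
        return present[0]
    if rule == "unique":
        return present[0] if all(v == present[0] for v in present) else None
    raise ValueError(f"unknown rule: {rule}")

# e.g. combine_metadata(["t", "t", None], rule="unique") -> "t"
#      combine_metadata(["t", "x"], rule="unique")       -> None (conflict dropped)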
I've been thinking about operations other than concatenation that are achievable with kerchunk:
- Transpose/permute dims (re-order chunks + add a transpose codec)
- Rename dimensions
- Modify metadata
- Limited subsetting
It would be nice to have these operations exposed in a way that doesn't require rewriting actual chunk data.
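For instance, renaming a dimension is conceptually just a metadata rewrite. A rough sketch against the "refs" mapping of a version-1 kerchunk reference set, assuming the usual xarray _ARRAY_DIMENSIONS convention (rename_dim is a hypothetical helper, not an existing kerchunk function):
import json

def rename_dim(refs, array_name, old, new):
    """Rename one dimension of one array in a kerchunk reference mapping, in place.

    Only the array's .zattrs entry is touched; no chunk data is read or rewritten.
    """
    key = f"{array_name}/.zattrs"
    attrs = json.loads(refs[key])
    dims = attrs.get("_ARRAY_DIMENSIONS", [])
    attrs["_ARRAY_DIMENSIONS"] = [new if d == old else d for d in dims]
    refs[key] = json.dumps(attrs)
    return refs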
I wonder how we can get together kerchunk-interested people? Do we need a dedicated slack or discord or something?
If there was a zarr zulip/discord/slack, maybe kerchunk could fit in there?
I think it's reasonable to let people do concatenation like numpy arrays regardless of what the dimension names are. Following the array API as much as possible would be a good idea, just to reduce the amount of new things one needs to learn.
I second this, especially given that not all HDF or other files will be netCDF compliant anyway, and it would be good to have an opinion on what to do with them.
Transpose/permute dims (re-order chunks + add a transpose codec); Rename dimensions; Modify metadata; Limited subsetting
Xarray already has an array wrapper type that does all this except concatenation and metadata handling: https://github.com/pydata/xarray/issues/5081.
For a long time I've felt like there's code we can reuse from dask for the concatenation piece. A LazilyConcatenatedArray is really just a chunked array. And then when coerced to numpy, use dask's _concatenate3 or np.block.
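A minimal sketch of that final materialization step with np.block (plain numpy arrays stand in here for the lazily loaded chunks):
import numpy as np

# Four chunks of a 4x4 array, laid out as a 2x2 chunk grid.
chunks = [
    [np.zeros((2, 2)), np.ones((2, 2))],
    [np.full((2, 2), 2.0), np.full((2, 2), 3.0)],
]

# np.block assembles the nested chunk grid into one contiguous array,
# which is what coercing a lazily concatenated array to numpy amounts to.
result = np.block(chunks)
assert result.shape == (4, 4)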
From kerchunk's point of view, though, we want to figure out where references belong, rather than building a dask graph. The code would probably be quite different?
I was wondering whether dask concatenation could get around the "different chunking" problem beyond var-chunks; i.e., each input dataset is first rechunked to a common grid and then concatenated, making an aggregate of aggregate datasets. That would be the only thing you could do in some circumstances. If the dask chunks are bigger than the native chunks, it may even be reasonably performant.
Xarray already has a array wrapper type that does all this except concatentation and metadata handling:
A LazilyConcatenatedArray is really just a chunked array.
This is a really useful observation, and I agree we could probably re-use code from xarray & dask. I think we should avoid a dependency on either though.
EDIT: Avoiding a dependency on xarray would be good motivation to create the standalone lazy arrays package.
I am weakly against extracting the logic and putting it in zarr. Zarr-python has proven to be very slow moving compared to the needs of kerchunk, and I feel like we probably will have a wider set of cases we want to cover than zarr wants.
I would also point out that kerchunk aims to be useful beyond zarr, and that it will be a place to test out cutting edge zarr features (like var-chunks) too.
To satisfy both of these concerns at once we would need to make a 3rd repository just for VirtualZarr. I suppose that could then eventually be moved upstream into Zarr. Not sure what people think about that idea.
Why not make that thing here in kerchunk?
Why not make that thing here in kerchunk?
Maybe it could, but I'm imagining there might be other uses for the VirtualZarr idea that aren't just in the kerchunk package? e.g. the ArrayLake stuff?
I do also think now that it would be conceptually neat to separate a concatenateable VirtualZarr class (/ "chunk manifest") from the kerchunk code that actually reads specific legacy file formats. I don't really have a strong opinion on this though.
Since the kerchunk package doesn't depend on other things, I think it might be a good place, perhaps just to start. I am happy to include more repo collaborators, if that's a blocker.
Xarray already has an array wrapper type that does all this except concatenation and metadata handling: https://github.com/pydata/xarray/issues/5081.
@dcherian I think the indexing capabilities of LazilyIndexedArray may be too powerful here. My understanding was that it can handle indexing statements that kerchunk can't, since kerchunk needs all indexing to basically be picking chunks.
Though I would really love to see a split-out LazilyIndexedArray. May be able to solve a big class of issues we have over on anndata.
I do also think now that it would be conceptually neat to separate a concatenateable VirtualZarr class (/ "chunk manifest") from the kerchunk code that actually reads specific legacy file formats. I don't really have a strong opinion on this though.
I would have some interest in the in-memory class being something that can be written to a chunk manifest, but not necessarily containing a chunk manifest. Some things I think this could help with:
- Not needing to serialize/deserialize + parse so often
- Integrating in-memory arrays into a virtual structure without needing to make extra copies (plus ^)
- virtual_group["a"] = virtual_array
Why not make that thing here in kerchunk?
I think it would make sense to have some prototyping before adding these to an established library.
If we have a zarr.Array/kerchunk.Array class that supports concatenation following the array API, could we imagine just using xarray as-is to replace MultiZarrToZarr?
We could make a kerchunk backend for xarray that returns zarr.Array objects; the user then uses the familiar xarray API to concatenate/merge everything together, and kerchunk provides a utility to serialize the resulting reference files. Users could literally just open their netCDF files in the same way they normally do:
ds = xr.open_mfdataset(
    '/my/files*.nc',
    engine='kerchunk',  # kerchunk registers an xarray IO backend that returns zarr.Array objects
    combine='nested',   # 'by_coords' would require actually reading coordinate data
    parallel=True,      # would use dask.delayed to generate reference dicts for each file in parallel
)

ds  # now wraps a bunch of zarr.Array / kerchunk.Array objects directly

ds.kerchunk.to_zarr(store='out.zarr')  # kerchunk defines an xarray accessor that extracts the zarr arrays and serializes them (which could also be done in parallel if writing to parquet)
Pros:
- Users don't have to learn a new syntax for combining datasets
- Reads dimension names at the start and propagates them through
- Metadata can be handled at the xr.Variable level (i.e. xarray's attrs)
- Handles zarr groups for free via xarray-datatree
- I don't think this would even need any changes in xarray to work (?)
- Removes one of the 3 places we have defined different concatenation logic (xarray/~~kerchunk~~/pangeo-forge)
- Backwards-compatible
- Generation of references is parallelizable through the parallel=True argument to open_mfdataset, and concatenation of references is parallelizable by wrapping the kerchunk.Array object with dask/cubed.Array (using their tree-reduce primitives), so no need for kerchunk.combine.auto_dask either
- Would therefore obviate the need for the entire kerchunk.combine module
Cons:
- Concatenation logic expects ARRAY_DIMENSIONS to exist
- Users might not find xarray's concatenation logic intuitive for some things they want to do with kerchunk
- Xarray's data model and zarr's data model are not identical, which might cause problems
- Would be better if we could build xarray objects without indexes, and speed up open_mfdataset
- Maybe something I'm missing?
Additional con: that would require opening and reading potentially thousands of reference sets at open_dataset time. If they all formed a regular grid such that you know where everything goes without opening them (from some naming convention or other manifest), well then you've done MultiZarrToZarr already.
opening and reading potentially thousands of reference sets
Doesn't that step have to happen however we do this?
If they all formed a regular grid such that you know where everything goes without opening them
The various keyword arguments to xr.open_mfdataset allow you to distinguish between the cases of
(1) I am confident that the data lies in a regular grid, please just lazily open and concatenate everything,
(2) I am not confident, so please load coordinates and check they align etc. before you concatenate anything.
then you've done MultiZarrToZarr already.
In what sense? You still need to actually create the correct json reference file, or are you saying that should be trivial if you are 100% confident in the layout of the files a priori?
Doesn't that step have to happen however we do this?
In the kerchunk model, it happens once; in the concat model, it happens every time.
You still need to actually create the correct json reference file, or are you saying that should be trivial if you are 100% confident in the layout of the files a priori?
Both: if you need a manifest, someone has to make it, and kerchunk already can for a range of cases. xarray can do it for a set of already-open datasets but it cannot persist the overall dataset state. Do we need a third way?
If we can infer position from names (or something else that is simple), then making the manifest/reference set is indeed trivial. https://nbviewer.org/github/fsspec/kerchunk/blob/main/examples/earthbigdata.ipynb was an example of the latter (4D and 5D data), no sub-file coord decode required (but the parameter space is truly massive, and a pre-pass to see which parts are populated was really helpful).
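A toy sketch of that kind of name-based placement (the data_<index>.nc naming and the helper are purely illustrative, not the earthbigdata layout; a real implementation would also have to update the concatenated shape in each .zarray, which is omitted here):
import re

def combine_by_filename(per_file_refs):
    """Toy combine: infer each file's index along a new leading axis from its
    name and prepend that index to every chunk key in its reference set."""
    combined = {}
    for fname, refs in per_file_refs.items():
        t = int(re.search(r"data_(\d+)\.nc", fname).group(1))
        for key, ref in refs.items():
            if key.endswith((".zarray", ".zattrs", ".zgroup")):
                combined.setdefault(key, ref)  # keep the first copy of metadata
            else:
                array, chunk = key.rsplit("/", 1)
                combined[f"{array}/{t}.{chunk}"] = ref  # e.g. "temp/0.0" -> "temp/3.0.0"
    return combined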
You still need to actually create the correct json reference file
if you need a manifest, someone has to make this, and kerchunk already can for a range of cases. xarray can do it for a set of already-open datasets but it cannot persist the overall dataset state. Do we need a third way?
I would think that the kerchunk.Array or concatenated zarr.Array inside the xarray object could be accessed, which would then create the manifest. So the underlying manifest creation is still handled by kerchunk (or zarr if this logic ends up there), but one could use the xarray API to do the manipulation.
If this is possible, I think it would be a very nice composition of library functionality.
I would think that the kerchunk.Array or concatenated zarr.Array inside the xarray object could be accessed, which would then create the manifest. So the underlying manifest creation is still handled by kerchunk (or zarr if this logic ends up there), but one could use the xarray API to do the manipulation.
Could you perhaps describe in more detail how you see the pieces fitting together? I am having trouble envisioning this.