Manifest storage transformer
This issue describes a concept for a Zarr v3 Storage Transformer that enables generic indirection between Zarr chunk keys and the names of the underlying objects in a store. It is not a new idea (see below), but this design is meant to cover a broader set of use cases.
Goals
- Enable content-addressable storage schemes (see #82 for early proposal)
- Enable stores that reference bytes created outside Zarr (e.g. Kerchunk)
- Enable static snapshots of stores (https://github.com/zarr-developers/zarr-specs/issues/154)
- Enable concatenation of multiple arrays without copying any chunk data (https://github.com/fsspec/kerchunk/issues/377#issuecomment-1765449991)
- Enable creating Zarr stores that are a mix of "reference" arrays (i.e. Kerchunk) and native Zarr arrays
Design
There has been a lot written on this subject already (see issues linked above) so I'm going to attempt to jump straight into the design. The key difference between this design and prior proposals is that the manifest will be local to the Array. The reason for this is to increase the scalability, portability, and composability of the manifest concept.
Store layout
The manifest store layout will resemble that of a regular Zarr V3 store. Consider the following directory store representation:
a/zarr.json <- group metadata
a/foo/zarr.json <- array metadata
a/foo/manifest.json <- array manifest
...
b/baz/zarr.json <- array metadata
b/baz/c/1/1 <- "regular" chunk
...
Note: array a/foo is a manifest array but array b/baz is a regular zarr array.
Array metadata
Manifest style arrays will need to declare a storage transformer configuration:
{
    "node_type": "array",
    ...
    "storage_transformers": [
        {
            "name": "chunk-manifest-json",
            "configuration": {
                "manifest": "./manifest.json"
            }
        }
    ]
}
Note: the small manifests could also be inlined directly into the array metadata object.
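For example, an inlined variant might look something like this (shown as a Python dict mirroring the JSON; treating the ability of "manifest" to hold the mapping itself, rather than a path, as a hypothetical option):
# Hypothetical: manifest mapping embedded directly in the transformer
# configuration instead of pointing at ./manifest.json.
inlined_array_metadata = {
    "node_type": "array",
    # ... other required array metadata fields ...
    "storage_transformers": [
        {
            "name": "chunk-manifest-json",
            "configuration": {
                "manifest": {
                    "0.0.0": {"path": "s3://bucket/foo.nc", "offset": 100, "length": 100},
                    "0.0.1": {"path": "s3://bucket/foo.nc", "offset": 200, "length": 100},
                }
            },
        }
    ],
}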
Manifest object
In my example above, the array a/foo includes a manifest object (a/foo/manifest.json) which will store the mapping of chunk keys to keys in the store:
{
    "0.0.0": {"path": "s3://bucket/foo.nc", "offset": 100, "length": 100},
    "0.0.1": {"path": "s3://bucket/foo.nc", "offset": 200, "length": 100},
    "0.1.0": {"path": "s3://bucket/foo.nc", "offset": 300, "length": 100},
    "0.1.1": {"path": "s3://bucket/foo.nc", "offset": 400, "length": 100}
}
path would be the only required key; offset, length, checksum, etc. could be optional keys used to a) inform the store how to fetch bytes for the chunk or b) provide the store with additional metadata about the chunk.
Note 1: Kerchunk also supports inline data in place of the path; that could also be supported here. Note 2: I'm using JSON as the manifest format here, but many other options exist, including Parquet or even Zarr arrays.
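To make the optional keys and the inline-data idea concrete, entries might look something like this (the "checksum" and "data" key names, and the inline encoding, are placeholders rather than part of the proposal):
# Hypothetical manifest entries: "checksum" and "data" are illustrative key
# names only. An inline entry carries the encoded chunk bytes directly and
# omits "path"/"offset"/"length".
manifest_entries = {
    "0.0.0": {"path": "s3://bucket/foo.nc", "offset": 100, "length": 100,
              "checksum": "sha256:<hex digest>"},
    "0.0.1": {"data": "<base64-encoded chunk bytes>"},
}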
Concatenating arrays:
Edit: Feb 6 7:20p PT - After thinking about this more, I'm beginning to think serialization of concatenated arrays is a trickier problem than should be addressed in the initial iteration here. The main tricky bit is how to combine arrays with compatible dtypes/shapes/chunks but with differing codecs. Details from my original ideas below but consider this redacted from the proposal for now.
Details
One of the goals above is to enable concatenating multiple Zarr arrays. The manifest approach supports a zero-copy way to achieve this. The concept here closely resembles the approach from [Kerchunk's MultiZarrToZarr](https://fsspec.github.io/kerchunk/tutorial.html#combine-multiple-kerchunked-datasets-into-a-single-logical-aggregate-dataset), except that it targets individual arrays and could be made to work with any Zarr array (not just Kerchunk references). The idea is that concatenating arrays can be done in Zarr, provided a set of constraints are met, by simply rewriting the keys. Implementations could provide an API for doing this concatenation like:
arr_a: zarr.Array = zarr.open(store_a, path='foo') # shape=(10, 4, 5), chunks=(2, 4, 5)
arr_b: zarr.Array = zarr.open(store_b, path='bar') # shape=(6, 4, 5), chunks=(2, 4, 5)
arr_ab: zarr.Array = zarr.concatenate([arr_a, arr_b], axis=0, store=store_c) # shape=(16, 4, 5), chunks=(2, 4, 5)
In this example, zarr.concatenate would act similarly to numpy.concatenate, returning a new zarr.Array object after creating the new manifest in store_c. This could also be done in two steps by adding a save_manifest method to the Zarr arrays.
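As a rough sketch (the paths below are placeholders, and both source arrays are assumed to live in object storage with the default v3 chunk key encoding), the manifest written to store_c would reuse the existing chunk objects and only rewrite the chunk keys: arr_a contributes indices 0-4 along axis 0 and arr_b contributes indices 5-7:
# Hypothetical manifest for arr_ab: existing chunks are referenced in place;
# only the chunk keys are rewritten.
manifest_ab = {
    "0.0.0": {"path": "s3://bucket/store_a/foo/c/0/0/0"},
    # ... keys 1.0.0 through 4.0.0 reference arr_a's remaining chunks ...
    "5.0.0": {"path": "s3://bucket/store_b/bar/c/0/0/0"},
    # ... keys 6.0.0 and 7.0.0 reference arr_b's chunks c/1/0/0 and c/2/0/0 ...
}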
Possible extensions
I've tried very hard to keep the scope of this as small as possible. There are currently few v3 storage transformers to emulate so I think the best next step is to try out this simple approach before spending too much time on a spec or elaborating on future options. That said, there are some obvious ways to extend this:
- Supporting writes to manifest arrays (possible, there are many edge cases to consider)
- Enable content-addressable storage by hashing keys during writes (see the sketch after this list)
- Support non-JSON manifests (many options)
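As a rough sketch of the content-addressable idea, the hashing scheme and key naming here being purely illustrative:
# Illustrative only: hash the encoded chunk bytes, store the object under its
# digest, and record the digest in the manifest.
import hashlib

def write_chunk_content_addressed(chunk_key, chunk_bytes, object_store, manifest):
    digest = "sha256-" + hashlib.sha256(chunk_bytes).hexdigest()
    object_store[digest] = chunk_bytes  # object is named by its content
    manifest[chunk_key] = {"path": digest, "length": len(chunk_bytes)}

object_store, manifest = {}, {}
write_chunk_content_addressed("0.0.0", b"<encoded chunk bytes>", object_store, manifest)
# manifest now maps "0.0.0" to {"path": "sha256-...", "length": ...}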
Props
🙌 to those that have done a great job pushing this subject forward already: @martindurant, @alimanfoo, @rabernat among others.
In this proposal, what type of thing is arr_ab?
All three are zarr.Arrays. I'll add some clarification.
Can we write to zarr_ab? zarr_ab[0, 0] = 1?
after creating the new manifest in store_c
I think we should seriously consider a much lighter-weight concatenation method. What about just storing references to store_a and store_b, rather than duplicating the whole manifest? Basically how ncml works.
The advantages of this are that
- It doesn't require a chunk manifest. It works with vanilla Zarr arrays.
- It allows concatenation of arrays with different codecs and chunk sizes
- For arrays with manifests, it doesn't require duplicating all of the references
The metadata doc would somehow contain pointers to the other metadata docs. Something like
"concatenation": {
"axis": 0,
"arrays": ["../foo", "../bar"]
}
The one part I can't quite see is how to do the references to the arrays. Some sort of URL syntax? Absolute vs. relative paths?
Another way of putting it is that I think perhaps "chunk manifest" and "virtual concatenation of Zarr arrays" should be completely separable and orthogonal features.
Note that the kerchunk method and its child here already allow for content-addressable storage, e.g., IPFS. Not sure if you meant something beyond that. There has been chatter elsewhere of chunk checksums and such (stored in metadata, not the bytes of the chunk).
For the concatenation, I would want special attention paid to the multi-dimension case. Also, some consideration of groups-of-arrays which are concatenated together would be nice, but you might say that this is an xarray concern. Are you at all considering the case where the array's chunk grid does not align with the chunks being referenced?
Do I understand that you imagine an output metadata structure of the main "these are the arrays" and then a JSON for each of the target arrays? Or do you end up concatenating the reference lists somewhere along the way?
One important possible extension to consider along with those given - after a prototype is established - is that we now have a way to pass per-chunk information (analogous to the "context" I fought for), and so can have different behaviours for each chunk, like a different zero point in offset-scale filtering.
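For instance, per-chunk parameters might hypothetically ride along with the byte references (the key names below are illustrative only):
# Illustrative only: different offset/scale parameters for each chunk.
manifest = {
    "0.0.0": {"path": "s3://bucket/foo.nc", "offset": 100, "length": 100,
              "context": {"add_offset": 273.15, "scale_factor": 0.01}},
    "0.0.1": {"path": "s3://bucket/foo.nc", "offset": 200, "length": 100,
              "context": {"add_offset": 273.15, "scale_factor": 0.02}},
}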
Another way of putting it is that I think perhaps "chunk manifest" and "virtual concatenation of Zarr arrays" should be completely separable and orthogonal features.
I've come around on this, but not for exactly the same reason. I've now redacted my original proposal, which was not 100% thought through.
Note that the kerchunk method and its child here already allow for content-addressable storage, e.g., IPFS. Not sure if you meant something beyond that.
Certainly some parallels here but this could be done without IPFS. @alimanfoo's proposal in #82 is still a good read, despite using some now-outdated vernacular.
For the concatenation, I would want special attention paid to the multi-dimension case. Also, some consideration of groups-of-arrays which are concatenated together would be nice, but you might say that this is an xarray concern. Are you at all considering the case where the array's chunk grid does not align with the chunks being referenced?
Again, I'm going to remove this from the proposal. But I'll just say that there are some parallels with @d-v-b's proposal to "fix zarr-python's slicing" (https://github.com/zarr-developers/zarr-python/discussions/1603, https://github.com/zarr-developers/zarr-python/issues/980) - namely the creation of a lazy Zarr Array or ArrayView that wraps one or more Zarr arrays. If we take serialization off the table for now, we can think of this outside the spec conversation and explore how to address this at the implementation level.
Do I understand that you imagine an output metadata structure of the main "these are the arrays" and then a JSON for each of the target arrays? Or do you end up concatenating the reference lists somewhere along the way?
I was thinking of concatenating the references but have walked this back because you have to enforce that all array metadata is equivalent (e.g. codecs) for all concatenated arrays. @rabernat is suggesting another approach which could work to resolve those concerns.
This is very similar to the kerchunk Reference File System format but is not exactly the same JSON format: https://fsspec.github.io/kerchunk/spec.html
There are also at least a few implementations of the kerchunk json format outside of kerchunk itself:
- https://github.com/manzt/reference-spec-reader
- https://github.com/ksharonin/kerchunkC/
Would it be advantageous to use exactly the same format?
a few implementations of the kerchunk json format outside of kerchunk
Can you please put references? They might be useful for inspiration.
I updated my comment to include one other known implementation.
@martindurant Is there a document that describes the kerchunk parquet format?
No, but I could make one.
While we can all assume what s3:// means, in order for this to be fully specified, we also need to specify the meaning of the URLs. See https://github.com/zarr-developers/zeps/pull/48 for one proposal regarding URLs I created, but something more limited could also suffice.
Another issue to consider is the Confused deputy problem: user A might think they are writing to "s3://someone-elses-bucket/path" but actually end up writing with user A's credentials to "s3://user-a-private-bucket/other/path". Similarly, user A may think they are exposing "s3://someone-elses-bucket/path" over an HTTP server but actually end up sharing data from "s3://user-a-private-bucket/other/path" or "file:///etc/passwd".
No, but I could make one.
I think that would be very helpful.
@jbms - I have a few answers to your question of "why not use the kerchunk format":
- Kerchunk represents the entire store as a single manifest; my position is that splitting the manifest up per array will have significant benefits
- Kerchunk's JSON schema has some idiosyncrasies that make it difficult to use as a generic manifest - entries in the manifest are either a `str` or a `List[str, int, int]`. The JSON schema described above would be more extensible to future metadata (e.g. optional checksums).
@rabernat - missed your first comment:
Can we write to zarr_ab? zarr_ab[0, 0] = 1?
Perhaps! I have not covered this use case yet above but it could be possible. It would be tricky to update the manifest in a consistent way across multiple updates. I suggest we treat arrays with manifest storage transformers as read-only for this initial conversation.
Kerchunk represents the entire store as ...
kerchunk is amenable to change :). Especially if it can also maintain compatibility.
@jbms - I have a few answers to your question of "why not use the kerchunk format":
- Kerchunk represents the entire store as a single manifest; my position is that splitting the manifest up per array will have significant benefits
I can see that there are advantages to splitting but I think that is mostly orthogonal to the issue of the metadata format.
- Kerchunk's JSON schema has some idiosyncrasies that make it difficult to use as a generic manifest - entries in the manifest are either a `str` or a `List[str, int, int]`. The JSON schema described above would be more extensible to future metadata (e.g. optional checksums).
Yes there are some idiosyncrasies and I suppose kerchunk also assumes URLs are fsspec-compatible. Still given that it is designed to address essentially exactly the same thing as kerchunk, I think it would be desirable to avoid fragmentation if possible. Particularly since there is mention of not just a json format but also a parquet format, which kerchunk also has. Maybe Martin is open to evolving the format used by kerchunk? On the other hand given the nature of these manifest formats it is relatively easy to support multiple formats since you can just convert one to the other when you load it.
Martin is open to evolving the format used by kerchunk?
Yes, of course: we want everything to work well together. In the current design, I suppose it's already possible to "concatenate" a kerchunk-zarr with a normal zarr. (actually, kerchunk can also reference a zarr, so something like this was already possible on v2)
Also worth pointing out that kerchunk's current implementation has some specific v2 stuff in it, so something will have to change for v3 no matter what.
As I see it, this "manifest" format could be used as a key-value store adapter independent of zarr entirely, as a transparent layer below zarr that is not explicitly indicated in the zarr metadata (i.e. as kerchunk is currently used), or as a storage transformer explicitly indicated in the zarr metadata.
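A minimal sketch of that first usage, assuming the JSON manifest format proposed above and fsspec for fetching bytes:
# Read-only key-value adapter that resolves keys through a chunk manifest.
import json
import fsspec

class ManifestMapper:
    def __init__(self, manifest_path: str):
        with fsspec.open(manifest_path, "r") as f:
            self._manifest = json.load(f)

    def __contains__(self, key: str) -> bool:
        return key in self._manifest

    def __getitem__(self, key: str) -> bytes:
        entry = self._manifest[key]  # KeyError signals a missing chunk
        length = entry.get("length")
        with fsspec.open(entry["path"], "rb") as f:
            f.seek(entry.get("offset", 0))
            return f.read() if length is None else f.read(length)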
Re concatenation: I think as has been discussed that is not especially a practical use case even with variable-size chunks and instead we could discuss a solution for that independently, e.g. an explicit "concatenation" / "stack" extension for zarr. See this support in tensorstore for constructing virtual stacked/concatenated views (https://google.github.io/tensorstore/driver/stack/index.html).
One thing that would likely be important for concatenation is the ability to specify "cropping" and other coordinate transforms -- for that the "index transform" concept in tensorstore may be relevant to consider: https://google.github.io/tensorstore/index_space.html#index-transform
I realized my last answer may have unintentionally come off as critical of the Kerchunk project. Apologies if it came across that way. Kerchunk (@martindurant) has done us all a great service by showing us what is possible here. My point above was really trying to look forward and mesh the ideas Kerchunk has introduced with the Zarr storage transformer framework. And at the same time, opening some doors for additional extensions beyond those of the Kerchunk project.
Based on @martindurant's comments, it sounds like there is plenty of room to work together on what could be a new spec-compliant storage layout for Kerchunk.
I realized my last answer may have unintentionally come off as critical of the Kerchunk project.
Not at all, that's why we have these conversations. We already have redundant code for "view set of datasets" from xarray and dask, which have particular views on what arrays are and how they work.
I will say, though, that kerchunk aims to work beyond the netCDF model alone (xr trees to start, but more complex zarr group trees too) and even beyond zarr (e.g., from the simplest, supermassive compressed CSV with embedded quoted fields, to making parquet directory hierarchies and assembling feather 2 files from buffers). Whether those ideas are worth pursuing remains to be seen, but I expect there will always be some bespoke combine logic in the kerchunk repo.
it sounds like there is plenty of room to work together on what could be a new spec-compliant storage layout for Kerchunk.
Yes, from the combine user API to reference storage formats and more.
@jhamman what is the motivation for requiring the path key? We've run into a lot of issues related to determining whether a chunk is missing because it is entirely comprised of the fill_value or something going wrong during data production. Allowing all keys for a given chunk reference to be absent could provide a nice intermediate solution in that chunks could be explicitly defined as empty in the manifest but implicitly missing in the zarr store for space savings on sparse arrays. The space savings in the manifest itself seem minimal relative to convenience in identifying and verifying missing chunks, but I'm curious what factors I might be missing for this decision.
{
    "0.0.0": {"path": "s3://bucket/foo.zarr/precipitation/0.0.0"},
    "0.0.1": {"path": "s3://bucket/foo.zarr/precipitation/0.0.1"},
    "0.1.0": {},
    "0.1.1": {"path": "s3://bucket/foo.zarr/precipitation/0.1.1"}
}
Allowing all keys for a given chunk reference to be absent
@maxrjones FYI see https://github.com/TomNicholas/VirtualiZarr/issues/33#issuecomment-2000529283 for a related discussion about the same issue but for the in-memory ChunkManifest.
Hey, throwing out another "manifestation" :laughing: of this Manifest Storage Transformer idea. It is essentially what @jhamman has proposed. Please ignore the naming because it was created before what is now known as Zarr Sharding. "Manifest Storage Transformer" is a great name. Another one could be "Composite Store".
There is a JSON manifest of other stores and their associated path for a group or array dimension. The configuration / schema of that manifest is ad-hoc based on the zarr-python store construction, but it would be better to standardize on something like what @jbms proposed in ZEP 8.
What is neat is that it demonstrates how simple an implementation can be and that it can also be reasonably performant. It uses python dictionaries / hash maps for fast look-up, but it should be easily adapted to other languages.
@thewtex If I understand correctly, you are proposing that the "manifest", in addition to mapping individual keys to URLs, could also map key prefixes (or more generally, arbitrary key ranges) to URL prefixes.
I would definitely support that addition.
By defining it in terms of arbitrary key prefixes / key ranges, it doesn't need to be specific to zarr at all.
@jbms yes, you are right. Simplicity could be helpful and powerful here.
From my perspective, there are three big wins:
- The ability to scale to extremely large aggregate stores (many environments have 32 GB, etc. limits).
- The ability to transform components. In the test implementation there is a `map_shards` feature. From the perspective of creating zarrs, this supports the workflow: 1) write part of the dataset in parallel to a local directory store; 2) do some conditioning on that store, like re-chunking, re-encoding, transforming into a zip store, etc.; 3) migrate from local to remote storage.
- Support content-addressed storage like IPFS, where the Merkle tree can be broken out into multiple smaller Merkle trees and this higher-level manifest.
If I understand correctly, you are proposing that the "manifest", in addition to mapping individual keys to URLs, could also map key prefixes (or more generally, arbitrary key ranges) to URL prefixes.
I'm not sure I understand what this means. Can someone give a concrete example?
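A minimal sketch of what such a prefix mapping might look like, assuming a hypothetical "prefixes" section alongside ordinary per-key entries (any key beginning with a listed prefix resolves by substituting the matching URL prefix):
# Hypothetical layout; the "prefixes"/"chunks" structure is assumed, not proposed.
manifest = {
    "prefixes": {
        # every key under "c/" maps to the same key under this URL prefix
        "c/": "s3://bucket/other-array/c/",
    },
    "chunks": {
        # explicit per-key entries can still be listed alongside
        "c/0/0": {"path": "s3://bucket/foo.nc", "offset": 100, "length": 100},
    },
}
# e.g. key "c/3/7" resolves to "s3://bucket/other-array/c/3/7"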
@jhamman How hard would it be to support appending to one dimension of a chunk manifest? People are asking for that feature in VirtualiZarr (https://github.com/TomNicholas/VirtualiZarr/issues/21), and I could imagine a neat interface like xarray's ds.to_zarr(store, append_dim=...), where ds contains ManifestArray objects. But I'm not sure if trying to overwrite the manifest.json after it's been written might create consistency issues...? I guess maybe it's not that different to the overwriting of zarr array metadata that must already happen in to_zarr when appending?