Add in-memory arrays to `ManifestStore`
tl;dr: I think we should add in-memory arrays to ManifestStore, probably by generalizing the wrapped type to ManifestArray | np.ndarray.
Context
This library has evolved from managing only "virtual" xarray datasets into a design where ManifestStore really should be the primary entity, with xarray as an optional dependency. Icechunk and Kerchunk are already optional dependencies, and ManifestStore should be able to stand alone as independently useful.
However, ManifestStore is currently lacking a way to represent "inlined" chunks. Comparing different representations of virtual/non-virtual chunks, this absence becomes conspicuous:
| - | ManifestStore |
"virtual" xarray.Dataset |
Icechunk format | Kerchunk format |
|---|---|---|---|---|
| external references | ManifestArray |
xr.Variable wrapping ManifestArray (a so-called "virtual variable") |
virtual chunk ref | kerchunk reference stored in JSON |
| "inlined" references | ❌ | xr.Variable wrapping (lazy) np.ndarray (a so-called "loadable variable") |
native chunk ref (also potentially inlined into manifests, but this is an optimization that's an internal implementation detail) | encoded bytes stored in JSON |
This causes multiple problems:
- We can't roundtrip
kerchunk -> ManifestStore -> virtual xarray.Dataset -> kerchunkif the original kerchunk file contains inlined references. (#489) - Similarly it limits the utility of proposed
.to_icechunk/.to_kerchunkmethods onManifestStore, as they can't handle all the cases thatds.vz.to_icechunk/kerchunkcan. (#591). Withhout this xarray can't really be an optional dependency. - It tempts us to hack around this using a memorystore (see https://github.com/zarr-developers/VirtualiZarr/issues/636), but I think this is the wrong design...
Proposal: Add numpy.ndarray as a wrapped type in ManifestStore
Currently ManifestStore holds ManifestGroups, which have ._members: Mapping[str, "ManifestArray | ManifestGroup"]. If we generalize this to ._members: Mapping[str, "np.ndarray | ManifestArray | ManifestGroup"] we can store inlined chunks in memory in the ManifestStore as numpy arrays.
We would need to ensure that xr.open_dataset(ManifestStore) could still read these arrays, which might require something clever during ManifestStore.get.
vz.open_virtual_dataset (really just a backwards-compatibility wrapper around ManifestStore.to_virtual_dataset()) also needs to know that any array stored as a numpy array is to become a loadable variable.
This also means parsers effectively would now have the power to decide to create "loadable variables". This actually makes some sense - open_virtual_dataset has a problem in that its' loadable_variables kwarg mixes concerns: sometimes you want the variable to be loaded so that it can be used for xarray indexing, sometimes you want it to be loaded as a storage optimization when you finally serialize the references out to kerchunk/icechunk. It's weird that we have one knob to control two things.
Note that storing as a numpy array directly means no zarr metadata for that variable. But I think that's fine - we don't store zarr metadata for loadable variables inside "virtual" xarray Datasets anyway.
Alternative considered (but I vote against this): Use obstore.MemoryStore
In #636 and #794 it was suggested that we store inlined chunks in the ManifestStore by adding obstore.MemoryStore to the ObjectStoreRegistry and referencing the bytes via virtual chunk references of the form memory://{location}. This is neat in that you can read the bytes back directly using obstore and zarr-python.
The reason I don't think this is a good idea is that it breaks the abstraction of the store. Our current design has the very neat property that ObjectStoreRegistry can be any obstore.Store, and we treat all of them identically, no matter their contents. But we would instead be special-casing obstore.MemoryStore compared to other obstore Store types. We also want to be able to treat all icechunk stores identically (note Icechunk has in_memory_storage and supports virtual references beginning with memory:// too). Special-casing memory:// or some namespace within memory:// breaks this abstraction.
Some other problems special-casing MemoryStore would cause:
- When calling
ManifestStore.to_icechunkon an icechunkstore that supports virtual references pointing at in-memory locations, how would we confidently distinguish amemory://reference that represents an inlined chunk from amemory://reference that represents a virtual chunk? - How would we be sure to avoid name collisions inside the
memory://namespace? #794 does this by having the kerchunk parser write inlined data into"memory://kerchunk/", but this seems like a leaky abstraction. That idea also means that theManifestStorecontents depend on the parser they were created by, which is also leaky.
cc @maxrjones @sharkinsspatial @ilan-gold @paraseba
I like the idea. I think the implementation simpler with this modification:
| - | ManifestStore |
"virtual" xarray.Dataset |
Icechunk format | Kerchunk format |
|---|---|---|---|---|
| external references | ManifestArray wrapping ChunkManifest |
xr.Variable wrapping ManifestArray (a so-called "virtual variable") |
virtual chunk ref | kerchunk reference stored in JSON |
| "inlined" references | ManifestArray wrapping nd.array |
xr.Variable wrapping (lazy) np.ndarray (a so-called "loadable variable") |
native chunk ref (also potentially inlined into manifests, but this is an optimization that's an internal implementation detail) | encoded bytes stored in JSON |
in this case, we'd keep:
ManifestStore._members: Mapping[str, "ManifestArray | ManifestGroup"]
and modify ManifestArray to have:
_manifest: `ChunkManifest` | `np.ndarray`
_metadata: ArrayV3Metadata
This would keep all the objects in a ManifestStore compatible with Zarr's data model and prevent needing to infer array metadata from the ndarray properties inside store.get() at runtime.
Separately, I wonder if we can type this using the Array API rather than locking parsers into using NumPy objects.
https://github.com/zarr-developers/zarr-extensions/issues/22 is also relevant here (cc @d-v-b if you have thoughts)
prevent needing to infer array metadata from the
ndarrayproperties insidestore.get()at runtime.
I agree this would be nice.
However I don't think this proposal really works
_manifest: ChunkManifest | np.ndarray
because np.ndarray does not have an interchangeable API with ChunkManifest.
We could do something like
_manifest: ChunkManifest | InlinedChunk
where the new InlinedChunk class exposes the same API as ChunkManifest but also wraps a np.ndarray.
One important thing to decide is whether the in-memory representation is meant to be of one chunk or many chunks. In general we have been assuming that:
- An "inlined" or "loadable variable" is a single chunk
- A "virtual" zarr array consists of either all virtual chunks or all native chunks.
Neither Icechunk nor Kerchunk assume either of these things.