VirtualiZarr Add in-memory arrays to `ManifestStore`

tl;dr: I think we should add in-memory arrays to ManifestStore, probably by generalizing the wrapped type to ManifestArray | np.ndarray.

Context

This library has evolved from managing only "virtual" xarray datasets into a design where ManifestStore really should be the primary entity, with xarray as an optional dependency. Icechunk and Kerchunk are already optional dependencies, and ManifestStore should be able to stand alone as independently useful.

However, ManifestStore is currently lacking a way to represent "inlined" chunks. Comparing different representations of virtual/non-virtual chunks, this absence becomes conspicuous:

-	`ManifestStore`	"virtual" `xarray.Dataset`	Icechunk format	Kerchunk format
external references	`ManifestArray`	`xr.Variable` wrapping `ManifestArray` (a so-called "virtual variable")	virtual chunk ref	kerchunk reference stored in JSON
"inlined" references	❌	`xr.Variable` wrapping (lazy) `np.ndarray` (a so-called "loadable variable")	native chunk ref (also potentially inlined into manifests, but this is an optimization that's an internal implementation detail)	encoded bytes stored in JSON

This causes multiple problems:

We can't roundtrip kerchunk -> ManifestStore -> virtual xarray.Dataset -> kerchunk if the original kerchunk file contains inlined references. (#489)
Similarly it limits the utility of proposed .to_icechunk/.to_kerchunk methods on ManifestStore, as they can't handle all the cases that ds.vz.to_icechunk/kerchunk can. (#591). Withhout this xarray can't really be an optional dependency.
It tempts us to hack around this using a memorystore (see https://github.com/zarr-developers/VirtualiZarr/issues/636), but I think this is the wrong design...

Proposal: Add `numpy.ndarray` as a wrapped type in `ManifestStore`

Currently ManifestStore holds ManifestGroups, which have ._members: Mapping[str, "ManifestArray | ManifestGroup"]. If we generalize this to ._members: Mapping[str, "np.ndarray | ManifestArray | ManifestGroup"] we can store inlined chunks in memory in the ManifestStore as numpy arrays.

We would need to ensure that xr.open_dataset(ManifestStore) could still read these arrays, which might require something clever during ManifestStore.get.

vz.open_virtual_dataset (really just a backwards-compatibility wrapper around ManifestStore.to_virtual_dataset()) also needs to know that any array stored as a numpy array is to become a loadable variable.

This also means parsers effectively would now have the power to decide to create "loadable variables". This actually makes some sense - open_virtual_dataset has a problem in that its' loadable_variables kwarg mixes concerns: sometimes you want the variable to be loaded so that it can be used for xarray indexing, sometimes you want it to be loaded as a storage optimization when you finally serialize the references out to kerchunk/icechunk. It's weird that we have one knob to control two things.

Note that storing as a numpy array directly means no zarr metadata for that variable. But I think that's fine - we don't store zarr metadata for loadable variables inside "virtual" xarray Datasets anyway.

Alternative considered (but I vote against this): Use `obstore.MemoryStore`

In #636 and #794 it was suggested that we store inlined chunks in the ManifestStore by adding obstore.MemoryStore to the ObjectStoreRegistry and referencing the bytes via virtual chunk references of the form memory://{location}. This is neat in that you can read the bytes back directly using obstore and zarr-python.

The reason I don't think this is a good idea is that it breaks the abstraction of the store. Our current design has the very neat property that ObjectStoreRegistry can be any obstore.Store, and we treat all of them identically, no matter their contents. But we would instead be special-casing obstore.MemoryStore compared to other obstore Store types. We also want to be able to treat all icechunk stores identically (note Icechunk has in_memory_storage and supports virtual references beginning with memory:// too). Special-casing memory:// or some namespace within memory:// breaks this abstraction.

Some other problems special-casing MemoryStore would cause:

When calling ManifestStore.to_icechunk on an icechunkstore that supports virtual references pointing at in-memory locations, how would we confidently distinguish a memory:// reference that represents an inlined chunk from a memory:// reference that represents a virtual chunk?
How would we be sure to avoid name collisions inside the memory:// namespace? #794 does this by having the kerchunk parser write inlined data into "memory://kerchunk/", but this seems like a leaky abstraction. That idea also means that the ManifestStore contents depend on the parser they were created by, which is also leaky.

cc @maxrjones @sharkinsspatial @ilan-gold @paraseba

Sep 19 '25 02:09 TomNicholas

I like the idea. I think the implementation simpler with this modification:

-	`ManifestStore`	"virtual" `xarray.Dataset`	Icechunk format	Kerchunk format
external references	`ManifestArray` wrapping `ChunkManifest`	`xr.Variable` wrapping `ManifestArray` (a so-called "virtual variable")	virtual chunk ref	kerchunk reference stored in JSON
"inlined" references	`ManifestArray` wrapping `nd.array`	`xr.Variable` wrapping (lazy) `np.ndarray` (a so-called "loadable variable")	native chunk ref (also potentially inlined into manifests, but this is an optimization that's an internal implementation detail)	encoded bytes stored in JSON

in this case, we'd keep:

ManifestStore._members: Mapping[str, "ManifestArray | ManifestGroup"]

and modify ManifestArray to have:

_manifest: `ChunkManifest` | `np.ndarray`
_metadata: ArrayV3Metadata

This would keep all the objects in a ManifestStore compatible with Zarr's data model and prevent needing to infer array metadata from the ndarray properties inside store.get() at runtime.

Separately, I wonder if we can type this using the Array API rather than locking parsers into using NumPy objects.

Sep 19 '25 11:09 maxrjones

https://github.com/zarr-developers/zarr-extensions/issues/22 is also relevant here (cc @d-v-b if you have thoughts)

Sep 19 '25 15:09 maxrjones

prevent needing to infer array metadata from the ndarray properties inside store.get() at runtime.

I agree this would be nice.

However I don't think this proposal really works

_manifest: ChunkManifest | np.ndarray

because np.ndarray does not have an interchangeable API with ChunkManifest.

We could do something like

_manifest: ChunkManifest | InlinedChunk

where the new InlinedChunk class exposes the same API as ChunkManifest but also wraps a np.ndarray.

One important thing to decide is whether the in-memory representation is meant to be of one chunk or many chunks. In general we have been assuming that:

An "inlined" or "loadable variable" is a single chunk
A "virtual" zarr array consists of either all virtual chunks or all native chunks.

Neither Icechunk nor Kerchunk assume either of these things.

Sep 26 '25 20:09 TomNicholas

Add in-memory arrays to `ManifestStore`

Context

Proposal: Add numpy.ndarray as a wrapped type in ManifestStore

Alternative considered (but I vote against this): Use obstore.MemoryStore

Proposal: Add `numpy.ndarray` as a wrapped type in `ManifestStore`

Alternative considered (but I vote against this): Use `obstore.MemoryStore`