Support inlined Kerchunk data using obstore MemoryStore?

Open TomNicholas opened this issue 6 months ago • 3 comments

What is the path in the ChunkManifest for an inlined variable? Does it start with memory://?

No-one's ever made one, because we never fixed #489 😅 But I think we should, and I think memory:// would make sense as a prefix for that. To make that work I guess we would need the kerchunk parser to know that if it finds inlined data it should put that data into a MemoryStore and then create a chunk reference that refers to it?
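To make the idea concrete, here is a minimal sketch of that parsing step. It is not VirtualiZarr's parser: `split_kerchunk_refs` is a hypothetical helper, a plain dict stands in for the MemoryStore, and the `memory://<key>` path layout is an assumption. It relies only on the Kerchunk convention that inlined data is a string (optionally prefixed with `base64:`) while a byte-range reference is a `[url, offset, length]` list, and it assumes only chunk keys are passed in, not metadata keys such as `.zarray`.

```python
import base64


def split_kerchunk_refs(refs):
    """Separate inlined chunk data from byte-range references.

    Hypothetical helper: returns (memory, manifest), where `memory` maps
    keys to raw bytes and `manifest` maps chunk keys to (path, offset, length).
    """
    memory = {}    # stand-in for an in-memory object store
    manifest = {}  # chunk key -> (path, offset, length)
    for key, value in refs.items():
        if isinstance(value, str):
            # Inlined data: a "base64:" prefix marks base64-encoded bytes,
            # otherwise the string itself is the data.
            if value.startswith("base64:"):
                data = base64.b64decode(value[len("base64:"):])
            else:
                data = value.encode()
            memory[key] = data
            # Assumed layout: the reference points back at the stored bytes.
            manifest[key] = (f"memory://{key}", 0, len(data))
        else:
            # Byte-range reference; assumes the three-element form.
            path, offset, length = value
            manifest[key] = (path, offset, length)
    return memory, manifest


refs = {
    "a/0": "base64:AAEC",
    "b/0": ["s3://bucket/data.nc", 100, 40],
}
memory, manifest = split_kerchunk_refs(refs)
```

After this step every entry in `manifest` has the same shape, so downstream code does not need to know which chunks were originally inlined.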

That would work as a prefix for the ObjectStoreRegistry.

That idea works in the sense that if I manually create and pass ObjectStoreRegistry({"memory://": memory_store}) then I can have the memory_store and optionally additional stores for actually getting referenced chunks.

But it doesn't work in the sense that if I do

from obstore.store import MemoryStore
from virtualizarr.parsers import KerchunkJSONParser

memory_store = MemoryStore()
parser = KerchunkJSONParser(store_registry=None)
parser("refs.json", memory_store)

then the get_store_prefix call occurs before we get anywhere near creating a ChunkManifest, at a point where get_store_prefix has to guess which prefix to use (and guesses incorrectly).

I think the better solution to that would be to move the fs_root logic earlier in the parser, so that the local filepaths for chunks in the kerchunk references are disambiguated before get_store_prefix is called. This is consistent with what I said above: if the kerchunk parser finds inlined data it should create a chunk reference with a memory:// prefix, and if it finds an ambiguous path like data.nc it should prepend fs_root to it. Then get_store_prefix will have enough information to work with.
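A rough sketch of what that earlier disambiguation step might look like. `disambiguate` is a hypothetical function, not part of VirtualiZarr, and treating absolute local paths as `file://` URLs is an assumption made for illustration.

```python
from urllib.parse import urlparse


def disambiguate(path, fs_root):
    """Return a path that always carries a URL scheme.

    Hypothetical helper: paths that already have a scheme (s3://, memory://,
    file://, ...) pass through unchanged, absolute local paths become file://
    URLs, and relative paths are resolved against `fs_root`.
    """
    if urlparse(path).scheme:
        return path
    if path.startswith("/"):
        # Assumption: an absolute local path is unambiguous on its own.
        return "file://" + path
    return fs_root.rstrip("/") + "/" + path


fs_root = "file:///home/user/refs/"
resolved = [
    disambiguate(p, fs_root)
    for p in ["data.nc", "/tmp/data.nc", "s3://bucket/data.nc", "memory://a/0"]
]
```

Once every path has a scheme, whatever picks the store prefix can read it directly from the URL instead of guessing.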

But now I have a working example so I'm tempted to punt on that follow-up.

Originally posted by @TomNicholas in https://github.com/zarr-developers/VirtualiZarr/pull/631#discussion_r2168028921

TomNicholas, Jun 27 '25 14:06

FYI I'm going to work on this feature later this week because it'd be helpful for using ManifestStore with some of the Kerchunk references generated as part of https://github.com/nasa/ASDC_Data_and_User_Services (cc @danielfromearth)

maxrjones, Aug 27 '25 18:08

@maxrjones I think we need to think about what happens when we serialize a memory:// virtual chunk to icechunk/kerchunk.

Currently IIUC such a chunk stays as a ManifestArray upon creation of the xarray virtual dataset, which means on vds.vz.to_icechunk() it gets written using .set_virtual_ref(). But that's a problem, because now we have a memory:// reference in the icechunk store, which is not portable. What we meant to do was write an actual native chunk using .set() (and let icechunk inline those bytes if it wants to).

There's presumably a similar problem when writing to kerchunk?

I think we need to special case memory:// when serializing, and for any memory:// chunks we have to write them individually as native chunks.
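Roughly, the special case could look like the sketch below. The two callbacks are hypothetical stand-ins for "write a native chunk" and "write a virtual reference"; their signatures are assumptions and do not match Icechunk's `.set()` or `.set_virtual_ref()`. A plain dict again stands in for the MemoryStore.

```python
def write_chunks(manifest, memory, set_native, set_virtual_ref):
    """Write each chunk either as native bytes or as a virtual reference.

    `manifest` maps chunk keys to (path, offset, length); `memory` maps
    in-memory keys to bytes. Both callbacks are hypothetical.
    """
    prefix = "memory://"
    for key, (path, offset, length) in manifest.items():
        if path.startswith(prefix):
            # Not portable as a reference, so copy the actual bytes out.
            data = memory[path[len(prefix):]][offset:offset + length]
            set_native(key, data)
        else:
            set_virtual_ref(key, path, offset, length)


# Record what would have been written, instead of touching a real store.
native, virtual = {}, {}
manifest = {
    "0.0": ("memory://a/0", 0, 3),
    "0.1": ("s3://bucket/data.nc", 100, 40),
}
memory = {"a/0": b"\x00\x01\x02"}
write_chunks(
    manifest,
    memory,
    set_native=lambda key, data: native.update({key: data}),
    set_virtual_ref=lambda key, path, off, n: virtual.update({key: (path, off, n)}),
)
```

The decision is made per chunk rather than per array, which is what lets one array mix both kinds of reference.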

TomNicholas, Sep 05 '25 19:09

We also might want ManifestStore.to_virtual_dataset() to automatically create in-memory numpy-array-backed Variables for any ManifestArrays containing only memory:// chunk references. But because an individual ManifestArray can contain a mixture of memory:// and non-memory:// chunk references, this wouldn't solve our serialization problem above in general.
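As a small illustration of the three cases, here is a sketch that classifies an array by its chunk paths. `classify_paths` is a hypothetical helper, and how the paths are extracted from a ManifestArray is left out.

```python
def classify_paths(paths):
    """Classify a collection of chunk paths by where the data lives.

    Hypothetical helper: returns "memory" if every path is a memory://
    reference, "virtual" if none are (or there are no chunks), else "mixed".
    """
    in_memory = [p.startswith("memory://") for p in paths]
    if not in_memory or not any(in_memory):
        return "virtual"
    if all(in_memory):
        return "memory"
    return "mixed"


all_inlined = classify_paths(["memory://a/0", "memory://a/1"])
all_remote = classify_paths(["s3://bucket/data.nc"])
some_of_each = classify_paths(["memory://a/0", "s3://bucket/data.nc"])
```

Only the "memory" case could be loaded eagerly into a numpy-backed Variable; the "mixed" case still needs per-chunk handling at serialization time.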

TomNicholas, Sep 05 '25 19:09