kerchunk
Dataclass for "VirtualZarrStore"
Problem
Kerchunk user code currently passes around an obscure multiply-nested "reference dict" object. This is hard to read, interrogate, validate, or reason about.
Suggestion
Instead, create a new `VirtualZarrStore` dataclass, which contains all the same information that is currently stored in the reference dict, but in a more structured manner. This would then be the principal object that gets passed around between user calls to the kerchunk API.
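To make the suggestion concrete, here is a minimal sketch of what such a dataclass might look like. All names and fields (`ChunkEntry`, `VirtualZarrArray`, `to_dict`, etc.) are illustrative assumptions, not existing kerchunk API; the chunk-key layout mirrors the current `{"version": 1, "refs": {...}}` reference-dict format.

```python
# Hypothetical sketch -- names and fields are illustrative, not kerchunk API.
from dataclasses import dataclass, field


@dataclass
class ChunkEntry:
    """Where to find one chunk: a file (or URL), byte offset, and length."""
    path: str
    offset: int
    length: int


@dataclass
class VirtualZarrArray:
    """One zarr array: its metadata plus a manifest of chunk locations."""
    zarray: dict                      # contents of the .zarray metadata
    zattrs: dict = field(default_factory=dict)
    # chunk key (e.g. "0.0") -> location of those bytes
    chunks: dict[str, ChunkEntry] = field(default_factory=dict)


@dataclass
class VirtualZarrStore:
    """Structured replacement for the nested reference dict."""
    zattrs: dict = field(default_factory=dict)
    arrays: dict[str, VirtualZarrArray] = field(default_factory=dict)

    def to_dict(self) -> dict:
        """Flatten back to the current kerchunk reference-dict layout."""
        refs: dict = {".zattrs": self.zattrs}
        for name, arr in self.arrays.items():
            refs[f"{name}/.zarray"] = arr.zarray
            refs[f"{name}/.zattrs"] = arr.zattrs
            for key, chunk in arr.chunks.items():
                refs[f"{name}/{key}"] = [chunk.path, chunk.offset, chunk.length]
        return {"version": 1, "refs": refs}
```

A `from_dict` classmethod going the other way would be the backwards-compatibility bridge discussed below.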
Advantages
- Easier to read and interrogate than multiply-nested dicts
- Allows direct validation
- Serializes in obvious ways (via `.to_json`, `.to_parquet`, `.to_dict`, or similar)
- Easier to write tests, by using fixtures to generate `VirtualZarrStore` objects
- Concentrates concerns over changes/enhancements to the Zarr spec in one class
- A v2->v3 converter could act directly on these objects
- Possibly easier to understand whenever anyone reimplements kerchunk in other languages?
Implementation ideas
- Implementation could subclass the Zarr Object Model classes (where `.to_json` is analogous to the ZOM's `.serialize`), which would then be solidified as the recommended abstract representation once ZEP006 is accepted
- Can't use a bare ZOM class because we need to add some extra attributes for byte ranges etc. However, information on where to find chunks is essentially a "Chunk Manifest", a generalizable idea that @jhamman has also been working on (for a nascent ZEP007??)
- Attributes of this dataclass need to always be serializable, so the `VirtualZarrStore` should basically be a JSON schema (see #373)
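The "always serializable" constraint can be stated as a simple invariant: the store's dict form must survive a round trip through JSON unchanged. A minimal sketch, using a stripped-down stand-in for the proposed class (the `to_json`/`from_json` names are assumptions, not existing API):

```python
# Sketch of the "always serializable" constraint: round-trip the store
# through JSON and require that nothing is lost. VirtualZarrStore here is
# a simplified stand-in, not real kerchunk API.
import json
from dataclasses import asdict, dataclass, field


@dataclass
class VirtualZarrStore:
    zattrs: dict = field(default_factory=dict)
    refs: dict = field(default_factory=dict)

    def to_json(self) -> str:
        # asdict() fails loudly if a field holds something non-serializable,
        # which is exactly the validation we want this class to enforce.
        return json.dumps(asdict(self))

    @classmethod
    def from_json(cls, s: str) -> "VirtualZarrStore":
        return cls(**json.loads(s))


store = VirtualZarrStore(zattrs={"title": "example"},
                         refs={"x/0": ["file.nc", 0, 100]})
restored = VirtualZarrStore.from_json(store.to_json())
```

Publishing the corresponding JSON schema (per #373) would then let other tools validate reference sets without importing kerchunk at all.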
Questions
- Is it possible to do this in a broadly backwards-compatible manner?
I will want to spend some time thinking about this.
There are two objections that immediately come to mind:
- most operations within kerchunk work on the content of keys, so they will always be working at the dict level to directly set values. The mapper and zarr views necessarily prevent this.
- during combine, we now support writing directly to parquet. The interface is still store-like, but the access pattern is very different; so it's not a case of "build the dicts, then serialise to parquet", but "serialise to parquet on the fly" (in order to save memory).
So maybe it could be the other way around: the reference sets (dict-like stores) acquire `.to_zarr` and `.to_mapper` methods which use the information they already contain, but the primary representation is still dicts.
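This inverted option could be sketched as a thin dict subclass. `ReferenceSet` and `list_arrays` are hypothetical names; the `fsspec.filesystem("reference", fo=...)` call in `to_mapper` is the mechanism kerchunk reference dicts are already opened with today, imported lazily so the class itself has no hard dependency:

```python
# Sketch of the "other way around" option: keep the plain reference dict as
# the primary representation, but wrap it in a dict subclass that adds
# convenience views. Class and method names are illustrative, not kerchunk API.


class ReferenceSet(dict):
    """A kerchunk-style reference dict: {"version": 1, "refs": {...}}."""

    def list_arrays(self) -> list[str]:
        """Names of arrays, inferred from '<name>/.zarray' keys."""
        suffix = "/.zarray"
        return sorted(k[: -len(suffix)]
                      for k in self["refs"] if k.endswith(suffix))

    def to_mapper(self):
        """Open as a key-value mapper via fsspec's reference filesystem."""
        import fsspec  # deferred import; requires fsspec to be installed
        return fsspec.filesystem("reference", fo=dict(self)).get_mapper("")


refs = ReferenceSet({
    "version": 1,
    "refs": {"temp/.zarray": "{}", "temp/0.0": ["f.nc", 0, 4]},
})
```

Because the subclass *is* a dict, all existing key-level code keeps working unchanged, which is the backwards-compatibility property the objections above are concerned with.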