kerchunk
Dataclass for "VirtualZarrStore"
Problem
Kerchunk user code currently passes around an obscure multiply-nested "reference dict" object. This is hard to read, interrogate, validate, or reason about.
Suggestion
Instead, create a new `VirtualZarrStore` dataclass, which contains all the same information that is currently stored in the reference dict, but in a more structured manner. This would then be the principal object that gets passed around between user calls to the kerchunk API.
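To make the suggestion concrete, here is a minimal sketch of what such a dataclass might look like. All names and fields (`ChunkEntry`, `VirtualZarrArray`, `to_dict`, etc.) are illustrative assumptions, not existing kerchunk API; the chunk-key layout mirrors the current `{"version": 1, "refs": {...}}` reference-dict format.

```python
# Hypothetical sketch -- names and fields are illustrative, not kerchunk API.
from dataclasses import dataclass, field


@dataclass
class ChunkEntry:
    """Where to find one chunk: a file (or URL), byte offset, and length."""
    path: str
    offset: int
    length: int


@dataclass
class VirtualZarrArray:
    """One zarr array: its metadata plus a manifest of chunk locations."""
    zarray: dict                      # contents of the .zarray metadata
    zattrs: dict = field(default_factory=dict)
    # chunk key (e.g. "0.0") -> location of those bytes
    chunks: dict[str, ChunkEntry] = field(default_factory=dict)


@dataclass
class VirtualZarrStore:
    """Structured replacement for the nested reference dict."""
    zattrs: dict = field(default_factory=dict)
    arrays: dict[str, VirtualZarrArray] = field(default_factory=dict)

    def to_dict(self) -> dict:
        """Flatten back to the current kerchunk reference-dict layout."""
        refs: dict = {".zattrs": self.zattrs}
        for name, arr in self.arrays.items():
            refs[f"{name}/.zarray"] = arr.zarray
            refs[f"{name}/.zattrs"] = arr.zattrs
            for key, chunk in arr.chunks.items():
                refs[f"{name}/{key}"] = [chunk.path, chunk.offset, chunk.length]
        return {"version": 1, "refs": refs}
```

A `from_dict` classmethod going the other way would be the backwards-compatibility bridge discussed below.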
Advantages
- Easier to read and interrogate than multiply-nested dicts
- Allows direct validation
- Serializes in obvious ways (via `.to_json`, `.to_parquet`, `.to_dict`, or similar)
- Easier to write tests, by using fixtures to generate `VirtualZarrStore` objects
- Concentrates concerns over changes/enhancements to the Zarr spec in one class
- A v2->v3 converter could act directly on these objects
- Possibly easier to understand whenever anyone reimplements kerchunk in other languages?
Implementation ideas
- Implementation could subclass the Zarr Object Model classes (where `.to_json` is analogous to the ZOM's `.serialize`), which would then be solidified as the recommended abstract representation once ZEP006 is accepted
- Can't use a bare ZOM class because we need to add some extra attributes for byte ranges etc. However, information on where to find chunks is essentially a "Chunk Manifest", a generalizable idea that @jhamman has also been working on (for a nascent ZEP007??)
- Attributes of this dataclass need to always be serializable, so the `VirtualZarrStore` should basically be a JSON schema (see #373)
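The "always serializable" constraint can be stated as a simple invariant: the store's dict form must survive a round trip through JSON unchanged. A minimal sketch, using a stripped-down stand-in for the proposed class (the `to_json`/`from_json` names are assumptions, not existing API):

```python
# Sketch of the "always serializable" constraint: round-trip the store
# through JSON and require that nothing is lost. VirtualZarrStore here is
# a simplified stand-in, not real kerchunk API.
import json
from dataclasses import asdict, dataclass, field


@dataclass
class VirtualZarrStore:
    zattrs: dict = field(default_factory=dict)
    refs: dict = field(default_factory=dict)

    def to_json(self) -> str:
        # asdict() fails loudly if a field holds something non-serializable,
        # which is exactly the validation we want this class to enforce.
        return json.dumps(asdict(self))

    @classmethod
    def from_json(cls, s: str) -> "VirtualZarrStore":
        return cls(**json.loads(s))


store = VirtualZarrStore(zattrs={"title": "example"},
                         refs={"x/0": ["file.nc", 0, 100]})
restored = VirtualZarrStore.from_json(store.to_json())
```

Publishing the corresponding JSON schema (per #373) would then let other tools validate reference sets without importing kerchunk at all.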
Questions
- Is it possible to do this in a broadly backwards-compatible manner?
I will want to spend some time thinking about this.
There are two objections that immediately come to mind:
- most operations within kerchunk work on the content of keys, so they will always be working at the dict level to directly set values. The mapper and zarr views necessarily prevent this.
- during combine, we now support writing directly to parquet. The interface is still store-like, but the access pattern is very different; so it's not a case of "build the dicts, then serialise to parquet", but "serialise to parquet on the fly" (in order to save memory).
So maybe it could be the other way around: the reference sets (dict-like stores) acquire `.to_zarr` and `.to_mapper` methods which use the information they already contain, but the primary representation is still dicts.
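This inverted option could be sketched as a thin dict subclass. `ReferenceSet` and `list_arrays` are hypothetical names; the `fsspec.filesystem("reference", fo=...)` call in `to_mapper` is the mechanism kerchunk reference dicts are already opened with today, imported lazily so the class itself has no hard dependency:

```python
# Sketch of the "other way around" option: keep the plain reference dict as
# the primary representation, but wrap it in a dict subclass that adds
# convenience views. Class and method names are illustrative, not kerchunk API.


class ReferenceSet(dict):
    """A kerchunk-style reference dict: {"version": 1, "refs": {...}}."""

    def list_arrays(self) -> list[str]:
        """Names of arrays, inferred from '<name>/.zarray' keys."""
        suffix = "/.zarray"
        return sorted(k[: -len(suffix)]
                      for k in self["refs"] if k.endswith(suffix))

    def to_mapper(self):
        """Open as a key-value mapper via fsspec's reference filesystem."""
        import fsspec  # deferred import; requires fsspec to be installed
        return fsspec.filesystem("reference", fo=dict(self)).get_mapper("")


refs = ReferenceSet({
    "version": 1,
    "refs": {"temp/.zarray": "{}", "temp/0.0": ["f.nc", 0, 4]},
})
```

Because the subclass *is* a dict, all existing key-level code keeps working unchanged, which is the backwards-compatibility property the objections above are concerned with.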