kerchunk
Refactor file format backend openers
Problem
The API for Kerchunk's file format backend openers doesn't follow a consistent pattern.
Suggestion
Change the openers to each be a function returning a VirtualZarrStore (see #375), with standardized keyword arguments.
Advantages
- Neater
- Can standardize any common keyword arguments
- Can do validation within file opening or immediately afterwards
- Would make it more obvious how to add a new opener for a new file type
- Could also allow implementing a general file opener (as is done in pangeo-forge)
Implementation ideas
- Perhaps each opener should inherit from a single abstract base class?
- Should there be some arguments that are valid for every backend (e.g. inline_threshold), and others that are specific to particular backends?
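To make the two ideas above concrete, here is a minimal sketch of what a shared base class with common keyword arguments might look like. Everything in it is hypothetical: BaseOpener, ToyOpener, the open() method and the stand-in VirtualZarrStore dataclass are illustrative names, not existing Kerchunk API, and the default inline_threshold value is arbitrary.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class VirtualZarrStore:
    """Hypothetical stand-in for the store type discussed in #375."""
    refs: Dict[str, Any] = field(default_factory=dict)


class BaseOpener(ABC):
    """Hypothetical base class holding the arguments shared by every backend."""

    def __init__(self, url: str, inline_threshold: int = 500,
                 storage_options: Optional[Dict[str, Any]] = None):
        # Validation happens once, here, rather than in each backend.
        if inline_threshold < 0:
            raise ValueError("inline_threshold must be non-negative")
        self.url = url
        self.inline_threshold = inline_threshold
        self.storage_options = storage_options or {}

    @abstractmethod
    def open(self, **backend_kwargs: Any) -> VirtualZarrStore:
        """Scan the file and return its references.

        Backend-specific options are passed as keyword arguments.
        """


class ToyOpener(BaseOpener):
    """Toy backend: does no real I/O, only shows the shape of a subclass."""

    def open(self, **backend_kwargs: Any) -> VirtualZarrStore:
        # A real backend would parse the file at self.url here.
        return VirtualZarrStore(refs={".zgroup": '{"zarr_format": 2}'})
```

With this layout, adding a new file type means subclassing and implementing one method, while arguments such as inline_threshold are declared and validated in a single place.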
Questions
How to handle GRIB files? Combine before returning? Return as a hierarchy of multiple groups within a single store (like when opening with datatree)? Or return as a list of VirtualZarrStores?
I would first point out that there is a little bit of consistency injected via classes that call functions, e.g., kerchunk.grib2.GribToZarr is a class designed to feel similar to kerchunk.hdf.SingleHdf5ToZarr.
A general file dispatch system seems reasonable, possibly something that belongs in Intake 2 (which already tries to guess file types by URL pattern matching or reading magic bytes). We probably don't want to replicate work in pangeo-forge, though?
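As a rough illustration of dispatch by magic bytes, the sketch below maps a file's leading bytes to a format name. It is not how Intake 2 or pangeo-forge do it; guess_format and MAGIC_BYTES are hypothetical names, and only signatures at byte offset 0 are checked (an HDF5 signature can legally sit at a later offset, which this ignores).

```python
from typing import Optional

# Well-known file signatures, checked at byte offset 0 only.
MAGIC_BYTES = [
    (b"\x89HDF\r\n\x1a\n", "hdf5"),
    (b"GRIB", "grib"),
    (b"CDF\x01", "netcdf3"),   # classic format
    (b"CDF\x02", "netcdf3"),   # 64-bit offset format
    (b"II*\x00", "tiff"),      # little-endian TIFF
    (b"MM\x00*", "tiff"),      # big-endian TIFF
]


def guess_format(header: bytes) -> Optional[str]:
    """Return a format name for the given leading bytes, or None if unknown."""
    for magic, name in MAGIC_BYTES:
        if header.startswith(magic):
            return name
    return None
```

A general opener could read the first few bytes of a file, call guess_format, and hand over to the matching backend, falling back to an explicit user choice when the result is None.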
Should there be some arguments that are valid for every backend (e.g. inline_threshold), and others that are specific to particular backends?
There are definitely operations that will be the same for all backends, like inlining.
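For example, inlining could be written once and applied to the output of any backend. The sketch below assumes references are stored as a mapping from key to either inline data or a [url, offset, length] list; inline_small_chunks and the read_range callable are hypothetical names for illustration, and Kerchunk's own inlining logic is not reproduced here.

```python
from typing import Any, Callable, Dict


def inline_small_chunks(
    refs: Dict[str, Any],
    read_range: Callable[[str, int, int], bytes],
    inline_threshold: int = 100,
) -> Dict[str, Any]:
    """Return a copy of refs with small chunk references replaced by their bytes.

    read_range(url, offset, length) is assumed to fetch the given byte range.
    """
    out: Dict[str, Any] = {}
    for key, ref in refs.items():
        is_pointer = isinstance(ref, list) and len(ref) == 3
        if is_pointer and ref[2] < inline_threshold:
            url, offset, length = ref
            # Small chunk: store the data itself instead of a pointer to it.
            out[key] = read_range(url, offset, length)
        else:
            # Already inline, or too large to inline: keep unchanged.
            out[key] = ref
    return out
```

Because the function only looks at the reference mapping, it does not need to know whether the references came from an HDF5, GRIB or other scan.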
On virtual zarrs, this sounds like something between https://github.com/nsidc/earthaccess/pull/278 and a special xarray engine="scan-kerchunk". The trouble, as with everything kerchunk, is that there are many options (such as what to do with gribs...) and it becomes hard to specify them all in a reasonable way. Not all of kerchunk will be xarray friendly (and maybe not even zarr).
- Do we need to be strict about the steps taken to make reference sets, or will this always be ad hoc given the heterogeneity out there? This is what pangeo-forge recipes do, or an intake pipeline could, but there are tradeoffs.