
Refactor file format backend openers

Open TomNicholas opened this issue 1 year ago • 1 comments

Problem

The API for Kerchunk's file format backend openers doesn't follow a consistent pattern.

Suggestion

Change the openers to each be a function returning a VirtualZarrStore (see #375), with standardized keyword arguments.

Advantages

  • Neater
  • Can standardize any common keyword arguments
  • Can do validation within file opening or immediately afterwards
  • Would make it more obvious how to add a new opener for a new file type
  • Could also allow implementing a general file opener (as is done in pangeo-forge)

Implementation ideas

  • Perhaps each opener should inherit from a single abstract base class?
  • Should there be some arguments that are valid for every backend (e.g. inline_threshold), and others that are specific to particular backends?
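
A minimal sketch of how these two ideas could fit together, assuming a shared base class that owns the common keyword arguments and validation. Every name here (`BaseOpener`, `DummyHDF5Opener`, `_scan`, the placeholder `VirtualZarrStore` dataclass, and the default value of `inline_threshold`) is hypothetical and only illustrates the shape of the proposal, not an existing kerchunk API:

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Any


@dataclass
class VirtualZarrStore:
    """Hypothetical stand-in for the store proposed in #375."""

    refs: dict[str, Any] = field(default_factory=dict)


class BaseOpener(ABC):
    """Hypothetical common interface shared by every file format opener."""

    def __init__(self, url: str, inline_threshold: int = 100, **backend_kwargs: Any):
        # Arguments valid for every backend are validated once, here.
        if inline_threshold < 0:
            raise ValueError("inline_threshold must be non-negative")
        self.url = url
        self.inline_threshold = inline_threshold
        # Anything else is passed through to the specific backend.
        self.backend_kwargs = backend_kwargs

    @abstractmethod
    def _scan(self) -> dict[str, Any]:
        """Backend-specific step: return a mapping of key -> reference."""

    def open(self) -> VirtualZarrStore:
        refs = self._scan()
        # Shared validation happens immediately after file opening.
        if ".zgroup" not in refs:
            raise ValueError(f"no root group metadata found for {self.url!r}")
        return VirtualZarrStore(refs=refs)


class DummyHDF5Opener(BaseOpener):
    """Toy backend that returns fixed references instead of reading a file."""

    def _scan(self) -> dict[str, Any]:
        return {".zgroup": '{"zarr_format": 2}'}


store = DummyHDF5Opener("file.h5", inline_threshold=50).open()
```

With this layout, adding an opener for a new file type means subclassing and implementing one method, while the common arguments and checks stay in a single place.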

Questions

How to handle GRIB files? Combine before returning? Return as a hierarchy of multiple groups within a single store (like when opening with datatree)? Or return as a list of VirtualZarrStores?

TomNicholas avatar Oct 16 '23 20:10 TomNicholas

I would first point out that there is a little bit of consistency injected via classes that call functions, e.g., kerchunk.grib2.GribToZarr is a class designed to feel similar to kerchunk.hdf.SingleHdf5ToZarr.

A general file dispatch system seems reasonable, possibly something that belongs in Intake 2 (which already tries to guess file types by URL pattern matching or reading magic bytes). We probably don't want to replicate work in pangeo-forge, though?
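
As a rough illustration of dispatch on magic bytes, the sketch below guesses a format from the first bytes of a file. The function name `guess_format` and the returned labels are hypothetical; the signatures are the standard ones for HDF5, GRIB, and classic netCDF3. It only checks offset 0, so HDF5 files with a user block before the superblock would not be detected:

```python
# Well-known leading byte signatures for each format.
HDF5_MAGIC = b"\x89HDF\r\n\x1a\n"
GRIB_MAGIC = b"GRIB"
NETCDF3_MAGICS = (b"CDF\x01", b"CDF\x02")


def guess_format(header: bytes) -> str:
    """Return a hypothetical format label based on the first bytes of a file."""
    if header.startswith(HDF5_MAGIC):
        return "hdf5"
    if header.startswith(GRIB_MAGIC):
        return "grib"
    if header.startswith(NETCDF3_MAGICS):
        return "netcdf3"
    raise ValueError("unrecognised file format")


def guess_format_from_path(path: str) -> str:
    """Read just enough of a local file to identify it."""
    with open(path, "rb") as f:
        return guess_format(f.read(8))
```

A dispatcher could then map each label to the matching opener, falling back to URL pattern matching when the bytes are inconclusive.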

Should there be some arguments that are valid for every backend (e.g. inline_threshold), and others that are specific to particular backends?

There are definitely operations that will be the same for all backends, like inlining.

On virtual zarrs, this sounds like something between https://github.com/nsidc/earthaccess/pull/278 and a special xarray engine="scan-kerchunk". The trouble, as with everything kerchunk, is that there are many options (such as what to do with gribs...) and it becomes hard to specify them all in a reasonable way. Not all of kerchunk will be xarray friendly (and maybe not even zarr).

  • Do we need to be strict about the steps taken to make reference sets, or will this always be ad hoc given the heterogeneity out there? Fixing the steps is what pangeo-forge recipes do, and an intake pipeline could do the same, but there are tradeoffs.

martindurant avatar Oct 18 '23 14:10 martindurant