SpatioTemporal Asset Catalogs (STAC)
How does this compare to what STAC is trying to achieve?
Or maybe this is a tool which can work together with STAC?
A quick search of your documentation and issues turns up no mention of STAC. Maybe you haven't heard of it? https://stacspec.org (GitHub home of the STAC specs)
@rsignell-usgs and I talked a little about how kerchunk and STAC interact, and we think they are mainly complementary technologies rather than competing ones.
Mainly, kerchunk and STAC work at slightly different levels of the data-access stack. A STAC catalog is more comparable to an intake catalog, while kerchunk might be used as a STAC item or collection asset.
Say you have a STAC collection of daily forecasts; each STAC item might be sourced from a NetCDF file and would have a corresponding NetCDF asset. You might initially use the datacube and xarray extensions to describe the item, collection, and asset, both to aid searching and to give xarray a hint about the dimensions without loading the underlying NetCDFs.
Where STAC and the extensions don't help is when you are only interested in a single variable, region, or slice of time within that NetCDF: xarray will still have to load the file directly, and depending on the NetCDF's structure this may require reading the entire file.
Using the SingleHdf5ToZarr class, you could generate a kerchunk JSON per item with the metadata for accessing subsets of each NetCDF. The kerchunk JSON provides the offset and encoding information that allows xarray (or other tools) to access just the requested subsets of the NetCDFs.
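For concreteness, a minimal sketch of generating one such reference JSON (the bucket, file name, and `anon=True` option are assumptions):

```python
# Sketch: build a kerchunk reference JSON for a single NetCDF/HDF5 file on S3.
# The URL is hypothetical; adjust storage options to your own bucket.
import json

import fsspec
from kerchunk.hdf import SingleHdf5ToZarr

url = "s3://my-bucket/forecast-2021-01-01.nc"
with fsspec.open(url, "rb", anon=True) as f:
    refs = SingleHdf5ToZarr(f, url, inline_threshold=100).translate()

with open("forecast-2021-01-01.json", "w") as out:
    json.dump(refs, out)
```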
To help users who may be querying across STAC item boundaries, the per-item kerchunk JSONs can be aggregated with MultiZarrToZarr, and the result attached as an asset to the parent collection. xarray can then open the aggregated kerchunk JSON from the collection and slice and dice across even more dimensions while reducing both up-front and ongoing data transfer.
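A hedged sketch of that aggregation plus the subsequent xarray open (the file names and the choice of `time` as the concat dimension are assumptions):

```python
# Sketch: combine per-item references along time, then open the result lazily.
import json

import xarray as xr
from kerchunk.combine import MultiZarrToZarr

mzz = MultiZarrToZarr(
    ["forecast-2021-01-01.json", "forecast-2021-01-02.json"],  # per-item refs
    remote_protocol="s3",
    remote_options={"anon": True},
    concat_dims=["time"],  # daily forecasts concatenate along time
)
with open("forecast-combined.json", "w") as out:
    json.dump(mzz.translate(), out)

# xarray reads only the byte ranges needed for the requested slice.
ds = xr.open_dataset(
    "reference://",
    engine="zarr",
    backend_kwargs={
        "consolidated": False,
        "storage_options": {
            "fo": "forecast-combined.json",
            "remote_protocol": "s3",
            "remote_options": {"anon": True},
        },
    },
)
```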
I'll reply specific-to-general :)
A kerchunk'd dataset is essentially an xarray dataset for all of the workflows we have tried so far. It might be made from parts of one input or many inputs, and laid out however the author of the references specified.
A STAC is a set of datasets. Indeed, this is more like Intake, but more domain-specific (so that you can, for example, query against some geometry on the earth), whereas Intake is more general (multiple catalogue sources, data types and query systems). Both support some idea of hierarchy, specification of relationships between contents, and linking.
We have, somewhere, talked about trying to define the dividing line between hierarchical catalogue (STAC, Intake) and hierarchical data format (zarr, hdf), both of which have metadata/attributes; the border is blurry! For the specific case of netCDF-like data, which is the xarray case, the in-format hierarchy usually disappears.
More specifically, though, kerchunk is in principle much more general than hdf->zarr. We can isolate binary blocks in (compressed) CSV/JSON with embedded newlines; we can build directory trees of parquet files to assign attributes to certain partitions. This functionality does not exist yet, but kerchunk should be useful in many places that have nothing to do with "spatio-temporal" data, even if most of the pangeo-related people will not be as interested in those.
OK, thanks for the clarification. It would be great if STAC and kerchunk worked together. They are both quite new technologies, so it might still be easy to align them.
I actually don't know how STAC works internally, but it can open zarr somehow, so maybe it's not hard to get it to use the extra arguments and code required to make kerchunk'd datasets load. On the other hand, we already have Intake (which can also load from STAC, of course), so do you need more? Maybe the answer depends on whether this whole idea will ever get any uptake outside of Python.
As @martindurant said, we already know how to create Intake catalog items for kerchunked datasets (what we're calling ReferenceFileSystem datasets even when they aren't created with kerchunk).
And since we have intake-stac, we just have to figure out what to put in the STAC catalog to make intake-stac create the right intake catalog.
@abkfenris , perhaps you have already done this?
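For context, the usual intake-stac flow looks roughly like this (a sketch; the catalog URL and entry names are hypothetical):

```python
# Sketch: intake-stac turns a STAC catalog into a navigable Intake catalog.
import intake  # requires the intake-stac plugin to be installed

cat = intake.open_stac_catalog("https://example.com/stac/catalog.json")
item = cat["some-item"]      # STAC items become Intake entries
source = item["some-asset"]  # assets map onto Intake drivers
ds = source.to_dask()        # e.g. lazily open a zarr or netcdf asset
```

The open question above is what an asset must contain for this to resolve to a kerchunk'd (reference) dataset.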
I believe that a STAC item can include storage_parameters for zarr (@TomAugspurger?), so kerchunked (reference) datasets may just work already, but first we need to extract the time and positional bounds as metadata. Of course, this makes the entries specific to Python/fsspec, so it's not actually achieving what STAC is supposed to be for, and it may as well have been described specifically for Intake, which we already understand and know works.
Disclaimer that I only started looking at Kerchunk last week, but I agree with @abkfenris that they're complementary.
We're likely to use them together for datasets where the data provider publishes one NetCDF file per (variable, year) pair. We'll have a STAC item for each NetCDF file.
Then we'll use kerchunk to combine those into a single reference filesystem and add a collection-level asset pointing to it. When combined with the xarray-assets extension, the access pattern for the whole combined dataset will be:
```python
import pystac
import xarray as xr

catalog = pystac.read_file("my-catalog")
collection = catalog.get_collection("my-collection")
asset = collection.assets["my-dataset"]  # includes a link to the reference filesystem
ds = xr.open_dataset(asset.href, **asset.properties["xarray:open_kwargs"])
```
A very useful 'reference' discussion for learning what is what.
@TomAugspurger I am also trying to combine Kerchunk and STAC, since we have yearly netCDFs of weather-related variables that we don't want to convert to COGs anymore.
My current workflow consists of:
- Upload the netCDFs to a public S3 bucket to make them accessible
- Create a Kerchunk file for each netCDF (each containing only one variable)
- Create a STAC Collection based on them
- Create one STAC Item for each time range (in this case one year) with as many STAC Assets as there are variables/netCDFs
- The file referenced in each Asset is the Kerchunk JSON file.
Now I'm trying to write code to reload the data from it, along the lines of the sketch below.
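Something like this is what I have in mind (a minimal sketch; the item file name and the "kerchunk" asset key are placeholders):

```python
# Sketch: reload a dataset from the Kerchunk JSON referenced by a STAC Item asset.
import fsspec
import pystac
import xarray as xr

item = pystac.Item.from_file("item-2020.json")
refs_url = item.assets["kerchunk"].href  # points at the Kerchunk JSON

# Build a reference filesystem over the original netCDF bytes on S3.
fs = fsspec.filesystem(
    "reference", fo=refs_url, remote_protocol="s3", remote_options={"anon": True}
)
ds = xr.open_zarr(fs.get_mapper(""), consolidated=False)
```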
I attach here the sample files:
STAC Collection: https://gist.github.com/clausmichele/879e22addeee9e417c4ebe81f03f6773
STAC Items: https://gist.github.com/clausmichele/28efa0007731044db3a7752da2164fe0 and https://gist.github.com/clausmichele/6b78a70ef153c4c841401ec0b7d2b75f
What do you think about it? Would it be better to include the xarray-assets extension to simplify the data loading?
https://github.com/stac-utils/xpystac might be worth considering.
Over time, I've grown a bit hesitant to put too much Python / xarray-specific logic in the assets themselves, mainly because APIs can change over time. xpystac comes at it from the other way: it knows how to load a bunch of types of STAC data into xarray containers.
See https://github.com/stac-utils/xpystac/issues/34 and linked issues for some discussion on Kerchunk specifically.