pangeo-forge-recipes

`WriteCombinedReference` should emit a `zarr.storage.FSStore` (like `StoreToZarr`)

Open cisaacstern opened this issue 11 months ago • 4 comments

StoreToZarr emits a singleton PCollection containing a zarr.storage.FSStore. WriteCombinedReference should as well.

This is very useful for designing pipelines that do something with the data once it's written, such as:

  • Validate it with some tests
  • Catalog it somewhere
  • etc.
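As a rough illustration of the "validate" step, a downstream function could treat the emitted store as a plain key-value mapping and check for zarr v2 group metadata. This is only a sketch: `validate_store` is a hypothetical name, not part of pangeo-forge-recipes, and the dict below merely stands in for a real store.

```python
import json
from collections.abc import Mapping


def validate_store(store: Mapping) -> Mapping:
    """Hypothetical check: the store must look like a zarr v2 group."""
    if ".zgroup" not in store:
        raise ValueError("store has no '.zgroup' key")
    # Group metadata is stored as JSON (bytes or str); json.loads accepts both.
    meta = json.loads(store[".zgroup"])
    if meta.get("zarr_format") != 2:
        raise ValueError(f"unexpected group metadata: {meta!r}")
    # Return the store unchanged so further steps can be chained after this one.
    return store


# A plain dict stands in for the emitted store in this sketch.
fake_store = {".zgroup": b'{"zarr_format": 2}'}
validated = validate_store(fake_store)
```

Because the function returns the store it receives, it could in principle be mapped over the singleton PCollection and followed by a cataloging step.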

In #590 I rely on this feature of StoreToZarr for integration testing, so until WriteCombinedReference emits a store as well, I can't integration-test kerchunk stores in the same manner.

cisaacstern avatar Sep 08 '23 06:09 cisaacstern

This should be as simple as returning a zarr.storage.FSStore from this function

https://github.com/pangeo-forge/pangeo-forge-recipes/blob/f094b024716485956d81b38c730aee2eaf6ee7da/pangeo_forge_recipes/writers.py#L95-L99

which is what WriteCombinedReference calls into here

https://github.com/pangeo-forge/pangeo-forge-recipes/blob/f094b024716485956d81b38c730aee2eaf6ee7da/pangeo_forge_recipes/transforms.py#L464-L465

cisaacstern avatar Sep 08 '23 06:09 cisaacstern

I think this was addressed in the Parquet Kerchunk option PR: https://github.com/pangeo-forge/pangeo-forge-recipes/pull/620/files. On main, write_combined_reference now returns an fsspec target.

norlandrhagen avatar Oct 02 '23 16:10 norlandrhagen

@norlandrhagen, looks like we're quite close but not all the way there: full_target is a pangeo_forge_recipes.storage.FSSpecTarget, whereas the aim of this issue is to return a zarr.storage.FSStore. In practical terms, what we're aiming for is to be able to pass this store value directly into xr.open_dataset, i.e.

https://github.com/pangeo-forge/pangeo-forge-recipes/blob/7e34030e6ea58d389641f7fa15dfb4e34be9467d/examples/feedstock/noaa_oisst.py#L24-L32

Naively, I initially hoped that simply returning full_target.get_mapper() would give us the return type we want, but I'm pretty sure this won't work. The object returned by full_target.get_mapper() is a mapper to the directory in which the kerchunk references are stored, not a mapper to the virtualized zarr filesystem that the references represent. To get the latter, I think we need to add a bit of logic (probably just into the body of write_combined_reference) that does the sort of thing you illustrate in your Pythia example:

https://projectpythia.org/kerchunk-cookbook/notebooks/case_studies/GRIB2_HRRR.html#load-kerchunked-dataset

What we want to return from write_combined_reference is the equivalent of the variable m in the Pythia code snippet linked above. Does that make sense?
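For readers without the notebook open, here is a minimal sketch of that pattern using fsspec's "reference" filesystem. The reference set is a made-up, in-memory example so that nothing touches the network or disk. A real recipe would instead point `fo` at the written reference file and would likely need `remote_protocol` and `remote_options`, which are not shown here.

```python
import json

import fsspec

# Hypothetical, tiny kerchunk-style reference set (version 1 layout).
# Real references would also contain array metadata and chunk pointers.
refs = {
    "version": 1,
    "refs": {
        ".zgroup": json.dumps({"zarr_format": 2}),
    },
}

# The "reference" filesystem exposes the references as a virtual file tree.
fs = fsspec.filesystem("reference", fo=refs)

# Mapper rooted at the virtual zarr store: the analogue of `m` in the
# Pythia snippet, as opposed to a mapper over the directory holding the refs.
m = fs.get_mapper("")

group_meta = json.loads(m[".zgroup"])
print(group_meta)
```

In the Pythia example, a mapper like `m` is what gets passed to xr.open_dataset with the zarr engine. Whether it should additionally be wrapped in a zarr.storage.FSStore to match StoreToZarr exactly is left open here.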

cisaacstern avatar Oct 02 '23 16:10 cisaacstern

Ah totally, that makes sense!

norlandrhagen avatar Oct 02 '23 18:10 norlandrhagen