pangeo-forge-recipes

Reference metadata as a precursor for recipes

Open martindurant opened this issue 4 years ago • 3 comments

cf https://github.com/pangeo-forge/staged-recipes/pull/68

The fsspec-reference-maker method for making JSON reference files enables HDF5 files (and other formats) to be read through fsspec's ReferenceFileSystem and zarr. You can have one JSON file per input file, or merge them into aggregated datasets.
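As a rough illustration of what such a reference set contains, here is a minimal sketch using only the standard library. It assumes the version 1 layout, in which each key maps either to an inline string or to a `[url, offset, length]` triple. The `read_reference` helper, the `input.bin` file and the `temp` variable name are hypothetical stand-ins; in practice fsspec's ReferenceFileSystem does the resolving, and the ranges point into real HDF5 files on remote storage.

```python
import json
import os
import tempfile


def read_reference(refs, key):
    """Resolve one key of a version-1-style reference set to raw bytes."""
    entry = refs["refs"][key]
    if isinstance(entry, str):
        # Inline value stored directly in the JSON (e.g. zarr metadata).
        # Binary inline values (base64-prefixed) are not handled in this sketch.
        return entry.encode()
    path, offset, length = entry
    # Byte-range value: read only the referenced slice of the original file.
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)


# Hypothetical stand-in for an input file: a header followed by two raw chunks.
tmpdir = tempfile.mkdtemp()
data_path = os.path.join(tmpdir, "input.bin")
with open(data_path, "wb") as f:
    f.write(b"HEADER--" + b"AAAA" + b"BBBB")

refs = {
    "version": 1,
    "refs": {
        ".zgroup": json.dumps({"zarr_format": 2}),
        "temp/0": [data_path, 8, 4],   # chunk 0 lives at bytes 8..11
        "temp/1": [data_path, 12, 4],  # chunk 1 lives at bytes 12..15
    },
}

chunk0 = read_reference(refs, "temp/0")
chunk1 = read_reference(refs, "temp/1")
```

The point of the design is that the original file is never rewritten: zarr sees chunk keys, and each key is served by a byte-range read against the source.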

Since we have had various problems with h5py (hanging dask workers, probably during garbage collection), reading through reference files instead would allow the rechunking part of a recipe to run smoothly, and would also offer good parallelism and concurrent fetching as well as predictable serialisation. Furthermore, the reference files are useful in their own right, either when the original data is in a cloud-friendly place and the original chunking is acceptable, or when only a small subset of the data is required.

Points contra:

  • the chunkwise view depends on random access in the original files, which may not be possible for some HTTP/FTP servers
  • it may be unclear if the merge/concat should happen in the reference creation stage or later in the recipe (I would tend to vote for the former, so we had better make that merging logic rock solid)
  • not all possible codecs in HDF5 are understood by zarr; we could implement them as required.
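To make the merge/concat question more concrete, here is a simplified sketch of concatenating per-file references along the first axis at the reference creation stage. It is not the fsspec-reference-maker merging logic, only an illustration under strong assumptions: every input has identical chunking and dtype, each input's length along axis 0 is a whole number of chunks, and at least one input is given. The function name, the `temp` variable and the example.com URLs are all hypothetical.

```python
import json


def concat_along_first_axis(ref_sets, var):
    """Merge per-file reference dicts for `var` along axis 0.

    Each element of `ref_sets` maps zarr keys to inline strings or
    [url, offset, length] triples. Chunk keys are renumbered so that the
    chunks of later files follow those of earlier files.
    """
    merged = {}
    total_len = 0
    chunk_offset = 0
    meta = None
    for refs in ref_sets:
        meta = json.loads(refs[f"{var}/.zarray"])
        n_chunks = meta["shape"][0] // meta["chunks"][0]
        for key, value in refs.items():
            name, _, chunk_id = key.rpartition("/")
            if name != var or chunk_id.startswith("."):
                continue  # skip other variables and metadata keys
            first, *rest = chunk_id.split(".")
            new_id = ".".join([str(int(first) + chunk_offset), *rest])
            merged[f"{var}/{new_id}"] = value
        total_len += meta["shape"][0]
        chunk_offset += n_chunks
    # Reuse the last file's metadata, with the concatenated length on axis 0.
    meta["shape"][0] = total_len
    merged[f"{var}/.zarray"] = json.dumps(meta)
    return merged


def file_refs(url):
    # Hypothetical per-file references: a (2, 3) float32 array in two chunks.
    meta = {"shape": [2, 3], "chunks": [1, 3], "dtype": "<f4"}
    return {
        "temp/.zarray": json.dumps(meta),
        "temp/0.0": [url, 100, 12],
        "temp/1.0": [url, 112, 12],
    }


merged = concat_along_first_axis(
    [file_refs("https://example.com/a.h5"), file_refs("https://example.com/b.h5")],
    "temp",
)
```

Even this toy version shows where the fragility lies: any mismatch in chunk shape, dtype or codec between inputs would have to be detected before the keys are renumbered.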

martindurant, Aug 12 '21 20:08

(er, sorry, this should be in pangeo-forge-recipes ?)

martindurant, Aug 12 '21 21:08

> the chunkwise view depends on random access in the original files, which may not be possible for some HTTP/FTP servers

Martin, do I understand this correctly to mean that using a ReferenceFileSystem approach would replace the cache input execution step? Or perhaps it would optionally replace that step, but if the HTTP/FTP source doesn't permit random access, inputs could be cached first, as a workaround?

cisaacstern, Aug 13 '21 15:08

Correct, in the general case there would be no need to store files locally; but for bad servers, there is probably no getting around it.
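The fallback described here could be sketched roughly as follows. This is only an assumption about how a recipe might decide, not anything in pangeo-forge-recipes: the function names and the `"reference"` / `"cache"` labels are hypothetical. It relies on the standard HTTP `Accept-Ranges` response header, which is advisory (some servers honour range requests without advertising them), and it treats any source without that header, FTP included, as needing a local cache.

```python
def supports_byte_ranges(headers):
    """Return True if response headers advertise HTTP byte-range support."""
    normalized = {k.lower(): v for k, v in headers.items()}
    return normalized.get("accept-ranges", "").strip().lower() == "bytes"


def plan_input(url, headers):
    """Choose how to handle one input: reference it in place or cache it first."""
    if supports_byte_ranges(headers):
        # Random access is available, so chunks can be fetched directly.
        return ("reference", url)
    # No advertised range support: copy the whole file before referencing it.
    return ("cache", url)


good = plan_input("https://example.com/a.h5", {"Accept-Ranges": "bytes"})
bad = plan_input("https://example.com/b.h5", {"Accept-Ranges": "none"})
unknown = plan_input("ftp://example.com/c.h5", {})
```

A stricter check would issue a small ranged request and confirm a 206 response, at the cost of one extra round trip per server.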

martindurant, Aug 13 '21 15:08

This now exists 😄

https://github.com/pangeo-forge/pangeo-forge-recipes/blob/9f9ec0eb0e1e21040b88c172097c98a198783adf/pangeo_forge_recipes/transforms.py#L189-L192

cisaacstern, Aug 24 '23 23:08