pangeo-forge-recipes
Reference metadata as a precursor for recipes
cf https://github.com/pangeo-forge/staged-recipes/pull/68
The fsspec-reference-maker method for making JSON reference files enables HDF5 files (and other formats) to be loaded with fsspec's ReferenceFileSystem and zarr. You can have one JSON per input file, or merge them into aggregated datasets.
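To make the idea concrete, here is a minimal stdlib-only sketch of what a reference mapping does: each zarr-style key points either at inline content or at a `[path, offset, length]` byte range inside the original file. The helper `read_reference` and the stand-in binary file are hypothetical and for illustration only; this is not fsspec's implementation, just the lookup it conceptually performs.

```python
import json
import os
import tempfile


def read_reference(refs, key):
    """Resolve one key of a reference mapping to raw bytes (hypothetical helper)."""
    entry = refs[key]
    if isinstance(entry, str):
        # Inline content, e.g. small metadata documents
        return entry.encode()
    # Otherwise the entry is a byte range inside some original file
    path, offset, length = entry
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)


# Stand-in for an original data file: a header followed by two 8-byte chunks
with tempfile.NamedTemporaryFile(delete=False, suffix=".bin") as f:
    f.write(b"HEADER--" + b"chunk-00" + b"chunk-01")
    path = f.name

refs = {
    ".zgroup": json.dumps({"zarr_format": 2}),  # inline metadata
    "var/0": [path, 8, 8],                      # first chunk: offset 8, length 8
    "var/1": [path, 16, 8],                     # second chunk: offset 16, length 8
}

chunk0 = read_reference(refs, "var/0")
chunk1 = read_reference(refs, "var/1")
meta = json.loads(read_reference(refs, ".zgroup"))
os.remove(path)
```

Because every chunk read is an independent byte-range request, chunks can be fetched concurrently and the mapping itself is plain JSON, which is where the predictable serialisation comes from.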
Since we have had various problems with h5py (hanging dask workers, probably during garbage collection), reading through reference files instead would allow the rechunking part of a recipe to run smoothly, and also offer good parallelism/concurrent fetching as well as predictable serialisation. Furthermore, the reference files are useful in their own right (if the original data is in a cloud-friendly place and the original chunking is acceptable; or if only a small subset of the data is required).
Points contra:
- the chunkwise view depends on random access in the original files, which may not be possible for some HTTP/FTP servers
- it may be unclear if the merge/concat should happen in the reference creation stage or later in the recipe (I would tend to vote for the former, so we had better make that merging logic rock solid)
- not all possible codecs in HDF5 are understood by zarr; we could implement them as required.
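As a rough illustration of what doing the merge/concat at the reference creation stage could involve, here is a sketch that concatenates a 1-D variable across several per-file reference mappings by renumbering chunk keys and updating the shape in `.zarray`. The function `concat_refs`, the file names `a.h5`/`b.h5`, and the offsets are all made up; it also assumes every input except possibly the last holds a whole number of chunks, which real merging logic could not take for granted.

```python
import json


def concat_refs(ref_list, var):
    """Concatenate 1-D variable `var` along axis 0 (hypothetical sketch).

    Assumes identical chunking and dtype across inputs, and that every
    input except possibly the last contains only full chunks.
    """
    merged = {}
    total_chunks = 0
    total_len = 0
    zarray = None
    for refs in ref_list:
        zarray = json.loads(refs[f"{var}/.zarray"])
        length = zarray["shape"][0]
        chunk = zarray["chunks"][0]
        n_chunks = -(-length // chunk)  # ceiling division
        for i in range(n_chunks):
            # Renumber chunk i of this input to its position in the merged array
            merged[f"{var}/{total_chunks + i}"] = refs[f"{var}/{i}"]
        total_chunks += n_chunks
        total_len += length
    zarray["shape"] = [total_len]
    merged[f"{var}/.zarray"] = json.dumps(zarray)
    return merged


a = {
    "t/.zarray": json.dumps({"shape": [4], "chunks": [2], "dtype": "<i4"}),
    "t/0": ["a.h5", 100, 8],
    "t/1": ["a.h5", 108, 8],
}
b = {
    "t/.zarray": json.dumps({"shape": [2], "chunks": [2], "dtype": "<i4"}),
    "t/0": ["b.h5", 200, 8],
}

merged = concat_refs([a, b], "t")
merged_shape = json.loads(merged["t/.zarray"])["shape"]
```

Even this toy version has to check chunk alignment and metadata consistency, which is the part that would need to be rock solid.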
(er, sorry, this should be in pangeo-forge-recipes ?)
> the chunkwise view depends on random access in the original files, which may not be possible for some HTTP/FTP servers
Martin, do I understand this correctly to mean that using a ReferenceFileSystem approach would replace the cache input execution step? Or perhaps it would optionally replace that step, but if the HTTP/FTP source doesn't permit random access, inputs could be cached first, as a workaround?
Correct, in the general case there would be no need to store files locally; but for bad servers, there is probably no getting around it.
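A sketch of how that choice might be made for HTTP sources: inspect the response headers and fall back to caching when the server does not advertise byte-range support via the standard `Accept-Ranges` header. The function names and the strategy labels are hypothetical, and the check is only a heuristic, since some servers honour range requests without advertising them and FTP would need a different test.

```python
def supports_random_access(headers):
    """Return True if HTTP response headers advertise byte-range support."""
    # Header names are case-insensitive, so normalise before looking up
    normalised = {k.lower(): v for k, v in headers.items()}
    return "bytes" in normalised.get("accept-ranges", "").lower()


def choose_strategy(headers):
    """Pick a hypothetical strategy label for one input source."""
    if supports_random_access(headers):
        return "reference"             # read chunks directly from the source
    return "cache-then-reference"      # copy the file first, then reference the copy


good = choose_strategy({"Accept-Ranges": "bytes"})
bad = choose_strategy({"Accept-Ranges": "none"})
unknown = choose_strategy({})
```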
This now exists 😄
https://github.com/pangeo-forge/pangeo-forge-recipes/blob/9f9ec0eb0e1e21040b88c172097c98a198783adf/pangeo_forge_recipes/transforms.py#L189-L192