pangeo-forge-recipes
Pass filepaths to `MultiZarrToZarr`
`kerchunk.MultiZarrToZarr` allows passing a function to `coo_map`, which receives, among other arguments, the filepath / url of the target file (but only if that information is available; otherwise it is set to `None`). This is useful if some information is only available in the filename (more often than not, this appears to be the time).
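As an illustration of the kind of function meant here, the following is a minimal sketch of a callable that pulls a timestamp out of a filename. It assumes the callable is invoked with the arguments `index`, `fs`, `var` and `fn` (as I recall from the kerchunk docs; please check this against the installed version), and the `YYYYMMDD` filename convention and the name `time_from_filename` are made up for the example.

```python
import re
from datetime import datetime

# Hypothetical filename convention: an 8-digit date (YYYYMMDD) in the basename.
_DATE = re.compile(r"(\d{8})")


def time_from_filename(index, fs, var, fn):
    """Return the date encoded in the filename `fn`.

    The argument names follow what kerchunk is assumed to pass to a
    `coo_map` callable; only `index` and `fn` are used here.
    """
    if fn is None:
        # No path information was forwarded (e.g. only reference dicts were given).
        raise ValueError(f"no filename available for input {index}")
    basename = fn.rsplit("/", 1)[-1]  # ignore digits in directory names
    match = _DATE.search(basename)
    if match is None:
        raise ValueError(f"no date found in {basename!r}")
    return datetime.strptime(match.group(1), "%Y%m%d")


# Assumed usage (not executed here, requires kerchunk):
# MultiZarrToZarr(urls, concat_dims=["time"], coo_map={"time": time_from_filename})
```

The `fn is None` branch is exactly the case this issue is about: when only reference dicts reach the combiner, a function like this has nothing to work with.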
As far as I understand it, `OpenWithKerchunk` opens the file and forwards the references `dict` (no writing to disk), so the filepath is consumed in that step. `MultiZarrToZarr` allows two keyword arguments in this case (`paths` and `indicts`), so I wonder if it would be possible to have `OpenWithKerchunk` forward the filepaths to `CombineReferences` in addition to the references `dict`s?
For example, instead of returning the raw references in `open_with_kerchunk`, it might be possible to wrap both in a `tuple` / `dict` and have the combiner unpack it (or maybe `beam` has a specialized structure / pattern for something like this?)
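To make the tuple idea concrete, here is a rough sketch in plain Python (no Beam or kerchunk imports). The helpers `open_with_path` and `unpack_pairs` are hypothetical names, and `opener` stands in for whatever actually produces the references; none of this is the existing pangeo-forge-recipes code.

```python
from typing import Any, Callable, Dict, Iterable, List, Tuple

Refs = Dict[str, Any]  # a kerchunk-style references dict


def open_with_path(url: str, opener: Callable[[str], Refs]) -> Tuple[str, Refs]:
    """Hypothetical opener wrapper: keep the url next to the references."""
    return url, opener(url)


def unpack_pairs(pairs: Iterable[Tuple[str, Refs]]) -> Tuple[List[str], List[Refs]]:
    """Hypothetical combiner-side helper: split pairs into paths and indicts."""
    paths: List[str] = []
    indicts: List[Refs] = []
    for path, refs in pairs:
        paths.append(path)
        indicts.append(refs)
    return paths, indicts


# Example with a fake opener that returns a tiny references dict.
fake_opener = lambda url: {"version": 1, "refs": {".zgroup": url}}
pairs = [open_with_path(u, fake_opener) for u in ["a_20200101.nc", "a_20200102.nc"]]
paths, indicts = unpack_pairs(pairs)
```

The two lists returned by `unpack_pairs` would then be what gets handed to the `paths` and `indicts` arguments mentioned above.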
@keewis thanks for the thoughtful write-up. This sounds very reasonable and I'd welcome a PR implementing this. If you'd like to give this a shot, please let me know if there's any further information/context I can provide to help you get started.
After some investigation this seems a bit trickier than I had anticipated. As far as I understand it, what happens in `CombineReferences` is that `CombineMultiZarrToZarr` (a `CombineFn` subclass) is repeatedly applied to subsets of the extracted references using `beam.CombineGlobally`. This means that there is no difference between the first combination and any subsequent combinations, even though passing the filenames only makes sense for the former.
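To show where the difficulty sits, here is a toy stand-in for a `CombineFn` written in plain Python. It uses the four method names of `beam.CombineFn` but does not import Beam, and the class name `PairCollector` is made up. It is only one possible way around the problem, not what pangeo-forge-recipes does: the accumulator keeps the raw `(path, refs)` pairs and nothing is combined until `extract_output`, so the paths are still available at the one point where they matter. The obvious cost is that every reference dict is held until the end instead of being merged incrementally.

```python
class PairCollector:
    """Toy stand-in for a beam.CombineFn (same method names, no Beam import)."""

    def create_accumulator(self):
        return []

    def add_input(self, accumulator, element):
        # `element` is a raw (path, refs) pair coming straight from the opener.
        return accumulator + [element]

    def merge_accumulators(self, accumulators):
        # Accumulators from different bundles are merged here; because they
        # still hold raw pairs, no path information has been lost yet.
        merged = []
        for acc in accumulators:
            merged.extend(acc)
        return merged

    def extract_output(self, accumulator):
        # Only now would a single combination run, with paths and refs aligned.
        ordered = sorted(accumulator, key=lambda pair: pair[0])
        paths = [path for path, _ in ordered]
        indicts = [refs for _, refs in ordered]
        return paths, indicts


# Simulate two bundles being processed separately and then merged.
fn = PairCollector()
bundle_1 = fn.add_input(fn.create_accumulator(), ("b.nc", {"refs": {}}))
bundle_2 = fn.add_input(fn.create_accumulator(), ("a.nc", {"refs": {}}))
bundle_2 = fn.add_input(bundle_2, ("c.nc", {"refs": {}}))
result_paths, result_indicts = fn.extract_output(
    fn.merge_accumulators([bundle_1, bundle_2])
)
```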
So I guess I'm not sure how to best implement this? If it helps, I can put my current state (which is nothing more than a hack, really) in a PR.
@keewis it would be great if you could submit what you have as a PR! Happy to help out if I can at all.
Apologies for abandoning this.
I no longer believe it is a good idea to resolve this here by passing along the urls. Instead, I think `kerchunk` should allow extracting additional data directly in the extraction function (in `SingleHdf5ToZarr`, for example). That way, there would be no need to access the urls again, so we wouldn't have to figure out how to pass them along (and the only thing to do here would be to expose the relevant options in `OpenWithKerchunk`).
See also the discussion on the pangeo discourse.