pangeo-forge-recipes

Pass filepaths to `MultiZarrToZarr`

Open keewis opened this issue 10 months ago • 4 comments

kerchunk.MultiZarrToZarr allows passing a function to coo_map. Among other arguments, that function receives the filepath / url of the target file, but only if that information is available; otherwise it is set to None. This is useful if some information is only available in the filename (more often than not, this appears to be the time).
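A minimal sketch of such a function, assuming the `(index, fs, var, fn)` callable signature that the kerchunk documentation describes for `coo_map`. The function name and the `YYYYMMDD` filename layout are hypothetical and only serve as an illustration:

```python
import re
from datetime import datetime


def time_from_filename(index, fs, var, fn):
    """Derive a time value from the file name (hypothetical YYYYMMDD layout).

    `fn` is None when only reference dicts were passed, so fail loudly
    instead of silently producing a wrong coordinate.
    """
    if fn is None:
        raise ValueError("no filename available; cannot derive the time coordinate")
    match = re.search(r"(\d{8})", fn)
    if match is None:
        raise ValueError(f"no YYYYMMDD date found in {fn!r}")
    return datetime.strptime(match.group(1), "%Y%m%d")
```

It would then be passed as something like `coo_map={"time": time_from_filename}`; depending on the kerchunk version, the returned value may need converting (for example to `numpy.datetime64`).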

As far as I understand it, OpenWithKerchunk opens the file and forwards the references dict (no writing to disk), so the filepath is consumed in that step. MultiZarrToZarr allows two keyword arguments in this case (path and indicts), so I wonder if it would be possible to have OpenWithKerchunk forward the filepaths to CombineReferences in addition to the references dicts?

For example, instead of returning the raw references in open_with_kerchunk, it might be possible to wrap both in a tuple / dict and have the combiner unpack it (or maybe beam has a specialized structure / pattern for something like this?)
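A rough sketch of the tuple idea in plain Python, without Beam. The helper names `open_with_paths` and `split_pairs` and the `open_refs` callable are hypothetical; only the `path` and `indicts` keyword names come from kerchunk:

```python
def open_with_paths(url, open_refs):
    """Return the url alongside its references instead of the references alone."""
    return url, open_refs(url)


def split_pairs(pairs):
    """Unpack (url, refs) pairs into keyword arguments for the combiner.

    Assumes at least one pair; the order of urls and refs is kept aligned.
    """
    urls, refs = zip(*pairs)
    return {"path": list(urls), "indicts": list(refs)}


# Stand-in for the real extraction step (hypothetical, returns empty references).
def fake_open_refs(url):
    return {"version": 1, "refs": {}}


pairs = [open_with_paths(u, fake_open_refs) for u in ["a.nc", "b.nc"]]
kwargs = split_pairs(pairs)
```

The resulting `kwargs` could then be splatted into the combiner call, so the filenames stay associated with their references.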

keewis avatar Aug 28 '23 09:08 keewis

@keewis thanks for the thoughtful write-up. This sounds very reasonable and I'd welcome a PR implementing this. If you'd like to give this a shot, please let me know if there's any further information/context I can provide to help you get started.

cisaacstern avatar Aug 29 '23 19:08 cisaacstern

After some investigation this seems a bit trickier than I had anticipated. As far as I understand it, what happens in CombineReferences is that CombineMultiZarrToZarr (a CombineFn subclass) is repeatedly applied to subsets of the extracted references using beam.CombineGlobally. This means that the first combination is indistinguishable from any subsequent one, even though passing the filenames only makes sense for the first.
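To illustrate the problem, here is a plain-Python stand-in that mimics the four methods of a Beam `CombineFn`. It is not the pangeo-forge-recipes implementation: the class name and the `_combine` placeholder (which stands in for the MultiZarrToZarr call) are hypothetical.

```python
class PathAwareCombiner:
    """Stand-in for a Beam CombineFn working on (url, refs) pairs."""

    def create_accumulator(self):
        return []  # list of (url_or_None, refs) pairs

    def add_input(self, accumulator, element):
        # First-level combination: elements still carry their url.
        return accumulator + [element]

    def merge_accumulators(self, accumulators):
        # Later combinations: each accumulator collapses into one intermediate
        # result, which no longer corresponds to a single file, so url is None.
        return [(None, self._combine(acc)) for acc in accumulators if acc]

    def extract_output(self, accumulator):
        return self._combine(accumulator)

    def _combine(self, pairs):
        # Placeholder for the real combination: only records which urls were seen.
        sources = []
        for url, refs in pairs:
            sources.extend([url] if url is not None else refs["sources"])
        return {"sources": sources}


combiner = PathAwareCombiner()
acc1 = combiner.add_input(combiner.create_accumulator(), ("a.nc", {}))
acc1 = combiner.add_input(acc1, ("b.nc", {}))
acc2 = combiner.add_input(combiner.create_accumulator(), ("c.nc", {}))
merged = combiner.merge_accumulators([acc1, acc2])
result = combiner.extract_output(merged)
```

After `merge_accumulators`, every pair has `None` in place of a url, so a filename-based `coo_map` function could only do its work during the first level of combination.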

So I guess I'm not sure how to best implement this? If it helps, I can put my current state (which is nothing more than a hack, really) in a PR.

keewis avatar Aug 30 '23 10:08 keewis

@keewis it would be great if you could submit what you have as a PR! Happy to help out if I can at all.

norlandrhagen avatar Aug 30 '23 16:08 norlandrhagen

Apologies for abandoning this.

I no longer believe it is a good idea to resolve this here by passing along the urls. Instead, I think kerchunk should allow extracting additional data directly in the extraction function (in SingleHdf5ToZarr, for example). That way, there would be no need to access the urls again, so we wouldn't have to figure out how to pass them along (and the only thing to do here would be to expose the relevant options in OpenWithKerchunk).

See also the discussion on the pangeo discourse.

keewis avatar Feb 09 '24 12:02 keewis