Why are coordinates being chunked by default with `StoreToZarr`?

Open abarciauskas-bgse opened this issue 1 year ago • 4 comments

👋🏽 I'm using pangeo-forge to create some Zarr stores to test with rio_tiler's XarrayReader. I found that when generating image tiles, performance was significantly impacted by the number of S3 calls made when opening the dataset, and that this was because the coordinates were chunked. The code to reproduce this issue is here: https://nbviewer.org/gist/abarciauskas-bgse/c826a49a966f3c157626f45ea816330b
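To make the cost concrete, here is a rough back-of-the-envelope sketch of why chunked coordinates hurt. It assumes each Zarr chunk is stored as a separate object, so loading a whole coordinate array takes roughly one GET request per chunk. The function name `chunk_reads` and the array sizes are hypothetical, chosen only for illustration; they are not taken from the linked notebook.

```python
import math


def chunk_reads(shape, chunks):
    """Number of chunk objects that must be fetched to load a whole array."""
    return math.prod(math.ceil(n / c) for n, c in zip(shape, chunks))


# Hypothetical 1-D time coordinate with 8760 values (sizes are made up).
chunked = chunk_reads((8760,), (24,))         # coordinate split into 24-value chunks
consolidated = chunk_reads((8760,), (8760,))  # coordinate stored as one chunk

print(f"chunked: {chunked} reads, consolidated: {consolidated} read")
```

Because dimension coordinates are typically loaded in full when a dataset is opened, the per-chunk request count applies to every open, not only to data access.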

abarciauskas-bgse, Jul 29 '23 17:07

Thanks for raising this, @abarciauskas-bgse.

Based on the notebook it looks like you are using a development version of pangeo-forge-recipes that predates the 0.10.0 release?

I don't have any specific reason to think this is fixed in 0.10.0, but could you confirm that the issue is present there as well?

cisaacstern, Jul 29 '23 19:07

Thanks for looking, @cisaacstern. I just installed pangeo-forge-recipes 0.10.0 and verified that the coordinates are also chunked when using that version of the library (the code linked in the notebook has been updated to reflect this library version).

abarciauskas-bgse, Jul 29 '23 21:07

Thanks for checking that, @abarciauskas-bgse. AFAICT, this feature did exist in 0.9.4 here:

https://github.com/pangeo-forge/pangeo-forge-recipes/blob/3b3c13c358259a9b01367c91951d28826b6c252e/pangeo_forge_recipes/recipes/xarray_zarr.py#L665-L692

Re-implementing it as part of the Beam rewrite was simply overlooked. Adding it back would probably mean defining a new transform:


class ConsolidateDimensionCoordinates(beam.PTransform):
    ...

and slotting it in (optionally, perhaps, but as the default?) at the end of `StoreToZarr`, something like:


 target_store = schema | PrepareZarrTarget( 
      target=self.get_full_target(), target_chunks=self.target_chunks 
 ) 
- return rechunked_datasets | StoreDatasetFragments(target_store=target_store)
+ stored_fragments = rechunked_datasets | StoreDatasetFragments(target_store=target_store)
+ return (
+     stored_fragments
+     if not self.consolidate_coords
+     else stored_fragments | ConsolidateDimensionCoordinates(target_store=target_store)
+ )
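As a rough illustration of what a `ConsolidateDimensionCoordinates` step would need to do at the storage level, here is a toy sketch. It is a stdlib-only model, not the zarr API and not the pangeo-forge-recipes implementation: the store is a plain `dict`, chunk payloads are Python lists rather than encoded bytes, and the function name `consolidate_dimension_coordinate` is hypothetical. Only the key layout (`<name>/.zarray` metadata plus `<name>/<i>` chunk keys) loosely mimics a Zarr v2 store.

```python
import json


def consolidate_dimension_coordinate(store: dict, name: str) -> None:
    """Rewrite the 1-D array `name` in a toy Zarr-v2-like store as one chunk."""
    meta = json.loads(store[f"{name}/.zarray"])
    (length,) = meta["shape"]
    (chunk_len,) = meta["chunks"]
    n_chunks = -(-length // chunk_len)  # ceiling division

    # Read every existing chunk in order, removing it from the store.
    values = []
    for i in range(n_chunks):
        values.extend(store.pop(f"{name}/{i}"))
    values = values[:length]  # drop any padding in the final chunk

    # Write a single chunk spanning the whole array and update the metadata.
    store[f"{name}/0"] = values
    meta["chunks"] = [length]
    store[f"{name}/.zarray"] = json.dumps(meta)


# Toy store: a 5-element "time" coordinate split into chunks of 2.
store = {
    "time/.zarray": json.dumps({"shape": [5], "chunks": [2]}),
    "time/0": [0, 1],
    "time/1": [2, 3],
    "time/2": [4],
}
consolidate_dimension_coordinate(store, "time")
print(sorted(store))
```

In a real pipeline this would have to run only after every fragment has been written, which is why the diff above places the step after `StoreDatasetFragments`.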

I could potentially take a look at this a few weeks or so from now, but would be thrilled if you or someone else wanted to take it on, and would be happy to do an async and/or video walkthrough to get you (or whoever wanted to work on this) up to speed.

Thanks for catching this!

cisaacstern, Aug 04 '23 01:08

Thank you @cisaacstern for looking into this!

abarciauskas-bgse, Aug 09 '23 00:08