
WIP: Add virtual-rechunk example

Open thodson-usgs opened this issue 1 year ago • 20 comments

Rechunk a virtual dataset

This example demonstrates how to rechunk a collection of netCDF files on S3 into a single Zarr store.

First, lithops and VirtualiZarr construct a virtual dataset composed of the netCDF files on S3. Then, xarray-cubed rechunks the virtual dataset into a Zarr store. Inspired by the Pythia cookbook by @norlandrhagen.

STATUS

I'm pretty sure I got this workflow to work, albeit slowly; however, now I'm getting a new AttributeError. Details below.

PLANNING

Rechunking has been a thorn in the side for many of us, and I think there's general interest in a serverless workflow. It remains to be seen whether this example should live as part of cubed or as part of a Pangeo community of practice. Once this example is working again, the next two steps are:

  1. Increase the chunk size to ~100MB, which might involve finding a better demo dataset. The demo chunks are currently too small, so per-chunk overhead dominates and the workflow runs slowly.
  2. Explore how difficult it would be to alter cubed's rechunk algorithm such that each worker writes multiple chunks, just as rechunker does.
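
To make the ~100MB target in step 1 concrete, here is a minimal sketch of the arithmetic for sizing a chunk. It is not part of the example itself: the helper names, the float32 dtype, the 720 x 1440 grid, and the choice to grow only the leading (time) axis are all assumptions for illustration, and the sizes are uncompressed.

```python
import math


def chunk_nbytes(chunk_shape, itemsize):
    """Uncompressed size in bytes of one chunk (hypothetical helper)."""
    return math.prod(chunk_shape) * itemsize


def scale_time_chunk(chunk_shape, itemsize, target_bytes=100 * 1000**2):
    """Grow only the leading axis (assumed to be time) toward target_bytes.

    The remaining axes are left unchanged; the result never exceeds the
    target unless a single step along the leading axis already does.
    """
    per_step = chunk_nbytes(chunk_shape[1:], itemsize)
    steps = max(1, target_bytes // per_step)
    return (steps, *chunk_shape[1:])


# Assumed demo layout: float32 values, one time step per chunk
small = (1, 720, 1440)
large = scale_time_chunk(small, itemsize=4)

print(small, chunk_nbytes(small, 4))  # (1, 720, 1440) 4147200
print(large, chunk_nbytes(large, 4))  # (24, 720, 1440) 99532800
```

With these assumed numbers, a one-step chunk is about 4MB, so roughly 24 time steps per chunk would be needed to approach 100MB. A dataset with too few time steps to allow this would be one reason to look for a different demo dataset.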

thodson-usgs, Jul 25 '24 15:07