
Rechunk method for uncompressed arrays

Open TomNicholas opened this issue 1 year ago • 4 comments

@rsignell this is inspired by your blog post (still a WIP for now)

The idea is that you can simply do

from virtualizarr import open_virtual_dataset

vds = open_virtual_dataset('uncompressed_netcdf3.nc')
subchunked_vds = vds.chunk(time=1)

  • [x] Closes #86
  • [x] Tests added
  • [x] Tests passing
  • [x] Full type hint coverage
  • [ ] Changes are documented in docs/releases.rst
  • [ ] New functions/methods are listed in api.rst
  • [ ] New functionality has documentation
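
Why sub-chunking is possible without copying data, as a minimal sketch: when a chunk is stored uncompressed and in C order, any slab along its leading axis is a contiguous byte range inside the original chunk, so a smaller chunk can be described by a new offset and length. The helper name `split_chunk_along_first_axis` and its parameters are hypothetical and only illustrate the arithmetic; this is not VirtualiZarr's implementation, and it covers only the leading-axis case.

```python
from math import prod


def split_chunk_along_first_axis(offset, length, chunk_shape, itemsize, new_len):
    """Split one uncompressed, C-ordered chunk into byte ranges along axis 0.

    Hypothetical helper for illustration only. Returns a list of
    (offset, length) pairs, one per sub-chunk.
    """
    # An uncompressed chunk must occupy exactly itemsize * number_of_elements bytes.
    if length != itemsize * prod(chunk_shape):
        raise ValueError("length does not match an uncompressed chunk of this shape")
    # Assumption: only even splits are handled in this sketch.
    if new_len <= 0 or chunk_shape[0] % new_len != 0:
        raise ValueError("new_len must be a positive divisor of chunk_shape[0]")

    # Bytes covered by new_len consecutive slabs along axis 0.
    step = itemsize * new_len * prod(chunk_shape[1:])
    return [(offset + i * step, step) for i in range(chunk_shape[0] // new_len)]


# One chunk of shape (time=4, y=3, x=2) holding float64 values, stored at byte 1000.
subchunks = split_chunk_along_first_axis(
    offset=1000, length=4 * 3 * 2 * 8, chunk_shape=(4, 3, 2), itemsize=8, new_len=1
)
print(subchunks)
```

With `new_len=1` each time step becomes its own 48-byte range, which is the byte-level picture behind `vds.chunk(time=1)` for this toy layout.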

TomNicholas, Jul 22 '24 16:07

This now works in the sense that the .rechunk method on the ManifestArray class passes dedicated tests (and we can rechunk in however many dimensions we want!), but the new integration test fails because Dataset.chunk is dispatching to dask's version of .rechunk somewhere. This may require a change to xarray's ChunkManagerEntrypoint upstream to fix.

TomNicholas, Jul 23 '24 08:07

Note to self: we should add more validation to the ZArray class to check that the chunks attribute is a tuple of positive integers, and move the zarray.replace call to the start of the method to catch invalid input early.
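
A rough sketch of the kind of check meant here, assuming the validation lives in a standalone function; `validate_chunks` is a hypothetical name, not an existing VirtualiZarr function, and the real check might belong in the `ZArray` constructor instead.

```python
from numbers import Integral


def validate_chunks(chunks):
    """Check that chunks is a tuple of positive integers (hypothetical helper)."""
    if not isinstance(chunks, tuple):
        raise TypeError(f"chunks must be a tuple, got {type(chunks).__name__}")
    for size in chunks:
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(size, bool) or not isinstance(size, Integral) or size <= 0:
            raise ValueError(f"chunk sizes must be positive integers, got {size!r}")
    return chunks


# Valid input is returned unchanged; invalid input raises before any work is done.
print(validate_chunks((1, 10, 10)))
```

Calling something like this at the very start of the rechunk method would surface bad input before any manifest entries are rewritten.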

TomNicholas, Jul 23 '24 08:07

> we can rechunk in however many dimensions we want!

I love that aspect!

rsignell, Jul 23 '24 20:07

> but the new integration test fails because Dataset.chunk is dispatching to dask's version of .rechunk somewhere. This may require a change to xarray's ChunkManagerEntrypoint upstream to fix.

https://github.com/pydata/xarray/pull/9286 has now progressed far enough that this PR works for me at least (when using that xarray branch)! Passing all tests locally 🟢

TomNicholas, Jul 30 '24 00:07