Feature request: support `'auto'` in `target_chunks`
Hi,
Thanks for the great package! I'm currently using it in one of my projects to rechunk large symmetric matrices along a given axis. However, I'm missing a feature that I liked in Dask: automatically determining the chunk size for a given dimension. For example, consider the following use case:
```python
import dask.array as da

d = da.ones((10000, 10000))
# 'auto' lets dask pick the chunk size for axis 0; None leaves axis 1 unchanged
d = d.rechunk({0: 'auto', 1: None})
d.to_zarr('my_store.zarr')
```
Is it possible to add a feature to accomplish the same thing in rechunker? I'm currently doing it the hacky way:
```python
import psutil
import zarr
import dask.array as da
from rechunker import rechunk

d = da.ones((10000, 10000))
d.to_zarr('my_store.zarr')
z = zarr.open('my_store.zarr')
...
rechunked = rechunk(
    z,
    # borrow dask's 'auto' logic to compute the target chunk sizes
    target_chunks=d.rechunk({0: 'auto', 1: None}).chunksize,
    target_store=target_store,
    temp_store=intermediate_store,
    max_mem=psutil.virtual_memory().available / psutil.cpu_count(),
)
rechunked.execute()
```
Hope this makes sense. Thanks!
This is a great idea, and I'd love to support it.
One question: how does dask determine the chunk size in the `'auto'` dimensions? Do we feel that the same logic is appropriate in rechunker?
If so, we can probably just reuse dask's `normalize_chunks` function to implement this.
I think the relevant function from Dask is here: https://github.com/dask/dask/blob/a988716cfeb3a9b1015d14a334368e70ae382553/dask/array/core.py#L2709
I believe it depends on a configurable limit on the size of the chunks, `config.get("array.chunk-size")`, which could easily be incorporated into the `rechunk` function. Reusing `normalize_chunks` would also work fine, as it handles many other cases (e.g. `-1` or `None` for some dimensions).
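To make the idea concrete, here is a rough sketch of what delegating to dask's `normalize_chunks` could look like. `resolve_target_chunks` is a hypothetical helper name, not part of either library's API, and the 128 MiB default mirrors dask's `array.chunk-size` default:

```python
import numpy as np
from dask.array.core import normalize_chunks


def resolve_target_chunks(target_chunks, shape, dtype, limit="128MiB"):
    # Hypothetical helper: expand 'auto' / -1 entries in target_chunks
    # into concrete sizes by delegating to dask's normalize_chunks.
    chunks = normalize_chunks(target_chunks, shape=shape, limit=limit, dtype=dtype)
    # normalize_chunks returns one tuple of block sizes per dimension;
    # rechunker expects a single chunk size per dimension, so take the
    # first (full-sized) block along each axis.
    return tuple(c[0] for c in chunks)


# Axis 0 is sized automatically under the byte limit; -1 keeps axis 1 whole.
print(resolve_target_chunks(("auto", -1), shape=(10000, 10000), dtype=np.dtype("float64")))
```

The resulting tuple could then be passed straight to `rechunk(..., target_chunks=...)`, so the `'auto'` handling stays a thin layer over dask's existing logic.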
We would welcome a pull request if you feel comfortable trying to implement this yourself. 😊