
Incremental rechunk


rechunker solves a problem I was trying to solve in a much cleaner way, thanks a lot for working on that. I've tried on the GPM dataset and it seems to work fine. Do you know if it would work in an incremental mode? By that I mean that if I have already rechunked a part of a dataset, and want to continue later on, is it possible to rechunk only the remaining source and append that to the already rechunked destination?

davidbrochart avatar Jun 11 '20 14:06 davidbrochart

I'm glad this is helpful! 😄

Do you know if it would work in an incremental mode?

It should definitely be possible in principle. But not as currently implemented.

We are trying to release this soon with its current feature set. Once we stabilize the API a bit, we would be happy to have a PR that would add incremental support.

rabernat avatar Jun 11 '20 14:06 rabernat

Hi @davidbrochart. We have done a first release and have some decent docs up. It would be fantastic if you wanted to tackle the incremental case. What sort of API did you have in mind?

rabernat avatar Jul 17 '20 03:07 rabernat

Great @rabernat, I'll try to implement the incremental rechunking. As far as the API is concerned, we probably want to slice the source, so that we don't have to rechunk the whole dataset and can restart from a different position. So for the initial rechunk we could have:

import zarr
from rechunker import rechunk

source = zarr.ones((4, 4), chunks=(1, 4), store="source.zarr")
intermediate = "intermediate.zarr"
target = "target.zarr"
rechunked = rechunk(source,
                    target_chunks=(2, 2),
                    target_store=target,
                    max_mem=256000,
                    temp_store=intermediate,
                    source_slice=((0, 2), (0, 4)))

And for the next rechunk we need to get the next slice and specify that it should be appended to the previous target:

rechunked = rechunk(source,
                    target_chunks=(2, 2),
                    target_store=target,
                    max_mem=256000,
                    temp_store=intermediate,
                    source_slice=((2, 4), (0, 4)),
                    target_append=True)
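
To make the proposal concrete, here is a sketch of the slice bookkeeping such an API would imply. Note that `source_slice` and `target_append` do not exist in rechunker today, and `normalize_source_slice` is a hypothetical helper, not part of any library:

```python
# Illustrative sketch only: shows the validation/defaulting a source_slice
# argument would need. Not part of rechunker's actual API.

def normalize_source_slice(shape, source_slice=None):
    """Default to the full extent, and validate bounds per dimension."""
    if source_slice is None:
        # No slice given: rechunk the whole array.
        return tuple((0, n) for n in shape)
    if len(source_slice) != len(shape):
        raise ValueError("source_slice needs one (start, stop) pair per dimension")
    for (start, stop), n in zip(source_slice, shape):
        if not (0 <= start < stop <= n):
            raise ValueError(f"invalid slice ({start}, {stop}) for size-{n} dimension")
    return tuple(source_slice)

# The two calls in the proposal above would normalize to:
normalize_source_slice((4, 4), ((0, 2), (0, 4)))  # → ((0, 2), (0, 4))
normalize_source_slice((4, 4), ((2, 4), (0, 4)))  # → ((2, 4), (0, 4))
normalize_source_slice((4, 4))                    # → ((0, 4), (0, 4)), i.e. everything
```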

What do you think?

davidbrochart avatar Jul 17 '20 07:07 davidbrochart

I'm curious why we need the source_slice argument. It seems like we should be able to just pass a sliced array, no?

But I guess zarr may not support lazy slicing.

rabernat avatar Jul 17 '20 14:07 rabernat

But I guess zarr may not support lazy slicing.

Yes, I think if we slice the Zarr array we get an in-memory NumPy array.
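
The distinction can be illustrated without zarr at all. These are toy classes, not zarr's or dask's real implementations: an "eager" array materializes data as soon as it is indexed (like zarr), while a "lazy" view only records the requested slice until asked to compute (like dask):

```python
# Toy illustration of eager vs. lazy slicing; purely for explanation.

class EagerArray:
    """Indexing immediately returns concrete data, like slicing a zarr array."""
    def __init__(self, data):
        self.data = data

    def __getitem__(self, idx):
        return self.data[idx]  # data is materialized right here

class LazySlice:
    """Indexing only records the index; data is read on .compute()."""
    def __init__(self, data, idx=None):
        self.data, self.idx = data, idx

    def __getitem__(self, idx):
        return LazySlice(self.data, idx)  # no data touched yet

    def compute(self):
        return self.data[self.idx]

eager = EagerArray(list(range(8)))
lazy = LazySlice(list(range(8)))

eager[2:4]        # → [2, 3], already in memory
view = lazy[2:4]  # nothing read yet
view.compute()    # → [2, 3]
```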

davidbrochart avatar Jul 17 '20 14:07 davidbrochart

Thoughts on the API @TomAugspurger, @tomwhite?

rabernat avatar Jul 17 '20 14:07 rabernat

This feature will be very useful. The API looks good to me.

I briefly wondered if source_slice is needed at all, since in append mode only new data would be rechunked, but that's not safe if the source is being written to at the same time as being incrementally rechunked. So source_slice is needed. It should be optional though to support the non-incremental case.
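
To sketch what explicit slicing buys the caller: progress can be tracked outside rechunker, so a run can resume from a known offset even while the source keeps growing. `next_source_slice` below is a hypothetical caller-side helper, not part of rechunker:

```python
# Hypothetical caller-side bookkeeping for incremental rechunking along the
# first axis: record how far we got, then compute the next source_slice.

def next_source_slice(shape, done_rows, batch_rows):
    """Slice covering the next batch of rows, full extent on other axes."""
    stop = min(done_rows + batch_rows, shape[0])
    if stop <= done_rows:
        return None  # nothing left to rechunk
    return ((done_rows, stop),) + tuple((0, n) for n in shape[1:])

shape = (4, 4)
next_source_slice(shape, done_rows=0, batch_rows=2)  # → ((0, 2), (0, 4))
next_source_slice(shape, done_rows=2, batch_rows=2)  # → ((2, 4), (0, 4))
next_source_slice(shape, done_rows=4, batch_rows=2)  # → None, all done
```

These are exactly the two slices in the proposal above, followed by a "done" signal.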

tomwhite avatar Jul 17 '20 15:07 tomwhite

Also, even if the source is not being written to, you may not want to rechunk all of it at once, because that can take a lot of time; you should be able to rechunk it in parts. And source_slice should be optional in the incremental case too, in which case the whole dataset would be rechunked.

davidbrochart avatar Jul 17 '20 21:07 davidbrochart