cubed icon indicating copy to clipboard operation
cubed copied to clipboard

Defer to merge_chunks in special cases of rechunk

Open TomNicholas opened this issue 2 years ago • 4 comments

#221 introduced merge_chunks, a special-case of rechunk that can be implemented using blockwise. I noticed that whilst reduction calls merge_chunks directly, inside ops.rechunk the primitive rechunk is always called. Shouldn't it be possible for ops.rechunk to check if the user is asking them to perform that special case, and internally dispatch to merge_chunks?

This also makes me wonder whether there are any other special cases of rechunk that could be written using blockwise.

TomNicholas avatar Jul 31 '23 17:07 TomNicholas

This should be possible, but I'm not sure how much this occurs in practice. The only calls to rechunk in Cubed are in reshape and from_array (for some cases).

tomwhite avatar Aug 01 '23 10:08 tomwhite

It will also happen if an xarray user calls .chunk on an already-chunked array, because it dispatches to cubed's rechunk method.

I agree it's not a very common case (though I expect it to come up in the full pangeo vorticity example where we pad then rechunk to merge the padded values back in).

TomNicholas avatar Aug 01 '23 21:08 TomNicholas

there are any other special cases of rechunk that could be written using blockwise. It will also happen if an xarray user calls .chunk on an already-chunked array

This is very confusing to me.

Isn't a rechunk without a on-disk intermediate not "blockwise" by definition (you are communicating across chunks)? I thought the optimization was effectively optimizing chunking when reading from an intermediate store by looking at the chunks needed for the succeeding operation. But perhaps I'm misunderstanding something.

dcherian avatar Aug 01 '23 21:08 dcherian

merge_chunks is implemented using map_direct, which

works by creating an empty array that has the same shape and chunk structure as the output, and calling map_blocks on this empty array, passing in the input arrays as side inputs to the function, which may access them in whatever way is needed.

That might resolve it for you @dcherian ?

Alternatively, the way I have been thinking about this merge_chunks operation is as just one-half of what rechunker does. In Tom A's original suggestion that led to rechunker, he breaks general rechunking into a split pass and a merge pass. If you can accomplish the specific rechunk only doing the merge pass you don't need the intermediate store.

(This also suggests that an equivalent split_chunks might also be possible to implement using map_direct)

TomNicholas avatar Aug 01 '23 22:08 TomNicholas