Implementing map_overlap

Open jakirkham opened this issue 6 years ago • 13 comments

Just as there are map_blocks and map_overlap methods for Dask Array, it would be useful to have equivalent methods for Xarray objects. This would make it easier to leverage duck typing to work with both Dask Arrays and Xarray objects.

Edit: I should add that this came up a few times at the recent SciPy sprints.

jakirkham avatar Jul 18 '19 22:07 jakirkham

+1. The split_by_chunks method in this comment (https://github.com/pydata/xarray/issues/1093#issuecomment-259213382) would also be useful for more general per-chunk manipulation.

dcherian avatar Jul 18 '19 22:07 dcherian

That sounds somewhat similar to the .blocks accessor in Dask Array ( https://github.com/dask/dask/pull/3689 ). Maybe we should align on that as well?

jakirkham avatar Jul 18 '19 23:07 jakirkham

Another approach for the split_by_chunks implementation would be...

import dask.array as da

def split_by_chunks(a):
    # Yield each chunk's slice together with the data it selects.
    for sl in da.core.slices_from_chunks(a.chunks):
        yield (sl, a[sl])

While a little bit more cumbersome to write, this could be implemented with .blocks and may be a bit more performant.

import numpy as np
import dask.array as da

def split_by_chunks(a):
    for i, sl in zip(np.ndindex(a.numblocks), da.core.slices_from_chunks(a.chunks)):
        yield (sl, a.blocks[i])

If the slices are not strictly needed, this could be simplified a bit more.

import numpy as np

def split_by_chunks(a):
    # Visit every block index in C order and yield the matching block.
    for i in np.ndindex(a.numblocks):
        yield a.blocks[i]

Admittedly, slices_from_chunks is an internal utility function, though it is unlikely to change. We could consider exposing it as part of the API if that is useful.
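
For readers unfamiliar with it, here is a minimal pure-Python sketch of what a helper like slices_from_chunks computes, assuming a Dask-style chunks tuple (one tuple of chunk lengths per axis). The name chunk_slices is hypothetical, and this is an illustration rather than Dask's actual implementation.

```python
from itertools import accumulate, product


def chunk_slices(chunks):
    """Return one tuple of slices per block, in C (row-major) order.

    `chunks` holds one tuple of chunk lengths per axis, e.g. ((2, 1), (3,)).
    Hypothetical helper, for illustration only.
    """
    per_axis = []
    for lengths in chunks:
        # Running totals give the stop of each chunk; each start is the
        # previous stop (0 for the first chunk).
        stops = list(accumulate(lengths))
        starts = [0] + stops[:-1]
        per_axis.append([slice(a, b) for a, b in zip(starts, stops)])
    # The Cartesian product pairs every chunk along one axis with every
    # chunk along the other axes.
    return list(product(*per_axis))


for block_slices in chunk_slices(((2, 1), (3,))):
    print(block_slices)
```

Because the blocks come out in row-major order, they line up with np.ndindex over the number of blocks, which is what the zip-based version above relies on.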

We could consider other things like making .blocks iterable, which could make this more friendly as well. Raised issue ( https://github.com/dask/dask/issues/5117 ) on this point.

jakirkham avatar Jul 19 '19 00:07 jakirkham

map_blocks went in as of #3276. We'll leave this open for the future work implementing map_overlap.

jhamman avatar Oct 11 '19 00:10 jhamman

I'm thinking through a map_overlap API right now. In dask, map_overlap requires a few extra arguments:

    depth: int, tuple, dict or list
        The number of elements that each block should share with its neighbors
        If a tuple or dict then this can be different per axis.
        If a list then each element of that list must be an int, tuple or dict
        defining depth for the corresponding array in `args`.
        Asymmetric depths may be specified using a dict value of (-/+) tuples.
        Note that asymmetric depths are currently only supported when
        ``boundary`` is 'none'.
        The default value is 0.
    boundary: str, tuple, dict or list
        How to handle the boundaries.
        Values include 'reflect', 'periodic', 'nearest', 'none',
        or any constant value like 0 or np.nan.
        If a list then each element must be a str, tuple or dict defining the
        boundary for the corresponding array in `args`.
        The default value is 'reflect'.
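
As a point of reference, a small sketch of how these two arguments are used on a plain Dask array today. The function three_point_mean and the input values are made up for illustration. A periodic boundary is chosen so that the chunked result can be compared against the same function applied to the whole NumPy array.

```python
import numpy as np
import dask.array as da


def three_point_mean(block):
    # Average each element with its left and right neighbours.
    # np.roll wraps around at the ends of whatever array it is given.
    return (np.roll(block, 1) + block + np.roll(block, -1)) / 3


values = np.arange(8, dtype=float)
x = da.from_array(values, chunks=4)

# depth=1 shares one element with each neighbouring block; the shared
# elements are trimmed again after the function has run on each block.
smoothed = x.map_overlap(three_point_mean, depth=1, boundary="periodic").compute()

# With a periodic boundary this should agree with the unchunked result.
expected = three_point_mean(values)
print(np.allclose(smoothed, expected))
```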

In dask.array, the dict forms of depth and boundary are keyed by axis number. For xarray we would want to allow dimension names there.
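
A rough sketch of what that translation could look like, assuming the dimension names are available as an ordered tuple (as in DataArray.dims). The helper depth_by_axis is hypothetical and not part of xarray or dask. Treating unmentioned dimensions as depth 0 is also an assumption.

```python
def depth_by_axis(dims, depth):
    """Translate a depth keyed by dimension name into one keyed by axis number.

    Hypothetical helper: `dims` is the ordered tuple of dimension names;
    dimensions that are not mentioned get a depth of 0.
    """
    if not isinstance(depth, dict):
        # A scalar depth applies to every axis.
        return {axis: depth for axis in range(len(dims))}
    unknown = set(depth) - set(dims)
    if unknown:
        raise ValueError(f"unknown dimensions: {sorted(unknown, key=str)}")
    return {axis: depth.get(name, 0) for axis, name in enumerate(dims)}


print(depth_by_axis(("time", "y", "x"), {"y": 2, "x": 1}))
# prints {0: 0, 1: 2, 2: 1}
```

The same lookup would apply to a boundary dict, with a default boundary in place of 0.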

I'm not sure how to handle the DataArray labels for the boundary chunks (dask docs at https://docs.dask.org/en/latest/array-overlap.html#boundaries). For reflect / periodic I think things are OK: we can perhaps just use the label associated with that value. I'm not sure what to do for constants.
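
To make the labelling question concrete, a pure-Python sketch of how coordinate labels could follow the padded values. The helper padded_labels is hypothetical. It assumes depth is no larger than the axis length and that "reflect" mirrors about the array edge, repeating the edge element (other conventions exist). The None placeholder for constants marks the open question rather than answering it.

```python
def padded_labels(labels, depth, boundary):
    """Return the coordinate labels of an axis extended by `depth` on each side.

    Hypothetical helper, for illustration only.
    """
    labels = list(labels)
    if depth == 0:
        return labels
    if boundary == "periodic":
        # Values wrap around, so each padded cell reuses the label of the
        # cell it was copied from.
        left, right = labels[-depth:], labels[:depth]
    elif boundary == "reflect":
        # Mirrored values reuse the labels of their source cells.
        left, right = labels[:depth][::-1], labels[-depth:][::-1]
    else:
        # A constant fill value has no source cell, hence no natural label.
        left = right = [None] * depth
    return left + labels + right


print(padded_labels([10, 20, 30, 40], 1, "periodic"))
print(padded_labels([10, 20, 30, 40], 1, 0))
```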

TomAugspurger avatar Aug 03 '20 19:08 TomAugspurger

This issue about coordinate labels for boundaries exists with pad too: https://github.com/pydata/xarray/issues/3868

Can map_overlap just use DataArray.pad and we can fix things there?

Or perhaps we can expect users to add a call to pad before map_overlap?

dcherian avatar Aug 03 '20 20:08 dcherian

Thanks for that link. I hope that map_overlap could use pad internally for the external boundaries.

TomAugspurger avatar Aug 03 '20 21:08 TomAugspurger

Yeah +1 for using pad instead. Had tried to get rid of map_overlap's padding and use da.pad in Dask as well ( https://github.com/dask/dask/pull/5052 ), but haven't had time to get back to that.

jakirkham avatar Aug 03 '20 22:08 jakirkham

In order to maintain a list of currently relevant issues, we mark issues as stale after a period of inactivity.

If this issue remains relevant, please comment here or remove the stale label; otherwise it will be marked as closed automatically.

stale[bot] avatar Apr 17 '22 17:04 stale[bot]

Would be good to keep this open

jakirkham avatar Apr 17 '22 19:04 jakirkham

Indeed, this would be very useful in a great number of cases!

j2bbayle avatar Apr 25 '22 08:04 j2bbayle

Very much in need of this one to enable satellite image filtering across blocks.

jdoblas avatar Apr 08 '23 21:04 jdoblas

+1 that would be super useful.

odhondt avatar Sep 24 '24 10:09 odhondt