xarray icon indicating copy to clipboard operation
xarray copied to clipboard

Allow no padding for rolling windows

Open kmsquire opened this issue 2 years ago • 15 comments

  • [x] Fixes #4743
  • [x] Tests added
  • [x] Passes pre-commit run --all-files
  • [ ] User visible changes (including notable bug fixes) are documented in whats-new.rst

Related to #2007.

kmsquire avatar Jul 14 '21 19:07 kmsquire

Unit Test Results

         6 files           6 suites   1h 9m 48s :stopwatch: 17 081 tests 15 238 :heavy_check_mark: 1 735 :zzz: 108 :x: 95 682 runs  86 859 :heavy_check_mark: 8 175 :zzz: 648 :x:

For more details on these failures, see this check.

Results for commit 5a2eadc4.

:recycle: This comment has been updated with latest results.

github-actions[bot] avatar Jul 14 '21 19:07 github-actions[bot]

FWIW, the test failures are all the same error, which should be fixed, but should have nothing to do with this PR.

kmsquire avatar Jul 14 '21 21:07 kmsquire

Thanks @kmsquire I'll try and review this soon.

dcherian avatar Jul 16 '21 16:07 dcherian

Hello @kmsquire! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! :beers:

Comment last updated at 2021-08-11 16:38:29 UTC

pep8speaks avatar Jul 16 '21 20:07 pep8speaks

@dcherian Thanks. I thought I was done, but I'm finding things that aren't working as expected. I'm working on adding more tests for expected behavior, so marking as WIP. Will ping back here when I'm finished (maybe today, maybe early next week).

kmsquire avatar Jul 16 '21 20:07 kmsquire

@dcherian Okay, I think this is in good shape. I added some more tests, and fixed a few more bugs. Most of the fixes have been squashed back down to the original commit.

I left the second commit separate for now because it's breaking. Previously, iterating over a rolling window returned only returned blocks present in the original array, and ignored the chunk of nans before the start or after the end of the array. This meant that the view of the DataArray or Dataset that was returned for each iteration was potentially a different size.

Here, instead, the iterator was changed so that each returned view matches the corresponding slice of the output of the construct() function. To me, this seems more intuitive, and makes it easier to develop algorithms without having to call construct.

Other than the fact that it's breaking, the main drawback (and difference with construct()) is that the windows themselves are labeled with coordinates along the rolling axis, except for any nan values before the start or after the end of the actual data, which do not have labels. For comparison, in construct, the window dimension does not have any coordinates associated with it by default.

After writing this, I'm wondering if it might be useful to simply drop the coordinates along the rolling axis, so that the behavior matches the behavior of construct() even more closely.

I'm open to thoughts/comments/criticisms/suggestions/questions/whatever.

kmsquire avatar Jul 16 '21 23:07 kmsquire

Also, FWIW, the test failure was caused by a problem in zarr/fsspec (https://github.com/intake/filesystem_spec/issues/707), which is fixed in master on fsspec (https://github.com/intake/filesystem_spec/pull/710).

So it should be fixed here whenever fsspec makes a release.

kmsquire avatar Jul 17 '21 00:07 kmsquire

@dcherian Thank you for reviewing. I've started working through your comments.

kmsquire avatar Jul 19 '21 21:07 kmsquire

However, pad has so many arguments that it might become a nightmare to thread them all through. So maybe just sticking with pad=False is ok.

We could also do pad={"mode": "wrap"} or pad=True, pad_kwargs={...}

I think i prefer .pad(time=5, mode="wrap").rolling(time=5, pad=False) but it does require specifying the window length twice. I was going to bring this up at our next dev meeting (tomorrow!)

cc @pydata/xarray

dcherian avatar Jul 20 '21 11:07 dcherian

The code also introduces a problem, the following no longer works:

xr.DataArray([1, 2, 3], coords=dict(x=[1, 1, 1])).rolling(x=3).mean()

thus you may have to use isel instead of sel if possible. Therefore I was also unable to test the following (from #2007 (comment)):

monthly.pad(month=n_months, mode="wrap").rolling(center=True, month=n_months, pad=False).mean(skipna=False)

Fixed and added a test for this. The example at the bottom now works.

kmsquire avatar Jul 20 '21 22:07 kmsquire

Mentioned in one of the comments above, but I think I've reached about the amount of time that I can spend on this right now. If there are other minor changes, please do let me know. I can also back out the breaking change if desired (although that will probably take some commit surgery).

kmsquire avatar Jul 20 '21 23:07 kmsquire

Thanks @kmsquire we can take it from here.

There was some discussion about syntax at the dev meeting today. There were multiple votes in favour of allowing full control of padding in the rolling object itself. So we could have

  1. .rolling(time=5, x=3, pad={"x": {"mode": "wrap"}, "time": False}), OR
  2. .rolling(time=5, x=3).pad({"x": {"mode": "wrap"}, "time": False})

@pydata/xarray thoughts?

dcherian avatar Jul 21 '21 16:07 dcherian

There was some discussion about syntax at the dev meeting today. There were multiple votes in favour of allowing full control of padding in the rolling object itself. So we could have

  1. .rolling(time=5, x=3, pad={"x": {"mode": "wrap"}, "time": False}), OR
  2. .rolling(time=5, x=3).pad({"x": {"mode": "wrap"}, "time": False})

FWIW, I think 1 would be more efficient (but perhaps harder to implement). With this one, the DataArrayRolling object is only created once, with all of the required information.

If 2 were implemented, what should the return value of the .pad() call be? Would it call construct on the rolling object and return a DataArray? Or would it return an updated DataArrayRolling object? Or would it be a new type entirely (e.g., DataArrayPaddedRolling)?

kmsquire avatar Jul 22 '21 17:07 kmsquire

Your points about (2) being hard to interpret are a good reason to go with (1)! I was thinking it would return a DatasetRolling object but that's not the only interpretation as you point out.

dcherian avatar Jul 22 '21 17:07 dcherian