Inconsistent behavior between `DatasetRolling.construct` and `DataArrayRolling.construct` with stride > 1.

Open p4perf4ce opened this issue 3 years ago • 2 comments

What is your issue?

INSTALLED VERSIONS

commit: None
python: 3.8.10 | packaged by conda-forge | (default, May 11 2021, 07:01:05) [GCC 9.3.0]
python-bits: 64
OS: Linux
OS-release: 5.4.0-73-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.12.2
libnetcdf: 4.9.0

xarray: 2022.6.0
pandas: 1.4.2
numpy: 1.19.5
scipy: 1.7.0
netCDF4: 1.6.0
pydap: None
h5netcdf: 1.0.2
h5py: 3.1.0
Nio: None
zarr: 2.12.0
cftime: 1.6.1
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: 1.3.4
dask: 2021.06.2
distributed: 2021.06.2
matplotlib: 3.5.3
cartopy: None
seaborn: 0.12.0
numbagg: None
fsspec: 2021.07.0
cupy: 9.2.0
pint: None
sparse: 0.13.0
flox: None
numpy_groupies: None
setuptools: 49.6.0.post20210108
pip: 21.1.3
conda: 4.10.3
pytest: 6.2.4
IPython: 7.24.1
sphinx: None

Reproducing the problem

I have an xarray Dataset with a single dimension, as shown below (the same happens with any arbitrary xarray Dataset).

> Dimensions:
> time: 11058688

When a rolling operation with non-overlapping windows is applied to a DataArray, it works as one would expect.

dataset.var_a.rolling(time=256).construct('w', stride=256)

11058688 / 256 = 43198

> Dimensions:
> time: 43198, w: 256   # 43198 windows

However, when the same operation is applied to the Dataset:

dataset.rolling(time=256).construct('w', stride=256)
> Dimensions:
> time: 169, w: 256   # How did we even arrive at 169 windows?

I don't see any reason why rolling on a Dataset and on a DataArray should behave differently. Shouldn't rolling on a Dataset just repeat the DataArray rolling on every data variable? This difference in behavior is not mentioned in the documentation either.

p4perf4ce avatar Sep 12 '22 15:09 p4perf4ce

Thanks for the report. I agree that both should give the same result. The code paths are indeed different, but I have not looked into the actual root cause. It could be that this is also not very thoroughly tested (or used!):

https://github.com/pydata/xarray/blob/b018442c8dfa3e71ec35e294de69e2011949afec/xarray/core/rolling.py#L289

https://github.com/pydata/xarray/blob/b018442c8dfa3e71ec35e294de69e2011949afec/xarray/core/rolling.py#L721

B.t.w. a copy-pastable example would be appreciated.

mathause avatar Sep 12 '22 22:09 mathause

Thanks for the response, here is a straightforward example.

import xarray as xr
dummy = list(range(100))
x, y, z = [xr.DataArray(dummy, dims=['t']) for _ in range(3)]
ds = xr.Dataset(
    {'x': x, 'y': y, 'z': z}
)
print(x.rolling(t=4).construct('w', stride=4).shape)
print(ds.rolling(t=4).construct('w', stride=4).x.shape)

Results:

> (25, 4)
> (7, 4)
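
The numbers reported so far are at least consistent with the stride being applied twice on the Dataset path. The following is only an arithmetic sanity check under that assumption, not a confirmation of the cause; `strided_length` is a hypothetical helper that mirrors what slicing with `[::stride]` does to a length.

```python
def strided_length(n, stride):
    # Number of elements kept by n_items[::stride]; equals ceil(n / stride).
    return len(range(0, n, stride))


# Small example from this comment: 100 elements, stride 4.
once_small = strided_length(100, 4)          # stride applied once
twice_small = strided_length(once_small, 4)  # stride applied a second time

# Original report: 11058688 elements, stride 256.
once_large = strided_length(11058688, 256)
twice_large = strided_length(once_large, 256)

print(once_small, twice_small)
print(once_large, twice_large)
```

Applying the stride once reproduces the DataArray window counts (25 and 43198), and applying it a second time reproduces the Dataset window counts (7 and 169).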

I have a hunch that the problem comes from this part. I'm not quite sure what self._mapping_to_list does here; I haven't looked it up yet. https://github.com/pydata/xarray/blob/b018442c8dfa3e71ec35e294de69e2011949afec/xarray/core/rolling.py#L764-L772

Since I only had one dimension to deal with, removing this loop solves the problem for me.

p4perf4ce avatar Sep 12 '22 22:09 p4perf4ce

It's been half a year and I find myself stuck on this inconsistent behavior again. Another problem I found but haven't mentioned yet is that DatasetRolling.construct swaps the rolling dimension name with window_dim, whereas DataArrayRolling.construct doesn't.

This time, I've actually identified the cause of this problem:

https://github.com/pydata/xarray/blob/b018442c8dfa3e71ec35e294de69e2011949afec/xarray/core/rolling.py#L789-L791

.isel({d: slice(None, None, s) for d, s in zip(self.dim, strides)}) 

I still can't figure out what this .isel was originally intended to achieve, since it causes so many problems without any apparent benefit. It should be noted that this can blow up memory usage if the xr.Dataset is reasonably large (it just blew up a 3-channel PPG recording, 135 Hz, 6 hours, from a mere 300 MB to 20-40 GB or more, so I think this is critical).

Solution

Removing the .isel part fixed everything.
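
For anyone who cannot patch their xarray installation, a possible workaround is to skip DatasetRolling.construct and call DataArrayRolling.construct on each data variable, which is the behavior the Dataset path was expected to have. This is a minimal sketch under that assumption; `construct_per_variable` is a hypothetical helper and not part of xarray, and it does not carry over Dataset-level attributes.

```python
import xarray as xr


def construct_per_variable(ds, dim, window, window_dim, stride=1):
    """Hypothetical helper: build rolling windows one data variable at a time."""
    out = {}
    for name, da in ds.data_vars.items():
        if dim in da.dims:
            # Use the DataArray code path, which strides as expected.
            out[name] = da.rolling({dim: window}).construct(window_dim, stride=stride)
        else:
            # Variables without the rolling dimension are passed through unchanged.
            out[name] = da
    return xr.Dataset(out)


# Same data as the copy-pastable example earlier in this thread.
ds = xr.Dataset({name: xr.DataArray(list(range(100)), dims=["t"]) for name in "xyz"})
windows = construct_per_variable(ds, dim="t", window=4, window_dim="w", stride=4)
print(windows.x.shape)
```

Each variable in the result should have the same shape as the corresponding DataArray call, `x.rolling(t=4).construct('w', stride=4)`.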

Test case

import numpy as np
import xarray as xr
tr = xr.DataArray(np.arange(8).reshape(2, 4), dims=('a', 'b'))  # Borrowed from the example in `DataArray.__doc__`.
trd = xr.Dataset(data_vars={i: tr for i in range(3)})  # Three identical variables named 0, 1, 2.

DataArray

tr.rolling(b=2).construct('window_dim', stride=2)

>>> <xarray.DataArray (a: 2, b: 2, window_dim: 2)>
array([[[nan,  0.],
        [ 1.,  2.]],

       [[nan,  4.],
        [ 5.,  6.]]])
Dimensions without coordinates: a, b, window_dim

Dataset

trd.rolling(b=2).construct('window_dim', stride=2)

>>> <xarray.Dataset>
Dimensions:  (a: 2, b: 2, window_dim: 2)
Dimensions without coordinates: a, b, window_dim
Data variables:
    0        (a, b, window_dim) float64 nan 0.0 1.0 2.0 nan 4.0 5.0 6.0
    1        (a, b, window_dim) float64 nan 0.0 1.0 2.0 nan 4.0 5.0 6.0
    2        (a, b, window_dim) float64 nan 0.0 1.0 2.0 nan 4.0 5.0 6.0

trd.rolling(b=2).construct('window_dim', stride=2)[0]

>>> <xarray.DataArray 0 (a: 2, b: 2, window_dim: 2)>
array([[[nan,  0.],
        [ 1.,  2.]],

       [[nan,  4.],
        [ 5.,  6.]]])
Dimensions without coordinates: a, b, window_dim

p4perf4ce avatar Mar 02 '23 20:03 p4perf4ce