Chunking causes unrelated non-dimension coordinate to become a dask array

What happened:

Rechunking along an independent dimension causes unrelated non-dimension coordinates to become dask arrays. The dimension coordinates do not seem affected.

I can work around it by sticking a synchronous compute on the coordinate, but I wanted to be sure this is the expected behavior.

What you expected to happen:

Chunking along an unrelated dimension should not affect unrelated non-dimension coordinates.

Minimal Complete Verifiable Example:

import xarray as xr
import dask.array as da

def print_coords(a, title):
    print()
    print(title)
    for dim in ['x', 'y', 'b']:
        if dim in a.dims or dim in a.coords:
            print('dim:', dim, 'type:', type(a.coords[dim].data))

arr = xr.DataArray(da.zeros((20, 20), chunks=10), dims=('x', 'y'), 
                   coords={'b': ('y', range(100,120)), 
                           'x': range(20), 
                           'y': range(20)})

print_coords(arr, 'Original')

# The following line rechunks independently of b or y.
# Removing this line allows the code to succeed.
arr = arr.chunk({'x': 5})

print_coords(arr, 'After chunking')

arr = arr.sel(y=2)

print_coords(arr, 'After selection')

print()
print('Scalar values:')
print('y=', arr.coords['y'].item())
print('b=', arr.coords['b'].item())  # Sad Panda
Output:

Original
dim: x type: <class 'numpy.ndarray'>
dim: y type: <class 'numpy.ndarray'>
dim: b type: <class 'numpy.ndarray'>

After chunking
dim: x type: <class 'numpy.ndarray'>
dim: y type: <class 'numpy.ndarray'>
dim: b type: <class 'dask.array.core.Array'>

After selection
dim: x type: <class 'numpy.ndarray'>
dim: y type: <class 'numpy.ndarray'>
dim: b type: <class 'dask.array.core.Array'>

Scalar values:
y= 2

<stack trace elided>
NotImplementedError: 'item' is not yet a valid method on dask arrays
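
The synchronous compute mentioned above recovers the value; a minimal sketch of that workaround, applied to the arr from the example (not an endorsed fix, just what I do today):

# Workaround sketch: load the dask-backed coordinate back into memory
# before calling .item() on it.
print('b=', arr.coords['b'].compute().item())

# Or restore the coordinate on the array itself:
arr = arr.assign_coords(b=arr.coords['b'].compute())
print('b=', arr.coords['b'].item())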

Environment:

Output of xr.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.7.6 | packaged by conda-forge | (default, Jun 1 2020, 18:57:50) [GCC 7.5.0]
python-bits: 64
OS: Linux
OS-release: 4.19.112+
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
libhdf5: 1.10.4
libnetcdf: None

xarray: 0.15.1
pandas: 1.0.5
numpy: 1.18.5
scipy: 1.4.1
netCDF4: None
pydap: None
h5netcdf: None
h5py: 2.10.0
Nio: None
zarr: 2.4.0
cftime: None
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 2.19.0
distributed: 2.19.0
matplotlib: 3.2.2
cartopy: None
seaborn: None
numbagg: None
setuptools: 49.1.0.post20200704
pip: 20.1.1
conda: 4.8.3
pytest: 5.4.3
IPython: 7.16.1
sphinx: None

chrisroat, Jul 07 '20

I have the same problem in xarray 2022.3.0. The issue is that this creates unnecessary dask tasks in the graph, and some operations acting on the coordinates unexpectedly trigger dask computations. "Unexpected" because the coordinates were not chunked at the beginning of the process, so computation that was expected to happen in the main thread (or not happen at all) now happens in the dask workers.

An example:

import numpy as np
import xarray as xr
from dask.diagnostics import ProgressBar

# A 2D variable
da = xr.DataArray(
    np.ones((12, 10)),
    dims=('x', 'y'),
    coords={'x': np.arange(12), 'y': np.arange(10)}
 )

# A 1D variable sharing a dim with da
db = xr.DataArray(
    np.ones((12,)),
    dims=('x',),
    coords={'x': np.arange(12)}
)

# A non-dimension coordinate
cx = xr.DataArray(np.zeros((12,)), dims=('x',), coords={'x': np.arange(12)})

# Assign it to da and db
da = da.assign_coords(cx=cx)
db = db.assign_coords(cx=cx)

# We need to chunk along y
da = da.chunk({'y': 1})

# Notice how `cx` is now a dask array, even though it is a 1D coordinate and does not have 'y' as a dimension.
print(da)

# This triggers a dask computation
with ProgressBar():
    da - db

The reason my example triggers dask is that xarray ensures the coordinates are aligned and equal (I think?). In any case, I didn't expect it.

Personally, I think the chunk method shouldn't apply to the coordinates at all, no matter their dimensions. They're coordinates, so we expect to be able to read them easily when aligning/comparing datasets; dask should be used for the "real" data only. Does this vision fit the devs' view? I feel this "skip" could be easily implemented (a user-side sketch of the idea is below).
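
A user-side sketch of that idea, assuming the current chunking behavior (`restore_numpy_coords` is a hypothetical helper, not part of the xarray API): after chunking, load any dask-backed coordinate back into memory so only the data stays chunked. It reuses `da`, `db` and `ProgressBar` from the example above.

import dask.array

def restore_numpy_coords(obj):
    # Load back into memory any coordinate that became dask-backed after
    # .chunk(), leaving the data variables themselves chunked.
    numpy_coords = {
        name: coord.compute()
        for name, coord in obj.coords.items()
        if isinstance(coord.data, dask.array.Array)
    }
    return obj.assign_coords(numpy_coords)

da = restore_numpy_coords(da)
print(da)  # cx is a numpy array again; the data is still chunked

# The coordinate comparison during alignment now happens in memory,
# so this should no longer show a dask computation.
with ProgressBar():
    da - db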

aulemahal, Jul 06 '22

It makes sense to me that chunking along a dimension `dim` should not chunk variables that don't have that dimension.

@shoyer, what do you think?

dcherian, Jul 12 '22