[Bug]: cannot chunk a DataArray that originated as a coordinate
What happened?
If I construct the following DataArray, and try to chunk its "x" coordinate, I get back a NumPy-backed DataArray:
In [2]: a = xr.DataArray([1, 2, 3], dims=["x"], coords=[[4, 5, 6]])
In [3]: a.x.chunk()
Out[3]:
<xarray.DataArray 'x' (x: 3)>
array([4, 5, 6])
Coordinates:
* x (x) int64 4 5 6
If I construct a copy of the "x" coordinate, things work as I would expect:
In [4]: x = xr.DataArray(a.x, dims=a.x.dims, coords=a.x.coords, name="x")
In [5]: x.chunk()
Out[5]:
<xarray.DataArray 'x' (x: 3)>
dask.array<xarray-<this-array>, shape=(3,), dtype=int64, chunksize=(3,), chunktype=numpy.ndarray>
Coordinates:
* x (x) int64 4 5 6
What did you expect to happen?
I would expect the following to happen:
In [2]: a = xr.DataArray([1, 2, 3], dims=["x"], coords=[[4, 5, 6]])
In [3]: a.x.chunk()
Out[3]:
<xarray.DataArray 'x' (x: 3)>
dask.array<xarray-<this-array>, shape=(3,), dtype=int64, chunksize=(3,), chunktype=numpy.ndarray>
Coordinates:
* x (x) int64 4 5 6
Minimal Complete Verifiable Example
No response
Relevant log output
No response
Anything else we need to know?
No response
Environment
INSTALLED VERSIONS
commit: None
python: 3.7.10 | packaged by conda-forge | (default, Feb 19 2021, 15:59:12) [Clang 11.0.1 ]
python-bits: 64
OS: Darwin
OS-release: 21.2.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.10.5
libnetcdf: 4.6.3
xarray: 0.20.1
pandas: 1.3.5
numpy: 1.19.4
scipy: 1.5.4
netCDF4: 1.5.5
pydap: None
h5netcdf: 0.8.1
h5py: 2.10.0
Nio: None
zarr: 2.7.0
cftime: 1.2.1
nc_time_axis: 1.2.0
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 2.22.0
distributed: None
matplotlib: 3.2.2
cartopy: 0.19.0.post1
seaborn: None
numbagg: None
fsspec: 2021.06.0
cupy: None
pint: 0.15
sparse: None
setuptools: 49.6.0.post20210108
pip: 20.2.4
conda: 4.10.1
pytest: 6.0.1
IPython: 7.27.0
sphinx: 3.2.1
I've run into this before. The underlying variable object is an IndexVariable, which has a dummy chunk method: https://github.com/pydata/xarray/blob/95bb9ae4233c16639682a532c14b26a3ea2728f3/xarray/core/variable.py#L2707-L2709
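For illustration, the effect of that dummy method can be modeled with toy classes in plain Python (this is not xarray's actual implementation): the subclass replaces the inherited chunking behavior with a plain copy, so the caller silently gets unchunked data back and no error is raised.

```python
# Toy model of the linked pattern -- plain Python, not xarray's real classes.
# Variable.chunk() converts the data to a "chunked" representation; the
# IndexVariable override is a dummy that returns an unchunked copy instead.
class Variable:
    def __init__(self, data):
        self.data = data

    def chunk(self):
        # Stand-in for wrapping self.data in a dask array.
        return Variable(("chunked", self.data))


class IndexVariable(Variable):
    def chunk(self):
        # Dummy override: silently returns an in-memory copy.
        return IndexVariable(self.data)


v = Variable([4, 5, 6]).chunk()
iv = IndexVariable([4, 5, 6]).chunk()
print(v.data)   # ('chunked', [4, 5, 6])
print(iv.data)  # [4, 5, 6] -- still unchunked, no warning or error
```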
I encountered a similar issue when I tried to create a new array from a set of index coordinates, e.g.:
# ds.z, ds.y, ds.x are fully loaded into memory, and so is ds.coords.r.
ds.coords['r'] = np.sqrt((ds.z - z0)**2 + (ds.y - y0)**2 + (ds.x - x0)**2)
I'm currently circumventing the problem by using _chunk_like introduced in https://github.com/Ouranosinc/xclim/pull/1542.
x, y, z = _chunk_like(ds.x, ds.y, ds.z, chunks=ds.chunksizes)
ds.coords['r'] = np.sqrt((z - z0)**2 + (y - y0)**2 + (x - x0)**2)
# Now, ds.coords.r carries a dask array!
As @dcherian noted, the underlying cause of the issue seems to be that an IndexVariable always has to be fully loaded into memory.
@sanghyukmoon: while I do not understand why the IndexVariable always has to be fully loaded into memory, I can accept that. But is it also necessary that it is never chunked?
Having it both fully loaded and chunked could easily be achieved, e.g. using dask.array.Array.persist(scheduler="synchronous").
Indeed, I was very surprised when xr.apply_ufunc(foo, coord1, coord2, ...) did not broadcast the coordinates consistently with xr.apply_ufunc(foo, coord1, coord2, dummyvar, ...) - where the coordinates suddenly appear in chunks corresponding to the chunk sizes of dummyvar.
In particular, the fact that coord1.chunk() returns silently without chunking violates both the Liskov Substitution Principle (LSP) (it removes inherited behavior instead of extending it) and the Principle of Least Surprise (it does not chunk even though the function name suggests it).
If, in addition to always having to be fully loaded, it is also necessary that it is never chunked, I'd suggest:
Move the chunk() functionality into a private method _chunk() that is maintained just as chunk() currently is. Call this from both Variable.chunk() and Dataset.chunk() - this avoids handling special cases there.
For the IndexVariable there are then various options to deal with the case where the user explicitly calls chunk() on an IndexVariable (all of them still break the LSP, since self._replace is omitted, but the surprise is reduced):
a: Log the refusal and return the result of the modified _chunk().
b: Raise an exception.
c: Remove the method entirely.
d: Return a chunked copy or view as a regular Variable - essentially omitting the call to self._replace(...).
Another alternative would be to provide a side-effect-free alternative to chunk(), e.g. chunked(), which returns the chunked copy from (d) consistently for all subtypes of Variable. Then deprecate the state-changing Variable.chunk() and promote only the Dataset.chunk() method as state-changing (i.e. calling a private _chunk()). This would allow compliance with the LSP.
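As a toy sketch of that proposal (plain Python, not xarray code; the class and method names mirror the discussion above but are illustrative only): shared logic lives in a private _chunk(), chunked() is side-effect-free for every subtype, and the IndexVariable variant returns its chunked copy as a plain Variable, i.e. option (d):

```python
# Toy sketch of the proposed split -- not xarray's actual implementation.
class Variable:
    def __init__(self, data, is_chunked=False):
        self.data = data
        self.is_chunked = is_chunked

    def _chunk(self):
        # Shared chunking logic; Dataset.chunk() would call this internally.
        return type(self)(self.data, is_chunked=True)

    def chunked(self):
        # Side-effect-free: returns a chunked copy, never mutates self.
        return self._chunk()


class IndexVariable(Variable):
    def chunked(self):
        # Option d: hand the chunked copy back as a plain Variable,
        # so the index itself stays fully loaded in memory.
        return Variable(self.data, is_chunked=True)


iv = IndexVariable([4, 5, 6])
c = iv.chunked()
print(type(c).__name__, c.is_chunked)  # Variable True
print(iv.is_chunked)                   # False -- original index untouched
```

Because chunked() returns a chunked copy for every subtype, substituting an IndexVariable where a Variable is expected no longer silently changes behavior.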