xarray
Confusing terminologies and some errors in the official documentation
What happened?
Note that I'm using the stable version (2022.6.0).
First, I'm confused that both "dimension coordinate"/"non-dimension coordinate" and "index coordinate"/"non-index coordinate" appear in the documentation (search to see), although they seem to refer to the same thing.
Second, I found that there are some errors in the documentation:
- It says that "The index associated with dimension name x can be retrieved by arr.indexes[x]. By construction, len(arr.dims) == len(arr.indexes)", which is inconsistent with the actual behavior. See the example code below:

      In [0]: import xarray as xr, numpy as np

      In [1]: arr = xr.DataArray(np.zeros((2, 3)), dims=['x', 'y'], coords={'x': ['a', 'b']})

      In [2]: assert len(arr.dims) == len(arr.indexes), f"{len(arr.dims)=}, {len(arr.indexes)=}"
      ---------------------------------------------------------------------------
      AssertionError                            Traceback (most recent call last)
      <ipython-input-202-f217d18e6979> in <module>
      ----> 1 assert len(arr.dims) == len(arr.indexes), f"{len(arr.dims)=}, {len(arr.indexes)=}"

      AssertionError: len(arr.dims)=2, len(arr.indexes)=1

      In [3]: arr.indexes
      Out[3]:
      Indexes:
      x: Index(['a', 'b'], dtype='object', name='x')

  It seems that arr.indexes only returns the indexes of dimensions that have coordinates. However, it's possible to get the index of dimension y through get_index():

      In [4]: arr.get_index('y')
      Out[4]: RangeIndex(start=0, stop=3, step=1, name='y')

- It says that (see link):

  For convenience multi-index levels are directly accessible as "virtual" or "derived" coordinates (marked by - when printing a dataset or data array):

      In [77]: mda["band"]
      Out[77]:
      <xarray.DataArray 'band' (spec: 4)>
      array(['R', 'R', 'V', 'V'], dtype=object)
      Coordinates:
        * spec     (spec) object MultiIndex
        * band     (spec) object 'R' 'R' 'V' 'V'
        * wn       (spec) float64 0.1 0.2 0.7 0.9

      In [78]: mda.wn
      Out[78]:
      <xarray.DataArray 'wn' (spec: 4)>
      array([0.1, 0.2, 0.7, 0.9])
      Coordinates:
        * spec     (spec) object MultiIndex
        * band     (spec) object 'R' 'R' 'V' 'V'
        * wn       (spec) float64 0.1 0.2 0.7 0.9

  As you can see, even in the example code given by the official documentation, all the "virtual" coordinates are marked with * instead of -, which in my experience is a little confusing when handling multi-index coordinates.
Have I missed something? Thanks in advance for your reply.
Environment
INSTALLED VERSIONS
commit: None
python: 3.8.10 (default, Sep 28 2021, 16:10:42) [GCC 9.3.0]
python-bits: 64
OS: Linux
OS-release: 5.10.102.1-microsoft-standard-WSL2
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: C.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: None
libnetcdf: None

xarray: 2022.6.0
pandas: 1.4.3
numpy: 1.23.1
scipy: 1.3.3
netCDF4: None
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: None
cftime: None
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: None
distributed: None
matplotlib: 3.1.2
cartopy: None
seaborn: None
numbagg: None
fsspec: None
cupy: None
pint: None
sparse: None
flox: None
numpy_groupies: None
setuptools: 45.2.0
pip: 22.2.1
conda: None
pytest: None
IPython: 7.13.0
sphinx: None
Hi @v-liuwei, thanks for the report.
The issues that you are pointing out are part of #6293. There have been many internal changes (plus some subtle public-facing changes) regarding indexes in the last release, but there is still some work to do to reflect them in the documentation.
> First, I'm confused that both dimension coordinate/non-dimension coordinate and index coordinate/non-index coordinate appear in the documentation (search to see), but they seem to be the same thing.

I agree, this has always been a source of confusion IMO. Xarray's data model has been updated in the last release such that these two concepts are now different and independent (i.e., it allows a non-dimension coordinate to have an index).
> It seems that arr.indexes only returns indexes of dimensions that have coordinates. However, it's possible to get the index of dimension y through get_index()

get_index() creates a pandas index on the fly if one doesn't exist (and if that's possible). I'm wondering whether or not we should eventually deprecate it. I might be missing important use cases, though.
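The difference between the two access paths can be checked with a short sketch. It assumes an xarray release in which DataArray.get_index() is still available; the variable names indexed_names and y_index are illustrative and not part of xarray.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Same array as in the report: 'x' has a coordinate, 'y' does not.
arr = xr.DataArray(np.zeros((2, 3)), dims=["x", "y"], coords={"x": ["a", "b"]})

# arr.indexes is a mapping keyed by the names of indexed coordinates only.
indexed_names = list(arr.indexes)

# get_index() falls back to a default integer index built on the fly
# when the dimension has no index of its own.
y_index = arr.get_index("y")

print(indexed_names)
print(y_index)
```

Calling get_index("y") does not attach the generated index to the array, so "y" should still be absent from arr.indexes afterwards.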
> As you can see, even in the given example code offered by the official documentation, all the "virtual" coordinates are marked as * instead of -, which is a little bit confusing when handling multi-index coordinates in my experience.

This is because multi-index levels now each have their own, real coordinate (the documentation is not yet up to date). However, I agree that using the same symbol for multi-coordinate indexes may not be ideal, as it is hard to tell which coordinate is associated with which index. On the other hand, using two different symbols wouldn't be an elegant solution either if we later deprecate the multi-index dimension coordinate (i.e., spec in your example). Maybe this issue could be addressed in the indexes repr section to be added (#6795).
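As a rough illustration of level coordinates, the sketch below rebuilds an array similar to mda from the documentation example. It assumes that passing a pandas.MultiIndex directly in coords is still accepted (newer xarray releases may emit a FutureWarning for this); the data values and the names band_values and r_only are made up for the example.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Two-level index mirroring the 'band'/'wn' levels from the docs example.
midx = pd.MultiIndex.from_arrays(
    [["R", "R", "V", "V"], [0.1, 0.2, 0.7, 0.9]],
    names=("band", "wn"),
)
# Assumption: implicit promotion of a MultiIndex passed as a coordinate.
mda = xr.DataArray(np.arange(4.0), coords={"spec": midx}, dims="spec")

# Each level can be accessed by name like any other coordinate.
band_values = mda["band"].values.tolist()

# Levels can also be used for label-based selection.
r_only = mda.sel(band="R")

print(band_values)
print(r_only.size)
```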
Thanks for your explanations.
You said that "it allows a non-dimension coordinate to have an index", which confuses me even more. I want to confirm: should we always use the index coordinates to index a DataArray/Dataset in a label-based fashion (or is that the only way possible)?
Yes, performing selection using coordinate labels (i.e., .sel()) is only possible for coordinates that have an index. It has always been the case and it will always be.
Before v2022.6.0, only 1-dimensional coordinates whose name matched the dimension name could have a pandas index or multi-index. Hence the distinction between a "dimension coordinate", which most often implicitly wrapped a pandas index, and a "non-dimension coordinate", for which label-based selection was impossible.
Starting from v2022.6.0, this constraint is relaxed. Although it is not yet fully operational, any coordinate or any group of coordinates (with arbitrary dimensions) may now have an index (either pandas-based or any xarray-compatible custom index) and may therefore be used for label-based selection (if the index supports it).
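The rule that .sel() needs an index can be illustrated with a minimal sketch. Here label is a hypothetical non-dimension coordinate that gets no index by default; the exception type raised for it has varied between xarray versions, so both KeyError and ValueError are caught as an assumption.

```python
import numpy as np
import xarray as xr

arr = xr.DataArray(
    np.arange(6).reshape(2, 3),
    dims=["x", "y"],
    coords={
        "x": ["a", "b"],                  # dimension coordinate, indexed by default
        "label": ("y", ["p", "q", "r"]),  # non-dimension coordinate, no index
    },
)

# Label-based selection works on the indexed coordinate.
row = arr.sel(x="b")

# The same call on a coordinate without an index is expected to fail.
try:
    arr.sel(label="q")
    label_selection_worked = True
except (KeyError, ValueError):
    label_selection_worked = False

print(row.values)
print(label_selection_worked)
```

Positional access through .isel(y=1) remains available for the un-indexed dimension in either case.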
I'm closing this issue as the terminology section has been updated in #7368, which now clearly distinguishes between (non-)dimension coordinates and (non-)indexed coordinates. For the multi-index "virtual" coordinates in the repr, let's track it in #8071.