
Explicit indexes: next steps

Open benbovy opened this issue 2 years ago • 2 comments

#5692 is now merged and we can start thinking about the next steps. I’m opening this issue to list and track the remaining tasks. @pydata/xarray, do not hesitate to add a comment below if you think of something that is missing here.

Continue the refactoring of the internals

Although in #5692 everything seems to work with the current pandas index wrappers for dimension coordinates, not all of Xarray's internals have been refactored yet to fully support (or at least be compatible with) custom indexes. Here is a list of Dataset / DataArray methods that still need to be checked / updated (this list may be incomplete):

  • [ ] broadcast (#6430, #6481)
  • [ ] drop_sel
  • [ ] drop_isel
  • [ ] drop_dims
  • [ ] transpose
  • [ ] interpolate_na
  • [ ] ffill
  • [ ] bfill
  • [ ] reduce
  • [ ] map
  • [ ] apply
  • [ ] quantile
  • [ ] rank
  • [ ] integrate
  • [ ] cumulative_integrate
  • [ ] filter_by_attrs
  • [ ] idxmin
  • [ ] idxmax
  • [ ] argmin
  • [ ] argmax
  • [ ] concat (partially refactored, may not fully work with multi-dimension indexes)

I ended up following a common pattern in #5692 when adding explicit / flexible index support for various features. The pattern is quite generic: the actual procedure may vary from one case to another, and many steps may be skipped.

  • Check if it’s worth adding a new method to the Xarray Index base class. There may be several motivations:
    • Avoid handling Pandas index objects inside Dataset or DataArray methods (even if we don’t plan to fully support custom indexes for everything, it is preferable to put this logic behind the PandasIndex or PandasMultiIndex wrapper classes for clarity and also if eventually we want to make Xarray less dependent on Pandas)
    • We want a specific implementation rather than relying on the Variable’s corresponding method for speed-up or for other reasons, e.g.,
      • IndexVariable.concat exists to avoid unnecessary Pandas/Numpy conversions; in #5692 PandasIndex.concat has the same logic and will fully replace the former if/once we get rid of IndexVariable
      • PandasIndex.roll reuses pandas.Index indexing and append capabilities
  • Index API closely follows DataArray, Dataset and Variable API (i.e., same method names) for consistency
  • Within the Dataset or DataArray method, first call the Index API (if it exists) to create new indexes
    • The Indexes class (i.e., the .xindexes property returns an instance of this class) provides a convenient API for iterating through indexes (e.g., get a list of unique indexes, get all coordinates or dimensions for a given index, etc.)
    • If there’s no implementation for the called Index API, either raise an error or fall back to calling the Variable API (below) depending on the case
  • Create new coordinate variables for each of the new indexes using Index.create_variables
    • It is possible to pass a dict of current coordinate variables to Index.create_variables; it is used to propagate variable metadata (dtype, attrs and encoding)
    • Not all indexes should create new coordinate variables, only those for which it is possible to reuse index data as coordinate variable data (like Pandas indexes)
  • Iterate through the variables and call the Variable API (if it exists)
    • Skip the new coordinate variables created at the previous step (just reuse them)
  • Propagate the indexes that are not affected by the operation and clean up all indexes, i.e., ensure consistency between indexes and coordinate variables
    • A couple of convenient methods were added in #5692 for that purpose: filter_indexes_from_coords and assert_no_index_corrupted
  • Replace indexes and variables, e.g., using _replace, _replace_with_new_dims or _overwrite_indexes methods
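The steps above can be sketched with a toy model. This is not Xarray code: ToyVariable, ToyIndex and roll_dataset are hypothetical stand-ins written only to illustrate the order of operations (index API first, then create_variables, then the variable API, then propagation and a consistency check), and it assumes that each index is keyed by the name of its dimension coordinate.

```python
from dataclasses import dataclass, field


def _rotate(values, k):
    """Rotate a list to the right by k positions."""
    if not values:
        return []
    k %= len(values)
    return values[-k:] + values[:-k] if k else list(values)


@dataclass
class ToyVariable:
    """Hypothetical stand-in for a 1-D Variable."""
    dim: str
    data: list
    attrs: dict = field(default_factory=dict)

    def roll(self, shifts):
        shift = shifts.get(self.dim, 0)
        return ToyVariable(self.dim, _rotate(self.data, shift), dict(self.attrs))


@dataclass
class ToyIndex:
    """Hypothetical stand-in for an Index over one dimension."""
    dim: str
    labels: list

    def roll(self, shifts):
        return ToyIndex(self.dim, _rotate(self.labels, shifts.get(self.dim, 0)))

    def create_variables(self, variables=None):
        # Reuse the index data as coordinate data and propagate metadata
        # (only attrs in this toy model) from the current variable, if any.
        current = (variables or {}).get(self.dim)
        attrs = dict(current.attrs) if current is not None else {}
        return {self.dim: ToyVariable(self.dim, list(self.labels), attrs)}


def roll_dataset(variables, indexes, shifts):
    new_indexes, new_variables = {}, {}
    for name, index in indexes.items():
        if index.dim in shifts:
            # 1. Call the index API first to create the new index.
            new_indexes[name] = index.roll(shifts)
            # 2. Create the coordinate variables from the new index.
            new_variables.update(new_indexes[name].create_variables(variables))
        else:
            # 4. Propagate indexes that are not affected by the operation.
            new_indexes[name] = index
    for name, var in variables.items():
        # 3. Call the variable API, skipping variables created in step 2.
        if name not in new_variables:
            new_variables[name] = var.roll(shifts)
    for name, index in new_indexes.items():
        # 5. Ensure indexes and coordinate variables are still consistent.
        assert new_variables[name].data == index.labels, f"index {name!r} corrupted"
    # 6. The caller would now replace its variables and indexes with these.
    return new_variables, new_indexes


variables = {
    "x": ToyVariable("x", [10, 20, 30], {"units": "m"}),
    "temp": ToyVariable("x", [1.0, 2.0, 3.0]),
}
indexes = {"x": ToyIndex("x", [10, 20, 30])}
rolled_vars, rolled_indexes = roll_dataset(variables, indexes, {"x": 1})
```

In this sketch the coordinate "x" is never rolled through the variable API: its data comes from the rolled index, and its attrs are carried over by create_variables.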

Relax all constraints related to “dimension (index) coordinates” in Xarray

  • [ ] Allow multi-dimensional variables with the name matching one of its dimensions: #2233 #2405 (https://github.com/pydata/xarray/pull/2405#issuecomment-419969570)

Indexes repr

  • [ ] Add an Indexes section to Dataset and DataArray reprs
    • #6795
  • [ ] Make the repr of Indexes (i.e., .xindexes property) consistent with the repr of Coordinates (.coords property)
  • [ ] Add Index._repr_inline_ for tweaking the inline representation of each index shown in the reprs above
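As a rough illustration of the last item, here is a toy inline-repr hook. LabelIndex and indexes_repr are hypothetical names, and the max_width argument is an assumption modeled on the inline-repr hook Xarray already uses for duck arrays; the actual signature and layout are still to be decided.

```python
class LabelIndex:
    """Hypothetical index holding a list of labels."""

    def __init__(self, labels):
        self.labels = list(labels)

    def _repr_inline_(self, max_width):
        # One-line summary, truncated so that it fits in max_width characters.
        text = f"{type(self).__name__}({', '.join(map(str, self.labels))})"
        if len(text) > max_width:
            text = text[: max_width - 3] + "..."
        return text


def indexes_repr(indexes, max_width=40):
    """Build an 'Indexes' section with one line per index (toy layout)."""
    col = max((len(name) for name in indexes), default=0)
    lines = ["Indexes:"]
    for name, index in indexes.items():
        prefix = f"    {name:<{col}}  "
        lines.append(prefix + index._repr_inline_(max_width - len(prefix)))
    return "\n".join(lines)


section = indexes_repr({"x": LabelIndex([1, 2])})
```

Each index decides how to summarize itself, while the section builder only handles alignment and the width budget.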

Public API for assigning and (re)setting indexes

There is no public API yet for creating and/or assigning existing indexes to Dataset and DataArray objects.

  • [ ] Enable and/or document the indexes parameter in Dataset and DataArray constructors
    • [ ] Deprecate the implicit creation of pandas multi-index wrappers (and their corresponding coordinates) from anything passed via the data, data_vars or coords arguments in favor of a more explicit way to pass it.
    • [ ] https://github.com/pydata/xarray/issues/6633 (pass empty dictionary)
  • [ ] Add set_xindex and drop_indexes methods
    • #6849
    • #6971
    • Deprecate set_index and reset_index? See https://github.com/pydata/xarray/issues/4366#issuecomment-920458966

We still need to figure out how best we can (1) assign existing indexes (possibly with their coordinates) and (2) pass index build options.

Other public API for index-based operations

To fully leverage the power and flexibility of custom indexes, we might want to update some parts of Xarray’s public API in order to allow passing arbitrary options per index. For example:

  • [ ] sel: the current method and tolerance arguments may not be relevant for all indexes, and some indexes may need other options (e.g., extra arguments passed to SciPy's cKDTree.query). #7099
  • [ ] align: #2217
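One possible shape for per-index options in sel is sketched below. Everything here is hypothetical: OrderedIndex, PointCloudIndex, the module-level sel function and its options argument are illustrations, not an existing or proposed Xarray API. The point cloud lookup is brute force and only borrows the distance_upper_bound name from cKDTree.query.

```python
import math


class OrderedIndex:
    """Hypothetical 1-D index that understands 'method' and 'tolerance'."""

    def __init__(self, labels):
        self.labels = list(labels)

    def sel(self, label, method=None, tolerance=None):
        if method is None:
            return self.labels.index(label)  # exact match only
        if method != "nearest":
            raise ValueError(f"unsupported method: {method!r}")
        pos = min(range(len(self.labels)), key=lambda i: abs(self.labels[i] - label))
        if tolerance is not None and abs(self.labels[pos] - label) > tolerance:
            raise KeyError(label)
        return pos


class PointCloudIndex:
    """Hypothetical index over points, with a KD-tree-like query option."""

    def __init__(self, points):
        self.points = list(points)

    def sel(self, label, distance_upper_bound=math.inf):
        distances = [math.dist(point, label) for point in self.points]
        pos = min(range(len(distances)), key=distances.__getitem__)
        if distances[pos] > distance_upper_bound:
            raise KeyError(label)
        return pos


def sel(indexes, indexers, options=None):
    """Dispatch each label to its index, forwarding that index's own options."""
    options = options or {}
    return {
        name: indexes[name].sel(label, **options.get(name, {}))
        for name, label in indexers.items()
    }


indexes = {
    "time": OrderedIndex([0, 10, 20]),
    "xy": PointCloudIndex([(0, 0), (1, 1), (5, 5)]),
}
positions = sel(
    indexes,
    {"time": 11, "xy": (0.9, 1.2)},
    options={
        "time": {"method": "nearest", "tolerance": 2},
        "xy": {"distance_upper_bound": 1.0},
    },
)
```

Keying the options by coordinate name keeps method and tolerance away from indexes that do not understand them, at the cost of a more verbose call.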

Also:

  • [ ] Make public the Indexes API as it provides convenient methods that might be useful for end-users
  • [ ] Import the Index base class into Xarray’s main namespace (i.e., xr.Index)? Also PandasIndex and PandasMultiIndex? The latter may be useful if we deprecate set_index(append=True) and/or if we deprecate “unpacking” pandas.MultiIndex objects to coordinates when given as coords in the Dataset / DataArray constructors.
    • [ ] Add references in docstrings (https://github.com/pydata/xarray/pull/5692#discussion_r820117354).

Documentation

  • [ ] User guide:
    • [ ] Update the “Terminology” section: “Index” may include custom indexes, review “Dimension coordinate” / “Non-dimension coordinate” as “Indexed coordinate” / “Non-indexed coordinate”
    • [ ] Update the “Data structure” section such that it clearly mentions indexes as first-class citizens of the Xarray data model
    • [ ] Maybe update other parts of the documentation that refer to the concept of “dimension coordinate”
  • [ ] API reference:
    • [ ] add Indexes API
    • [ ] add Index API: #6975
  • [ ] Xarray internals: add a subsection on how to add custom indexes, maybe with some basic examples: #6975
  • [ ] Update development roadmap section

Index types and helper classes built in Xarray

  • [ ] Since a lot of potential use-cases for custom indexes may consist of adding some extra logic on top of one or more pandas indexes along one or more dimensions (i.e., “meta-indexes”), it might be worth providing a helper Index abstract subclass that would basically dispatch the given arguments to the corresponding, encapsulated PandasIndex instances and then merge the results
  • [ ] Deprecate PandasMultiIndex dimension coordinate?
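The “meta-index” helper mentioned in the first item could look roughly like the toy below. ToyLabelIndex stands in for an encapsulated PandasIndex and ToyMetaIndex for the helper class; both names and the sel signature are hypothetical, and real results would be more than plain integer positions.

```python
class ToyLabelIndex:
    """Hypothetical stand-in for an encapsulated 1-D pandas index wrapper."""

    def __init__(self, dim, labels):
        self.dim = dim
        self._positions = {label: i for i, label in enumerate(labels)}

    def sel(self, label):
        # Return the integer position of the label along this index's dimension.
        return {self.dim: self._positions[label]}


class ToyMetaIndex:
    """Hypothetical helper: dispatch to encapsulated indexes, then merge."""

    def __init__(self, indexes):
        self.indexes = dict(indexes)  # coordinate name -> encapsulated index

    def sel(self, labels):
        merged = {}
        for name, label in labels.items():
            if name not in self.indexes:
                raise KeyError(f"no encapsulated index for coordinate {name!r}")
            for dim, pos in self.indexes[name].sel(label).items():
                # Two encapsulated indexes must not disagree on one dimension.
                if merged.setdefault(dim, pos) != pos:
                    raise ValueError(f"conflicting positions along {dim!r}")
        return merged


meta = ToyMetaIndex(
    {
        "lat": ToyLabelIndex("y", [10.0, 20.0, 30.0]),
        "lon": ToyLabelIndex("x", [100.0, 110.0]),
    }
)
result = meta.sel({"lat": 20.0, "lon": 110.0})
```

A concrete subclass would add its extra logic before or after the dispatch loop, for example transforming the labels first, and leave the per-dimension lookups to the encapsulated indexes.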

3rd party indexes

  • [ ] Add custom index entrypoint / plugin system, similarly to storage backend entrypoints
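A plugin system of that kind could rely on the standard library's entry point discovery, as in the minimal sketch below. The group name "xarray.indexes" and the function discover_index_entrypoints are assumptions made for illustration; no such group is defined in this issue.

```python
from importlib.metadata import entry_points


def discover_index_entrypoints(group="xarray.indexes"):
    """Return {name: entry point} for every plugin registered under group.

    The group name is hypothetical. Entry points are returned unloaded, so a
    third-party package is only imported when its index is requested.
    """
    found = entry_points()
    if hasattr(found, "select"):
        selected = found.select(group=group)  # Python 3.10 and later
    else:
        selected = found.get(group, [])  # older dict-based return value
    return {ep.name: ep for ep in selected}


available = discover_index_entrypoints()
# A registered index class would then be obtained with available[name].load().
```

A third-party package would register its index class under the agreed group in its packaging metadata, and Xarray would look it up by name on demand.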

benbovy, Feb 23 '22 12:02

Following thoughts and discussions in various issues (e.g., #6836), I'd like to add another section to the ones in the top comment:

Deprecate pandas.MultiIndex special cases in Xarray

  • remove the multi-index “dimension” coordinate (tuple elements)
  • do not automatically promote pandas.MultiIndex objects to dimension + level coordinates, e.g., as in xr.Dataset(coords={"x": pd_midx}), but instead treat them as a single duck-array.
  • do not accept pandas.MultiIndex as dim argument in xarray.concat() (#7148)
  • remove Dataset.to_index()

They are a source of many problems and complexities in Xarray internals (many regressions reported since the index refactor were related to those special cases) and I'm not sure that the value they add is really worth the trouble. Also, in the long term the special treatment of PandasMultiIndex vs. other Xarray multi-indexes may add some confusion.

Some of those features are widely used (e.g., the creation of Dataset / DataArray from pandas multi-indexes is used in many places in unit tests), so we would need convenient alternatives and a smooth transition.

benbovy, Sep 27 '22 09:09

Yes yes -- the sooner we can get rid of MultiIndex special cases the better!

shoyer, Sep 28 '22 00:09