xarray icon indicating copy to clipboard operation
xarray copied to clipboard

Allow setting (or skipping) new indexes in open_dataset

Open benbovy opened this issue 2 years ago • 9 comments

  • [x] Closes #6633
  • [ ] Tests added
  • [ ] User visible changes (including notable bug fixes) are documented in whats-new.rst
  • [ ] New functions/methods are listed in api.rst

This PR introduces a new boolean parameter set_indexes=True to xr.open_dataset(), which may be used to skip the creation of default (pandas) indexes when opening a dataset.

Currently works with the Zarr backend:

import numpy as np
import xarray as xr

# example dataset (real dataset may be much larger)
arr = np.random.random(size=1_000_000)
xr.Dataset({"x": arr}).to_zarr("dataset.zarr")


xr.open_dataset("dataset.zarr", set_indexes=False, engine="zarr")
# <xarray.Dataset>
# Dimensions:  (x: 1000000)
# Coordinates:
#     x        (x) float64 ...
# Data variables:
#     *empty*


xr.open_zarr("dataset.zarr", set_indexes=False)
# <xarray.Dataset>
# Dimensions:  (x: 1000000)
# Coordinates:
#     x        (x) float64 ...
# Data variables:
#     *empty*

I'll add it to the other Xarray backends as well, but I'd like to get your thoughts about the API first.

  1. Do we want to add yet another keyword parameter to xr.open_dataset()? There are already many...
  2. Do we want to add this parameter to the BackendEntrypoint.open_dataset() API?
  • I'm afraid we must do it if we want this parameter in xr.open_dataset()
  • this would also make it possible skipping the creation of custom indexes (if any) in custom IO backends
  • con: if we require set_indexes in the signature in addition to the drop_variables parameter, this is a breaking change for all existing 3rd-party backends. Or should we group set_indexes with the other xarray decoder kwargs? This would feel a bit odd to me as setting indexes is different from decoding data.
  1. Or should we leave this up to the backends?
  • pros: no breaking change, more flexible (3rd party backends may want to offer more control like choosing between custom indexes and default pandas indexes or skipping the creation of indexes by default)
  • cons: less discoverable, consistency is not enforced across 3rd party backends (although for such advanced case this is probably OK), not available by default in every backend.

Currently 1 and 2 are implemented in this PR, although as I write this comment I think that I would prefer 3. I guess this depends on whether we prefer open_*** vs. xr.open_dataset(engine="***") and unless I missed something there is still no real consensus about that? (e.g., #7496).

benbovy avatar Aug 07 '23 10:08 benbovy

@pydata/xarray any thoughts on which option among those above (top comment) would be best?

benbovy avatar Aug 22 '23 14:08 benbovy

  1. Do we want to add yet another keyword parameter to xr.open_dataset()?

I vote for this.

rabernat avatar Aug 23 '23 14:08 rabernat

Is there any way for packages to say they support V1 or V2 of an entrypoint?

I just realized that if we turned off indexes by default, it would be a big win for the open_mfdataset case in many cases.

dcherian avatar Nov 13 '23 20:11 dcherian

  • con: if we require set_indexes in the signature in addition to the drop_variables parameter, this is a breaking change for all existing 3rd-party backends

Doesn't help with the problem at this moment, but could we add having **kwargs to the standard, so we can add parameters in future without breaking existing backends?

max-sixty avatar Nov 13 '23 20:11 max-sixty

Agreed, adding **kwargs to the standard would help! However, to be honest I find it already a bit confusing how kwargs are handled in xarray.backends.api. #8447 may eventually help having a clearer separation between common and backend-specific options.

benbovy avatar Nov 14 '23 12:11 benbovy

This looks great to me!

I agree with adding this into xarray.open_dataset() and BackendEntrypoint.open_dataset().

For what it's worth, I think it's OK to require backend developers to update their code more frequently -- we don't need the same level of stabily that we need for user level APIs.

shoyer avatar Nov 22 '23 16:11 shoyer

I just realized that if we turned off indexes by default, it would be a big win for the open_mfdataset case in many cases.

I would love to see this merged so that I can try this out!

TomNicholas avatar Dec 05 '23 20:12 TomNicholas

be aware that merging now will break compatibility with any 3rd party backend, which I believe is not something we should do, even if we think that the transition window can be shorter than usual.

I my eyes the easiest way forward would be:

  1. extend Backend.open_dataset with **kwargs as suggested by @max-sixty
  2. wait a couple of versions (2?) to give backends the time to update and release
  3. add the new option

We don't have an easy way to contact all backend developers, unfortunately.

Edit: let's discuss in the meeting today

keewis avatar Dec 05 '23 20:12 keewis

In the meeting just now we decided to inspect the signature of the backend's open_dataset method and not pass the new option if it doesn't support it nor accepts **kwargs.

We should still change the spec to require **kwargs (in a new PR), and maybe emit a deprecation warning for all backends that don't have it already.

keewis avatar Dec 06 '23 17:12 keewis