xarray
xarray copied to clipboard
Allow setting (or skipping) new indexes in open_dataset
- [x] Closes #6633
- [ ] Tests added
- [ ] User visible changes (including notable bug fixes) are documented in
whats-new.rst - [ ] New functions/methods are listed in
api.rst
This PR introduces a new boolean parameter set_indexes=True to xr.open_dataset(), which may be used to skip the creation of default (pandas) indexes when opening a dataset.
Currently works with the Zarr backend:
import numpy as np
import xarray as xr
# example dataset (real dataset may be much larger)
arr = np.random.random(size=1_000_000)
xr.Dataset({"x": arr}).to_zarr("dataset.zarr")
xr.open_dataset("dataset.zarr", set_indexes=False, engine="zarr")
# <xarray.Dataset>
# Dimensions: (x: 1000000)
# Coordinates:
# x (x) float64 ...
# Data variables:
# *empty*
xr.open_zarr("dataset.zarr", set_indexes=False)
# <xarray.Dataset>
# Dimensions: (x: 1000000)
# Coordinates:
# x (x) float64 ...
# Data variables:
# *empty*
I'll add it to the other Xarray backends as well, but I'd like to get your thoughts about the API first.
- Do we want to add yet another keyword parameter to
xr.open_dataset()? There are already many... - Do we want to add this parameter to the
BackendEntrypoint.open_dataset()API?
- I'm afraid we must do it if we want this parameter in
xr.open_dataset() - this would also make it possible skipping the creation of custom indexes (if any) in custom IO backends
- con: if we require
set_indexesin the signature in addition to thedrop_variablesparameter, this is a breaking change for all existing 3rd-party backends. Or should we groupset_indexeswith the other xarray decoder kwargs? This would feel a bit odd to me as setting indexes is different from decoding data.
- Or should we leave this up to the backends?
- pros: no breaking change, more flexible (3rd party backends may want to offer more control like choosing between custom indexes and default pandas indexes or skipping the creation of indexes by default)
- cons: less discoverable, consistency is not enforced across 3rd party backends (although for such advanced case this is probably OK), not available by default in every backend.
Currently 1 and 2 are implemented in this PR, although as I write this comment I think that I would prefer 3. I guess this depends on whether we prefer open_*** vs. xr.open_dataset(engine="***") and unless I missed something there is still no real consensus about that? (e.g., #7496).
@pydata/xarray any thoughts on which option among those above (top comment) would be best?
- Do we want to add yet another keyword parameter to
xr.open_dataset()?
I vote for this.
Is there any way for packages to say they support V1 or V2 of an entrypoint?
I just realized that if we turned off indexes by default, it would be a big win for the open_mfdataset case in many cases.
- con: if we require
set_indexesin the signature in addition to thedrop_variablesparameter, this is a breaking change for all existing 3rd-party backends
Doesn't help with the problem at this moment, but could we add having **kwargs to the standard, so we can add parameters in future without breaking existing backends?
Agreed, adding **kwargs to the standard would help! However, to be honest I find it already a bit confusing how kwargs are handled in xarray.backends.api. #8447 may eventually help having a clearer separation between common and backend-specific options.
This looks great to me!
I agree with adding this into xarray.open_dataset() and BackendEntrypoint.open_dataset().
For what it's worth, I think it's OK to require backend developers to update their code more frequently -- we don't need the same level of stabily that we need for user level APIs.
I just realized that if we turned off indexes by default, it would be a big win for the
open_mfdatasetcase in many cases.
I would love to see this merged so that I can try this out!
be aware that merging now will break compatibility with any 3rd party backend, which I believe is not something we should do, even if we think that the transition window can be shorter than usual.
I my eyes the easiest way forward would be:
- extend
Backend.open_datasetwith**kwargsas suggested by @max-sixty - wait a couple of versions (2?) to give backends the time to update and release
- add the new option
We don't have an easy way to contact all backend developers, unfortunately.
Edit: let's discuss in the meeting today
In the meeting just now we decided to inspect the signature of the backend's open_dataset method and not pass the new option if it doesn't support it nor accepts **kwargs.
We should still change the spec to require **kwargs (in a new PR), and maybe emit a deprecation warning for all backends that don't have it already.