pint-xarray
implement the `PintMetaIndex`
As mentioned in #162, it is possible to get the indexing functions to work, although there still is no public API. I also still don't quite understand how other methods work since the refactor, so this only implements `sel`.
Usage, for anyone who wants to play around with it:

```python
import xarray as xr
import pint_xarray  # provides the shared unit registry
from pint_xarray.index import PintMetaIndex

ureg = pint_xarray.unit_registry

ds = xr.tutorial.open_dataset("air_temperature")
arr = ds.air
new_arr = xr.DataArray(
    arr.variable,
    coords={
        "lat": arr.lat.variable,
        "lon": arr.lon.variable,
        "time": arr.time.variable,
    },
    indexes={
        "lat": PintMetaIndex(arr.xindexes["lat"], {"lat": arr.lat.attrs.get("units")}),
        "lon": PintMetaIndex(arr.xindexes["lon"], {"lon": arr.lon.attrs.get("units")}),
        "time": arr.xindexes["time"],
    },
    fastpath=True,
)
new_arr.sel(
    lat=ureg.Quantity([75, 70, 65], "deg"),
    lon=ureg.Quantity([200, 202.5], "deg"),
)
```

This will fail at the moment because xarray treats dask arrays differently from duck-dask arrays, but passing single values works!
- [x] Closes #162, closes #205, closes #218
- [ ] Tests added
- [x] Passes `pre-commit run --all-files`
- [ ] User visible changes (including notable bug fixes) are documented in `whats-new.rst`
- [ ] New functions/methods are listed in `api.rst`
> I also still don't quite understand how other methods work since the refactor, so this only implements `sel`.
Here are a few comments. Happy to answer questions if any.
There are some `Index` methods like `isel`, `stack`, or `rename` that I guess do not depend on units and where you could just forward the call + args to the wrapped index.
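As a rough illustration of that forwarding pattern (a hypothetical sketch, not the actual implementation; signatures are simplified relative to xarray's real `Index` API):

```python
class PintMetaIndex:
    """Sketch: wraps another index plus a mapping of coordinate name -> unit."""

    def __init__(self, index, units):
        self.index = index
        self.units = units

    def isel(self, indexers):
        # unit-agnostic: delegate, then re-wrap the result (if any)
        new_index = self.index.isel(indexers)
        if new_index is None:
            return None
        return type(self)(new_index, self.units)

    def rename(self, name_dict, dims_dict):
        # delegate, and rename the keys of the units mapping to match
        new_index = self.index.rename(name_dict, dims_dict)
        new_units = {name_dict.get(k, k): v for k, v in self.units.items()}
        return type(self)(new_index, new_units)
```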
For some other methods like `concat`, `join`, `equals`, and `reindex`, the easiest would probably be to require that the other indexes given as arguments have the same units (and raise an error otherwise). An alternative option would be to implicitly convert the units of those indexes, but it's difficult to do that in an index-agnostic way unless we define and adopt in Xarray some protocol for that purpose.
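A hedged sketch of the "require matching units, raise otherwise" option (hypothetical code; xarray's real `equals` / `concat` signatures differ slightly):

```python
class PintMetaIndex:
    def __init__(self, index, units):
        self.index = index
        self.units = units

    def equals(self, other):
        # exact unit match required; no implicit conversion
        if self.units != other.units:
            return False
        return self.index.equals(other.index)

    @classmethod
    def concat(cls, indexes, dim, positions=None):
        first, *rest = indexes
        for other in rest:
            if other.units != first.units:
                raise ValueError(
                    f"cannot combine indexes with units {other.units} "
                    f"and {first.units}"
                )
        # delegate the actual concatenation to the wrapped index type
        new_index = type(first.index).concat(
            [idx.index for idx in indexes], dim, positions
        )
        return cls(new_index, first.units)
```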
The general approach used in the Xarray indexes refactor heavily relies on the type of the indexes (at least when we need to compare them together). That's not super flexible with the `PintMetaIndex` solution here, but I think it's reasonable enough. For example, alignment will only work when the indexes found for a common set of coordinates / dimensions are all `PintMetaIndex` objects. This might behave weirdly when the `PintMetaIndex` objects do not wrap indexes of the same type, although this would be easy to check.
I wonder whether `PintMetaIndex` should accept one unit or a map of units. The latter makes sense if, e.g., you want to reuse it to wrap multi-indexes where the corresponding coordinate variables (levels) may have different units.
Regarding `Index` methods like `from_variables` and `create_variables`, I guess `PintMetaIndex` would implement a lightweight wrapping layer to respectively get and assign a pint quantity from / to the Xarray variables.
You should also be careful when converting the units of indexed coordinates as it may get out of sync with their index. As there's no concept of "duck" index, the easiest would probably be to drop the index (and maybe reconstruct it from scratch) when the coordinates are updated.
@keewis I have been looking into this once again and now I think I better understand what you'd like to achieve with the `PintMetaIndex` wrapper. A few other suggestions:
Wrap index coordinate variables as unit-aware variables
I'm not familiar with pint, but if a `pint.Quantity` works with any duck-array without coercing it into a specific type, I think that something like this may work:
```python
class PintMetaIndex:
    def create_variables(self, variables=None):
        index_vars = self.index.create_variables(variables)
        index_vars_units = {}
        for name, var in index_vars.items():
            data = array_attach_unit(var.data, self.units[name])
            var_units = xr.Variable(var.dims, data, attrs=var.attrs, encoding=var.encoding)
            index_vars_units[name] = var_units
        return index_vars_units
```
We cannot use `IndexVariable` since (if I remember well) it coerces the data into a `pd.Index`. I think it is fine to stick with `Variable` for now; it will work in most cases except for, e.g., `xr.testing.assert_identical()`, which maybe checks for the variable type (although `check_default_indexes=False` may disable it). We'll need to change that in Xarray anyway.
Set new Pint index(es)
Since `PintMetaIndex` is meant to wrap any existing index, it could be done via the accessors while quantifying the variables. Something like this might work:
```python
@register_dataset_accessor("pint")
class PintDatasetAccessor:
    def quantify(self, units=_default, unit_registry=None, **unit_kwargs):
        ...
        ds_xindexes = self.ds.xindexes
        new_indexes, new_index_vars = ds_xindexes.copy_indexes()
        for idx, idx_vars in ds_xindexes.group_by_index():
            idx_units = {k: v for k, v in units.items() if k in idx_vars}
            new_idx = PintMetaIndex(idx, idx_units)
            new_indexes.update({k: new_idx for k in idx_vars})
            new_index_vars.update(new_idx.create_variables(idx_vars))
        new_coords = xr.Coordinates(new_index_vars, new_indexes)
        # needs https://github.com/pydata/xarray/pull/8094 to work properly
        ds_updated_temp = self.ds.assign_coords(new_coords)
        ...
```
It is still useful to implement `PintMetaIndex.from_variables()` for the common `PandasIndex` case so that it also works with `Dataset.set_xindex`:
```python
class PintMetaIndex:
    @classmethod
    def from_variables(cls, variables, options):
        index = xr.indexes.PandasIndex.from_variables(variables)
        units_dict = {index.index.name: options.get("units")}
        return cls(index, units_dict)


ds = xr.Dataset(coords={"x": [1, 2]})
ds_units = ds.drop_indexes("x").set_xindex("x", PintMetaIndex, units="m")
```
Data selection
Beware that `Index.sel()` may return new index objects along with the dimension integer indexers. The `PintMetaIndex` wrapper will need to wrap them before returning the query results.
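A hedged sketch of what that could look like (the attribute name `indexes` on the query result follows xarray's `IndexSelResult`, but treat the exact shape as an assumption):

```python
class PintMetaIndex:
    def __init__(self, index, units):
        self.index = index
        self.units = units

    def sel(self, labels, **kwargs):
        # convert pint quantities to the units stored on this index,
        # then pass plain magnitudes on to the wrapped index
        converted = {
            name: value.to(self.units[name]).magnitude if hasattr(value, "to") else value
            for name, value in labels.items()
        }
        results = self.index.sel(converted, **kwargs)
        # re-wrap any new index objects returned in the query results
        results.indexes = {
            name: type(self)(index, self.units)
            for name, index in (results.indexes or {}).items()
        }
        return results
```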
nit: I would rename `PintMetaIndex` to just `PintIndex`.
Further comments:
Implementing `Dataset.pint.dequantify()` would look very much the same as `Dataset.pint.quantify()`, except that it would extract the wrapped index: `new_idx = self.index`.
For `Dataset.pint.to()`, I think it is possible to convert the variables first, then re-create the underlying index with `converted_idx = type(self.index).from_variables(converted_vars)` and then wrap the latter with `PintMetaIndex(converted_idx, new_units)`. I don't think it is always a good idea, though, as re-building the underlying index from scratch may be expensive and may load lazy coordinate data.
@benbovy, with a few tweaks to your suggestions this:
```
In [1]: import xarray as xr
   ...: import pint_xarray
   ...:
   ...: ureg = pint_xarray.unit_registry
   ...: ds = xr.tutorial.open_dataset("air_temperature")
   ...: q = ds.pint.quantify({"lat": "degrees", "lon": "degrees"})
   ...: q.sel(lat=ureg.Quantity(75, "deg").to("rad"))
.../xarray/core/indexes.py:473: UnitStrippedWarning: The unit of the quantity is stripped when downcasting to ndarray.
  index = pd.Index(np.asarray(array), **kwargs)
Out[1]:
<xarray.Dataset>
Dimensions:  (time: 2920, lon: 53)
Coordinates:
    lat      float32 [deg] 75.0
  * lon      (lon) float32 [deg] 200.0 202.5 205.0 207.5 ... 325.0 327.5 330.0
  * time     (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
Data variables:
    air      (time, lon) float32 [K] 241.2 242.5 243.5 ... 241.48999 241.79
Indexes:
    lon      PintMetaIndex
    time     PintMetaIndex
Attributes:
    Conventions:  COARDS
    title:        4x daily NMC reanalysis (1948)
    description:  Data is from NMC initialized reanalysis\n(4x/day). These a...
    platform:     Model
    references:   http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanaly...
```
does work :tada:
The only hiccup is that somehow we seem to call `safe_cast_to_index`, which gives us the warning above. This doesn't seem to affect the functionality of the index, though (at least for `sel`, which is the only thing I tested).
(the failing tests are expected, I will have to update some of the workaround code)
Regarding all the errors in CI (`TypeError: <pint_xarray.index.PintMetaIndex object at ...> cannot be cast to a pandas.Index object`), an easy fix would be to implement `PintMetaIndex.to_pandas_index()` and just call that method on the wrapped index. This could be done for most of the rest of the `Index` API, I think.
would that just delegate to the underlying index, or also wrap it (probably the former, but I wanted to make sure)?
In any case, I wonder if we should just fix `diff_*` to not make use of `obj.indexes`? That way, we could avoid the conversion.
> would that just delegate to the underlying index, or also wrap it (probably the former, but I wanted to make sure)?
In the case of `.to_pandas_index()`, just delegating is enough. For other methods (`.rename`, `.roll`, `.isel`, etc.) wrapping the result will also be needed. For `.sel` it might also be needed to check and wrap any index returned within the `IndexSelResult`.
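In code, the difference between the two cases might look like this (hypothetical sketch with simplified signatures):

```python
class PintMetaIndex:
    def __init__(self, index, units):
        self.index = index
        self.units = units

    def to_pandas_index(self):
        # pure delegation: a pandas index has no concept of units,
        # so there is nothing to wrap
        return self.index.to_pandas_index()

    def roll(self, shifts):
        # delegate-and-wrap: the result is a new index, so keep it unit-aware
        return type(self)(self.index.roll(shifts), self.units)
```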
> In any case, I wonder if we should just fix diff_* to not make use of obj.indexes? That way, we could avoid the conversion.
Yes definitely, I think I fixed it in one of my open PRs in the Xarray repo.
Not sure if this was addressed:
> For some other methods like concat, join, equals, reindex, the easiest would probably be to require that the other indexes given as arguments must have the same units (and raise an error otherwise).
Isn't it addressed by `pint.Quantity.to_base_units`?
> Isn't it addressed by `pint.Quantity.to_base_units`?
No, because `to_base_units` will convert the index's data to the base units (i.e. combinations of the base units `m`, `s`, `kg`, ...). Instead, the idea is to require exactly matching units for now, which can be relaxed at a later point.
@benbovy, something I just noticed: where do we need `PintIndex.from_variables`? As far as I understand it, this is used for `set_xindex`, but since `PintIndex` is a meta index (or rather, an index that wraps other indexes) it might not really make sense to pass it to `set_xindex`... Would it be justified to not implement it, even though the base class says otherwise ("Every subclass must at least implement `Index.from_variables`")?
Other than that, I'm not sure what to do about `stack` / `unstack`: we're potentially stacking variables of different unit dimensions, so `stack` might not make too much sense. Maybe we should automatically dequantify the stacked indexes to at least preserve the units? `unstack` might work if we're wrapping a `MultiIndex`, but I'd be leaning towards not supporting it. That would leave us with `concat`, `join`, and `reindex_like`, which I'd skip for now.
I'll have a look at what needs to be changed to get the tests to work, although this would probably require changes to xarray (the `diff_*` functions still make use of `.indexes`, which calls the index's `.to_pandas_index`, but I don't want to implement `.to_pandas_index` here to avoid unintentionally losing a unit).
we also need to figure out how to handle `set_xindex` on top of a quantified dataset (so a `PintIndex` would be overwritten – if that makes sense?)
I think that in general it would be nice if we can keep a simple and predictable behavior for `.set_xindex`, i.e., we can use it with any Xarray `Index` leaf class, and it doesn't make any assumption about the index being overwritten (I think that currently it is not even possible to overwrite an existing index; `.drop_indexes` must be called first).
I wouldn't make a special case for `PintIndex` or any other "meta-index". Although it is likely that in most cases pint indexes won't be set directly via `.set_xindex`, it would make sense to me to have `PintIndex.from_variables` return a pint index wrapping a `PandasIndex` (or an object of the default index type used by Xarray).
> Other than that, I'm not sure what to do about stack / unstack: we're potentially stacking variables of different unit dimensions, so stack might not make too much sense.
Hmm, stack / unstack only act on the dimensions, so in theory it shouldn't invalidate units? Since the `pint_index.units` attribute is a mapping of coordinate name to unit, couldn't `PintIndex` already support multiple units? The core stack / unstack logic is executed by the wrapped index, and `PintIndex` may be agnostic of the wrapped index type.
In the case of a wrapped PandasMultiIndex, the created dimension coordinate (tuple values) may not have a single unit. Perhaps best is to keep it unitless. Eventually that coordinate will be deprecated anyway.
That said, I think it's fine to leave stack / unstack unimplemented for now and implement them in follow-up PRs.
> I don't want to implement .to_pandas_index here to avoid unintentionally losing a unit
Hmm, that sounds a bit too restrictive to me. How about issuing a user warning instead? In the long term we can probably get rid of `.to_pandas_index` once the index refactor is complete Xarray-wise (this method is mostly used internally for features that still rely on `pandas.Index`).
> (or an object of the default index type used by Xarray)
is there a way to access this default index type? Otherwise it's probably fine to keep hard-coding `PandasIndex` for now and fail if that doesn't work.
> (I think that currently it is not even possible to overwrite an existing index; `.drop_indexes` must be called first)
Makes sense, though in that case, to be extra friendly, I wonder if we can customize the error message? Instead of recommending `.drop_indexes` I'd like to recommend `.pint.dequantify` (though it's probably fine if we keep that for later or just rely on the docs).
> How about issuing a user warning instead?
This is what pint does when casting to a numpy array, but I'm not really happy with that; warnings are just too easy to ignore.
I'll see how far I can get without `.to_pandas_index`, and if I encounter anything that truly wants to cast to a `PandasIndex` I may have to reconsider.
> couldn't `PintIndex` already support multiple units?
That might be true, I might have been too tired yesterday when writing this. I'll still postpone the implementation to a later PR, though.
> is there a way to access this default index type?
Not yet (I was thinking too far ahead). For now it is PandasIndex, but maybe later we could add some option in Xarray to get / set the default index type (e.g., for cases where PandasIndex is too expensive and/or not really needed).
> I'll see how far I can get without `.to_pandas_index`
You might want to see where it is called in Xarray internals. For example, if you leave `PintIndex.to_pandas_index()` unimplemented, the `ds.indexes` property will raise an error if there is any pint index in the dataset.
> I wonder if we can customize the error message?
This would require some API entrypoint in Xarray, but I'm not sure it is really worth it, since here you might recommend setting pint indexes "automatically" via `.pint.quantify` instead of manually via `.set_xindex` (but keep supporting the latter for consistency and advanced use cases).
now that I have more time to work on this, I think the only things left to fix right now are:

- [ ] figure out what to do with `assert_identical`. The point of using `assert_identical` was to make sure the units on dimension coordinates survived, but now that they are on indexes we can maybe just remove that check / use `assert_equal` after applying `strip_units` to `actual`
- [ ] figure out how to get rid of the `UnitsStrippedWarning` emitted by `.pint.to`, which happens because `Dataset` / `Coordinates` tries to create default indexes on quantified variables when assembling the converted variables back into a dataset in `dataset_from_variables`. To resolve this, I think we need to somehow figure out how to recreate the underlying index from the converted variables.
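That re-creation step could be sketched roughly like this (hypothetical helper; it assumes the wrapped index type can rebuild itself via a `from_variables` classmethod, as `PandasIndex` can):

```python
def recreate_wrapped_index(pint_index, converted_vars, new_units):
    # rebuild the inner index from the already-converted variables,
    # then re-wrap it with the new units so no unit information is lost
    inner_cls = type(pint_index.index)
    new_inner = inner_cls.from_variables(converted_vars, options={})
    return type(pint_index)(new_inner, new_units)
```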
The remaining methods I think we can leave to separate PRs: `concat`, `stack`, `unstack`, `join`, and `reindex_like`.
> figure out how to get rid of the UnitsStrippedWarning emitted by .pint.to, which happens because Dataset / Coordinates tries to create default indexes on quantified variables when assembling the converted variables back into a dataset in dataset_from_variables.
For this one it would be nice if `dataset_from_variables` could be refactored to allow passing a `Coordinates` object (coordinate variables + indexes) that we can then pass to the `Dataset` constructor (maybe after updating it) so that it will skip the creation of new default indexes. The only thing is that I don't remember if `Coordinates` was already part of Xarray's public API before this new behavior was implemented (you might need to temporarily rely on importing from `xarray.core`).
we're already using some part of the private API of xarray (`Coordinates._construct_direct`), so I would be fine to use `Coordinates` either way. However, from my experiments it appears just passing the variables to the constructor of that class still creates the default indexes. Do you have any hints on how to avoid that? Is that `Coordinates._construct_direct` again?

Edit: oh, wait, maybe that's what you've been saying? I'll check.
You'd need to pass `Coordinates(variables={...}, indexes={})` to skip the creation of default indexes, or `Coordinates._construct_direct` should work too (see https://github.com/pydata/xarray/pull/8107).
`Coordinates._construct_direct` works, but `Coordinates(coords=coords, indexes={})` still tries to create default indexes. Not sure why.

Edit: It appears `as_variable` converts to an index variable.
Ah yes, see https://github.com/pydata/xarray/pull/8107/files#r1311214263
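For reference, with the fixes from that PR in a recent xarray release, passing an explicitly empty `indexes` mapping to the public `xr.Coordinates` class does skip default index creation (a minimal sketch, assuming a current xarray version):

```python
import xarray as xr

# an explicitly empty `indexes` mapping tells Coordinates not to build
# a default PandasIndex for the dimension coordinate "x"
coords = xr.Coordinates(
    coords={"x": xr.Variable("x", [1, 2, 3])},
    indexes={},
)
ds = xr.Dataset(coords=coords)
print(len(ds.xindexes))  # no default index should have been created
```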
a couple more notes:

- the `diff` called by `assert_identical` appears to "work" just by changing `a.indexes` to `a.xindexes` (it doesn't raise, at least). However, `Dataset.equals` and `Dataset.identical` don't compare the indexes yet, and the diff doesn't display differing indexes, either.
- looking at `dataset_from_variables`, I think we can make that a lot simpler by passing along the original dataset, the new variables, and the new indexes (if any)
looks like @TomNicholas' efforts to allow creating xarray objects from `Coordinates` objects without indexes fixed a lot of the issues I had here, as well. One of the few things left is to get the diff to not convert everything to pandas indexes.
with this, all the tests pass locally (except the doctests, which I didn't try yet). ~~For them to pass in CI as well, we need a released version of xarray's main branch (otherwise the indexes will be cast to pandas indexes in the tests).~~ looks like for some reason they pass with the released xarray as well.
with the recent commits, this should be ready for reviews (cc @TomNicholas, @benbovy, @jthielen).
Note that even though the index implements a couple of the methods of the custom index base class, the wrapper methods (`.pint.*`) should still be preferred. The reason is that these implementations are not yet complete: for example, `PintIndex.sel` should modify the `IndexSelResult` object returned by the wrapped index. This will be fixed in follow-up PRs, as this PR is already way too big.
So right now `indexes` is a valid argument to the `DataArray` constructor but not to the `Dataset` constructor?
What exactly are you referring to? Not sure if this is it, but the entire code base is built upon `call_on_dataset`, which (just like many of the methods on `DataArray`) converts to a temporary dataset, applies the function, then converts back.
> What exactly are you referring to?
I was actually referring to the top-level usage example in your github comment: the snippet that passes `indexes={...}` and `fastpath=True` to the `xr.DataArray` constructor.
But that seems to be the case when I look at the `DataArray` source.
> So right now indexes is a valid argument to the DataArray constructor but not to the Dataset constructor?
Yes, but this is really for internal use along with the `fastpath` argument. Ideally we might want to remove both and have a fast-path factory for `DataArray` instead, but last time I checked it looked less straightforward to refactor than it seemed.
Great to see this ready, @keewis! I went through the changes and it looks good to me (I cannot tell much about the pint-specific logic, though).