pint-xarray icon indicating copy to clipboard operation
pint-xarray copied to clipboard

implement the `PintMetaIndex`

Open keewis opened this issue 2 years ago • 23 comments

As mentioned in #162, it is possible to get the indexing functions to work, although there still is no public API.

I also still don't quite understand how other methods work since the refactor, so this only implements sel.

Usage, for anyone who wants to play around with it
import xarray as xr
from pint_xarray.index import PintMetaIndex

ds = xr.tutorial.open_dataset("air_temperature")
arr = ds.air

new_arr = xr.DataArray(
    arr.variable,
    coords={
        "lat": arr.lat.variable,
        "lon": arr.lon.variable,
        "time": arr.time.variable,
    },
    indexes={
        "lat": PintMetaIndex(arr.xindexes["lat"], {"lat": arr.lat.attrs.get("units")}),
        "lon": PintMetaIndex(arr.xindexes["lon"], {"lon": arr.lon.attrs.get("units")}),
        "time": arr.xindexes["time"],
    },
    fastpath=True,
)
new_arr.sel(
    lat=ureg.Quantity([75, 70, 65], "deg"),
    lon=ureg.Quantity([200, 202.5], "deg"),
)

This will fail at the moment because xarray treats dask arrays differently from duck-dask arrays, but passing single values works!

  • [x] Closes #162, closes #205, closes #218
  • [ ] Tests added
  • [x] Passes pre-commit run --all-files
  • [ ] User visible changes (including notable bug fixes) are documented in whats-new.rst
  • [ ] New functions/methods are listed in api.rst

keewis avatar Mar 26 '22 00:03 keewis

I also still don't quite understand how other methods work since the refactor, so this only implements sel.

Here are a few comments. Happy to answer questions if any.

There are some Index methods of like isel, stack, rename that I guess do not depend on units and where you could just forward the call + args to the wrapped index.

For some other methods like concat, join, equals, reindex, the easiest would probably to require that the other indexes given as arguments must have the same units (and raise an error otherwise). An alternative option would be to implicitly convert the units of those indexes, but it's difficult to do that in an index-agnostic way unless we define and adopt in Xarray some protocol for that purpose.

The general approach used in the Xarray indexes refactor heavily relies on the type of the indexes (at least when we need to compare them together). That's not super flexible with the PintMetaIndex solution here, but I think it's reasonable enough. For example, alignment will only work when the indexes found for a common set of coordinates / dimensions are all PintMetaIndex objects. This might behave weirdly when the PintMetaIndex objects do not wrap indexes of the same type, although this would be easy to check.

I wonder whether whether PintMetaIndex should accept one unit or a map of units. The latter makes sense if, e.g., you want to reuse it to wrap multi-indexes where the corresponding coordinate variables (levels) may have different units.

Regarding Index methods like from_variables and create_variables, I guess PintMetaIndex would implement a lightweight wrapping layer to respectively get and assign a pint quantity from / to the Xarray variables.

You should also be careful when converting the units of indexed coordinates as it may get out of sync with their index. As there's no concept of "duck" index, the easiest would probably be to drop the index (and maybe reconstruct it from scratch) when the coordinates are updated.

benbovy avatar Mar 28 '22 13:03 benbovy

@keewis I have been looking into this once again and now I think I better understand what you'd like to achieve with the PintMetaIndex wrapper. A few other suggestions:

Wrap index coordinate variables as unit-aware variables

I'm not familiar with pint, but if a pint.Quantity works with any duck-array without coercing it into a specific type, I think that something like this may work:

class PintMetaIndex:

    def create_variables(self, variables=None):
        index_vars = self.index.create_variables(variables)

        index_vars_units = {}
        for name, var in index_vars.items():
            data = array_attach_unit(var.data, self.units[name])
            var_units = xr.Variable(var.dims, data, attrs=var.attrs, encoding=var.encoding)
            index_vars_units[name] = var_units
        
        return index_var_units

We cannot use IndexVariable since (if I remember well) it coerces the data as a pd.Index. I think it is fine to stick with Variable for now, it will work in most cases except for, e.g., xr.testing.assert_identical() that maybe checks for the variable type (although maybe check_default_indexes=False may disable it). We'll need to change that in Xarray anyway.

Set new Pint index(es)

Since PintMetaIndex is meant to wrap any existing index, it could be done via the accessors while quantifying the variables. Something like this might work:

@register_dataset_accessor("pint")
class PintDatasetAccessor:

    def quantify(self, units=_default, unit_registry=None, **unit_kwargs)):
        ...

        ds_xindexes = self.ds.xindexes
        new_indexes, new_index_vars = ds_xindexes.copy_indexes()

        for idx, idx_vars in ds_xindexes.group_by_index():
            idx_units = {k: v for k, v in units.items() if k in idx_vars}
            new_idx = PintMetaIndex(idx, idx_units)
            new_indexes.update({k: new_idx for k in idx_vars})
            new_index_vars.update(new_idx.create_variables(idx_vars))

        new_coords = xr.Coordinates(new_index_vars, new_indexes)

        # needs https://github.com/pydata/xarray/pull/8094 to work properly
        ds_updated_temp = self.ds.assign_coords(new_coords)

        ...

It is still useful to implement PintMetaIndex.from_variables() with the common PandasIndex case so that it also works with Dataset.set_xindex:

class PintMetaIndex:
    
    @classmethod
    def from_variables(cls, variables, options):
        index = xr.indexes.PandasIndex.from_variables(variables)
        units_dict = {index.index.name: options.get("units")}
        return = cls(index, units_dict)


ds = xr.Dataset(coords={"x": [1, 2]})

ds_units = ds.drop_indexes("x").set_xindex("x", PintMetaIndex, units="m")

Data selection

Beware that Index.sel() may return new index objects along with the dimension integer indexers. the PintMetaIndex wrapper will need to wrap them before returning the query results.

benbovy avatar Aug 29 '23 07:08 benbovy

nit: I would rename PintMetaIndex to just PintIndex.

benbovy avatar Aug 29 '23 07:08 benbovy

Further comments:

Implementing Dataset.pint.dequantify() would look very much the same than in Dataset.pint.quantify() except that it would extract the wrapped index: new_idx = self.index.

For Dataset.pint.to(), I think it is possible to convert the variables first, then re-create the underlying index with converted_idx = type(self.index).from_variables(converted_vars) and then wrap the latter with PintMetaIndex(converted_idx, new_units). I don't think it is always a good idea, though, as re-building the underlying index from scratch may be expensive and may load lazy coordinate data.

benbovy avatar Aug 29 '23 08:08 benbovy

@benbovy, with a few tweaks to your suggestions this:

In [1]: import xarray as xr
   ...: import pint_xarray
   ...: 
   ...: ureg = pint_xarray.unit_registry
   ...: ds = xr.tutorial.open_dataset("air_temperature")
   ...: q = ds.pint.quantify({"lat": "degrees", "lon": "degrees"})
   ...: q.sel(lat=ureg.Quantity(75, "deg").to("rad"))
.../xarray/core/indexes.py:473: UnitStrippedWarning: The unit of the quantity is stripped when downcasting to ndarray.
  index = pd.Index(np.asarray(array), **kwargs)
Out[1]: 
<xarray.Dataset>
Dimensions:  (time: 2920, lon: 53)
Coordinates:
    lat      float32 [deg] 75.0
  * lon      (lon) float32 [deg] 200.0 202.5 205.0 207.5 ... 325.0 327.5 330.0
  * time     (time) datetime64[ns] 2013-01-01 ... 2014-12-31T18:00:00
Data variables:
    air      (time, lon) float32 [K] 241.2 242.5 243.5 ... 241.48999 241.79
Indexes:
    lon      PintMetaIndex
    time     PintMetaIndex
Attributes:
    Conventions:  COARDS
    title:        4x daily NMC reanalysis (1948)
    description:  Data is from NMC initialized reanalysis\n(4x/day).  These a...
    platform:     Model
    references:   http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanaly...

does work :tada:

The only hickup is that somehow we seem to call safe_cast_to_index, which gives us the warning above. This doesn't seem to affect the functionality of the index, though (at least for sel, which is the only thing I tested).

keewis avatar Sep 13 '23 09:09 keewis

(the failing tests are expected, I will have to update some of the workaround code)

keewis avatar Sep 13 '23 09:09 keewis

Regarding all the errors in CI TypeError: <pint_xarray.index.PintMetaIndex object at ...> cannot be cast to a pandas.Index object, an easy fix would be to implement PintMetaIndex.to_pandas_index() and just call that method for the wrapped index. This could be done for most of the rest of the Index API I think.

benbovy avatar Sep 19 '23 11:09 benbovy

would that just delegate to the underlying index, or also wrap it (probably the former, but I wanted to make sure)?

In any case, I wonder if we should just fix diff_* to not make use of obj.indexes? That way, we could avoid the conversion.

keewis avatar Sep 19 '23 11:09 keewis

would that just delegate to the underlying index, or also wrap it (probably the former, but I wanted to make sure)?

In the case of .to_pandas_index() just delegate is enough. For other methods (.rename, .roll, .isel, etc.) wrapping the result will also be needed. For .sel it might also be needed to check and wrap any index returned within the IndexSelResults.

In any case, I wonder if we should just fix diff_* to not make use of obj.indexes? That way, we could avoid the conversion.

Yes definitely, I think I fixed it in one of my open PRs in the Xarray repo.

benbovy avatar Sep 19 '23 11:09 benbovy

Not sure if this was addressed:

For some other methods like concat, join, equals, reindex, the easiest would probably to require that the other indexes given as arguments must have the same units (and raise an error otherwise).

Isn't it addressed by pint.Quantity.to_base_unit?

LecrisUT avatar Dec 07 '23 22:12 LecrisUT

Isn't it addressed by pint.Quantity.to_base_unit?

No, because to_base_unit will convert the ~index~ index's data to the base units (i.e. combinations of the base units m, s, kg, ...). Instead, the idea is to require exactly matching units for now, which can be relaxed at a later point.

keewis avatar Dec 09 '23 22:12 keewis

@benbovy, something I just noticed: where do we need PintIndex.from_variables? As far as I understand it, this is used for set_xindex, but since PintIndex is a meta index (or rather, a index that wraps other indexes) it might not really make sense to pass it to set_xindex... Would it be justified to not implement it, even though the base class says otherwise ("Every subclass must at least implement Index.from_variables")?

Other than that, I'm not sure what to do about stack / unstack: we're potentially stacking variables of different unit dimensions, so stack might not make too much sense. Maybe we should automatically dequantify the stacked indexes to at least preserve the units? unstack might work if we're wrapping a MultiIndex, but I'd be leaning towards not supporting it. That would leave us with concat, join, and reindex_like, which I'd skip for now.

I'll have a look at what needs to be changed to get the tests to work, although this would probably require changes to xarray (the diff_* functions still make use of .indexes that call the index's .to_pandas_index, but I don't want to implement .to_pandas_index here to avoid unintentionally losing a unit).

keewis avatar Dec 10 '23 20:12 keewis

we also need to figure out how to handle set_xindex on top of a quantified dataset (so a PintIndex would be overwritten – if that makes sense?)

keewis avatar Dec 10 '23 21:12 keewis

I think that in general it would be nice if we can keep a simple and predictable behavior for .set_xindex, i.e., we can use it with any Xarray Index leaf class and also it doesn't make any assumption about the index being overwritten (I think that currently this is not even possible to overwrite an existing index and .drop_indexes must be called first).

I wouldn't make a special case for PintIndex or any other "meta-index". Although it is likely that in most cases pint indexes won't be set directly via .set_xindex, it would make sense to me having PintIndex.from_variables return a pint index wrapping a PandasIndex (or an object of the default index type used by Xarray).

Other than that, I'm not sure what to do about stack / unstack: we're potentially stacking variables of different unit dimensions, so stack might not make too much sense.

Hmm stack / unstack only act on the dimensions so in theory it shouldn't invalidate units? Since the pint_index.units attribute is a mapping of coordinate name to unit, couldn't PintIndex already support multiple units? The core stack / unstack logic is executed by the wrapped index and PintIndex may be agnostic of the wrapped index type.

In the case of a wrapped PandasMultiIndex, the created dimension coordinate (tuple values) may not have a single unit. Perhaps best is to keep it unitless. Eventually that coordinate will be deprecated anyway.

That said, I think it's fine to leave stack / unstack unimplemented for now and implement them in follow-up PRs.

I don't want to implement .to_pandas_index here to avoid unintentionally losing a unit

Hmm that sounds a bit too restrictive to me. How about issuing a user warning instead? In the long-term we can probably get rid of .to_pandas_index when the index refactor will be complete Xarray-wise (this method is mostly used internally for features that still rely on pandas.Index).

benbovy avatar Dec 11 '23 13:12 benbovy

(or an object of the default index type used by Xarray)

is there a way to access this default index type? Otherwise it's probably fine to keep hard-coding PandasIndex for now and fail if that doesn't work.

(I think that currently this is not even possible to overwrite an existing index and .drop_indexes must be called first)

Makes sense, though in that case to be extra friendly I wonder if we can customize the error message? Instead of recommending .drop_indexes I'd like to recommend .pint.dequantify (though it's probably fine if we keep that for later or just rely on the docs)

How about issuing a user warning instead?

This is what pint does when cast to a numpy array, but I'm not really happy with that, warnings are just too easy to ignore.

I'll see how far I can get without .to_pandas_index, and if I encounter anything that truly wants to cast to a PandasIndex I may have to reconsider.

couldn't PintIndex already support multiple units?

That might be true, I might have been too tired yesterday when writing this. I'll still postpone the implementation to a later PR, though.

keewis avatar Dec 11 '23 14:12 keewis

is there a way to access this default index type?

Not yet (I was thinking too much ahead). For now it is PandasIndex but maybe later we could add some option in Xarray to get / set the default index type (e.g., for cases where PandasIndex is too expensive and/or not really needed).

I'll see how far I can get without .to_pandas_index

You might want to see where it is called in Xarray internals. For example, if you leave PintIndex,to_pandas_index() unimplemented, the ds.indexes property will raise an error if there is any pint index in the dataset.

I wonder if we can customize the error message?

This would require some API entrypoint in Xarray, but I'm not sure it is really worth it since here you might recommend setting pint indexes "automatically" via .pint.quantify instead of manually via .set_xindex (but keep support the latter for consistency and advanced use cases).

benbovy avatar Dec 11 '23 16:12 benbovy

now that I have more time to work on this, I think the only things left to fix right now is:

  • [ ] figure out what to do with assert_identical. The point of using assert_identical was to make sure the units on dimension coordinates survived, but now that they are on indexes we can maybe just remove that check / use assert_equal after applying strip_units to actual
  • [ ] figure out how to get rid of the UnitsStrippedWarning emitted by .pint.to, which happens because Dataset / Coordinates tries to create default indexes on quantified variables when assembling the converted variables back into a dataset in dataset_from_variables. To resolve this, I think we need to somehow figure out how to recreate the underlying index from the converted variables.

The remaining methods I think we can leave to separate PRs: concat, stack, unstack, join, reindex_like.

keewis avatar Dec 21 '23 15:12 keewis

figure out how to get rid of the UnitsStrippedWarning emitted by .pint.to, which happens because Dataset / Coordinates tries to create default indexes on quantified variables when assembling the converted variables back into a dataset in dataset_from_variables.

For this one it would be nice if dataset_from_variables could be refactored to allow passing a Coordinates object (coordinate variables + indexes) that we can then pass the the Dataset constructor (maybe after updating it) so that it will skip the creation of new default indexes. The only thing is that I don't remember if Coordinates was already part of Xarray's public API before this new behavior was implemented (you might need to temporarily rely on importing from xarray.core).

benbovy avatar Dec 21 '23 15:12 benbovy

we're already using some part of the private API of xarray (Coordinates._construct_direct), so I would be fine to use Coordinates either way. However, from my experiments it appears just passing the variables to the constructor of that class still creates the default indexes. Do you have any hints on how to avoid that? Is that Coordinates._construct_direct again?

Edit: oh, wait, maybe that's what you've been saying? I'll check.

keewis avatar Dec 21 '23 15:12 keewis

You'd need to pass Coordinates(variables={...}, indexes={}) to skip the creation of default indexes, or Coordinates._contruct_direct should work too (see https://github.com/pydata/xarray/pull/8107).

benbovy avatar Dec 21 '23 15:12 benbovy

Coordinates._construct_direct works, but Coordinates(coords=coords, indexes={}) still tries to create default indexes. Not sure why.

Edit: It appears as_variable converts to an index variable.

keewis avatar Dec 21 '23 15:12 keewis

Ah yes, see https://github.com/pydata/xarray/pull/8107/files#r1311214263

benbovy avatar Dec 21 '23 15:12 benbovy

a couple more notes:

  • the diff called by assert_identical appears to "work" just by changing a.indexes to a.xindexes (it doesn't raise, at least). However, Dataset.equals and Dataset.identical don't compare the indexes, yet, and the diff doesn't display differing indexes, either.
  • looking at dataset_from_variables, I think we can make that a lot simpler by passing along the original dataset, the new variables and the new indexes (if any)

keewis avatar Dec 22 '23 12:12 keewis

looks like @TomNicholas' efforts to allow creating xarray objects from Coordinates objects without indexes fixed a lot of the issues I had here, as well. One of the few things left is to get the diff to not convert everything to pandas indexes.

keewis avatar Jun 23 '24 16:06 keewis

with this, all the tests pass locally (except the doctests, which I didn't try yet). ~For them to pass in CI as well, we need a released version of xarray's main branch (otherwise the indexes will be cast to pandas indexes in the tests).~ looks like for some reason they pass with the released xarray as well.

keewis avatar Jul 06 '24 14:07 keewis

with the recent commits, this should be ready for reviews (cc @TomNicholas, @benbovy, @jthielen).

Note that even though the index implements a couple of the methods of the custom index base class, the wrapper methods (.pint.*) should still be preferred. The reason for that is that these implementations are not yet complete: for example, PintIndex.sel should modify the IndexSelResult object returned by the wrapped index. This will be fixed in follow-up PRs, as this PR is already way too big.

keewis avatar Jul 06 '24 14:07 keewis

So right now indexes is a valid argument to the DataArray constructor but not to the Dataset constructor?

What exactly are you referring to? Not sure if this is it, but the entire code base is built upon call_on_dataset, which (just like much of the methods on DataArray) converts to a temporary dataset, applies the function, then converts back.

keewis avatar Jul 08 '24 21:07 keewis

What exactly are you referring to?

I was actually referring to your top-level example in your github comment:

Usage, for anyone who wants to play around with it
import xarray as xr
from pint_xarray.index import PintMetaIndex

ds = xr.tutorial.open_dataset("air_temperature")
arr = ds.air

new_arr = xr.DataArray(
    arr.variable,
    coords={
        "lat": arr.lat.variable,
        "lon": arr.lon.variable,
        "time": arr.time.variable,
    },
    indexes={
        "lat": PintMetaIndex(arr.xindexes["lat"], {"lat": arr.lat.attrs.get("units")}),
        "lon": PintMetaIndex(arr.xindexes["lon"], {"lon": arr.lon.attrs.get("units")}),
        "time": arr.xindexes["time"],
    },
    fastpath=True,
)
new_arr.sel(
    lat=ureg.Quantity([75, 70, 65], "deg"),
    lon=ureg.Quantity([200, 202.5], "deg"),
)

This will fail at the moment because xarray treats dask arrays differently from duck-dask arrays, but passing single values works!

But that seems to be the case when I look at the DataArray source.

TomNicholas avatar Jul 08 '24 23:07 TomNicholas

So right now indexes is a valid argument to the DataArray constructor but not to the Dataset constructor?

Yes but this is really for internal use along with the fastpath argument. Ideally we might want to remove both and have a fast-path factory instead for DataArray, but last time I checked it looked less straightforward to refactor than it seemed.

benbovy avatar Jul 09 '24 07:07 benbovy

Great to see this ready @keewis ! I went through the changes and it looks good to me (cannot tell much about the pint-specific logic, though).

benbovy avatar Jul 09 '24 07:07 benbovy