Add `fill` to `.sel` for missing values.

Open asford opened this issue 10 months ago • 5 comments

Is your feature request related to a problem?

It would be ideal if the .sel accessors provided a fill mode that nan-fills, rather than errors, when a value isn't present in the target index.

This often occurs when a Dataset or DataArray is indexed by an arbitrary categorical value, and we want to index against that value by an arbitrarily-dimensioned coordinate from another array. If the coordinate is missing in the target index, I want a (conceptual) outer join rather than an error.

It would be useful to add a fill_value parameter, used like fillna, so that these missing indices can be filled with a user-provided value. However, this might be dangerous because .sel already accepts indexers as kwargs, so a fill_value keyword could collide with a dimension of the same name. Constraining this to when it's explicitly requested via mode="fill" or the other fill modes would make this safer.

This is related to https://github.com/pydata/xarray/issues/4995

Describe the solution you'd like

Adding a fill mode to sel, which performs nan-filling, rather than raising a KeyError, when values are missing from the index. If mode="fill" is specified, allow an additional fill_value kwarg which specifies the fill value with the same semantics as fillna.

Potentially allow this filling logic for other modes that would currently raise a KeyError on failed fill operations.

This is a restricted, inefficient workaround that implements this logic for a single indexing dimension:

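from typing import Any, TypeVar

import pandas as pd
import xarray as xa

# Assumed definition: the original snippet used XA without defining it.
XA = TypeVar("XA", xa.DataArray, xa.Dataset)

def sel_fill(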
    left: XA,
    dim: str,
    right_coord: xa.DataArray,
    fill_value: Any | dict[str, Any] = xa.core.dtypes.NA,  # type: ignore
) -> XA:
    """.sel but return nan/_fill for missing values.

    This is a "right join" when indexing a dim with a coord array.
    For example, for "data" with a dim matching a coord of "samples":

    `data.sel(name=samples.sample_name)` is a strict join...
    ...if a value in sample_name isn't present in data.name it's an error.
    However, you may want to select-and-fill-missing-values...
    ...if a value is present in data.name, then select data.
    ...otherwise return a nan or filled value.

    This uses the same filling logic as xa.align:
    provide a dictionary of names to fill with specific values.

    """
    assert dim in left.indexes
    index: pd.Index = left.indexes[dim]

    coord_index = pd.Index(right_coord.values.ravel())

    missing_values = coord_index.difference(index)

    if missing_values.empty:
        left_data = left
    else:
        left_data = xa.concat(
            [
                left,
                left.reindex({dim: missing_values}, fill_value=fill_value),
            ],
            dim=dim,
        )

    return left_data.sel({dim: right_coord})
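
A possibly cheaper variant of the workaround above, sketched under assumptions rather than proposed as the implementation: look up integer positions with pandas `Index.get_indexer` (which returns -1 for absent labels), index by position with `.isel`, then mask the misses with `.where`. The name `sel_fill_positions` is hypothetical and not part of xarray. The sketch assumes `left` is a DataArray with a unique, non-empty index on `dim`.

```python
import numpy as np
import xarray as xa


def sel_fill_positions(left, dim, right_coord, fill_value=np.nan):
    """Hypothetical helper (not an xarray API): select by label, fill misses."""
    index = left.indexes[dim]  # assumed unique and non-empty
    labels = right_coord.values
    # get_indexer returns -1 for every label that is absent from the index.
    positions = index.get_indexer(labels.ravel()).reshape(labels.shape)
    found = xa.DataArray(positions >= 0, dims=right_coord.dims)
    # Point misses at position 0 so that isel succeeds; they are masked below.
    safe = xa.DataArray(np.where(positions >= 0, positions, 0), dims=right_coord.dims)
    picked = left.isel({dim: safe})
    filled = picked.where(found, fill_value)
    # Report the queried labels, not the placeholder labels picked for misses.
    return filled.assign_coords({dim: (right_coord.dims, labels)})


dat = xa.DataArray(np.arange(4), dims="letter", coords={"letter": list("abcd")})
query = xa.DataArray([["a", "b", "c"], ["d", "e", "f"]])
out = sel_fill_positions(dat, "letter", query, fill_value=1663)
```

This avoids concatenating a reindexed copy of the data, at the cost of handling only a single DataArray and a single indexing dimension.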

Describe alternatives you've considered

  • .sel - Raises a KeyError if a coordinate isn't present in the index. Can perform fill for some numeric indexes, but doesn't have a fill operation for discrete indexes.
  • .reindex - Works when the "query" or "right" coordinate has the same dims as the "value" or "left" data; however, we can't reindex against coordinates with different dimensionality than the value coordinate.
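
The contrast between the two alternatives can be seen in a small sketch with a one-dimensional indexer along the indexed dim. The data and the fill value here are made up for illustration.

```python
import numpy as np
import xarray as xa

dat = xa.DataArray(np.arange(4), dims="letter", coords={"letter": list("abcd")})

# A 1-D indexer along the same dim: reindex fills the label missing from the index.
filled = dat.reindex(letter=["a", "e"], fill_value=-1)

# The same labels through .sel raise KeyError because "e" is not in the index.
try:
    dat.sel(letter=["a", "e"])
    sel_raised = False
except KeyError:
    sel_raised = True
```

Neither call covers the case in this issue, where the indexer has dims other than the one being indexed.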

Additional context

Semi-MVP repro:

import xarray as xa
import numpy as np
import pandas as pd
from typing import Any

def sel_fill(
    left: xa.DataArray,
    dim: str,
    right_coord: xa.DataArray,
    fill_value: Any | dict[str, Any] = xa.core.dtypes.NA,  # type: ignore
) -> xa.DataArray:
    """.sel but return nan/_fill for missing values.

    This is a "right join" when indexing a dim with a coord array.
    For example, for "data" with a dim matching a coord of "samples":

    `data.sel(name=samples.sample_name)` is a strict join...
    ...if a value in sample_name isn't present in data.name it's an error.
    However, you may want to select-and-fill-missing-values...
    ...if a value is present in data.name, then select data.
    ...otherwise return a nan or filled value.

    This uses the same filling logic as xa.align:
    provide a dictionary of names to fill with specific values.

    """
    assert dim in left.indexes
    index: pd.Index = left.indexes[dim]

    coord_index = pd.Index(right_coord.values.ravel())

    missing_values = coord_index.difference(index)

    if missing_values.empty:
        left_data = left
    else:
        left_data = xa.concat(
            [
                left,
                left.reindex({dim: missing_values}, fill_value=fill_value),
            ],
            dim=dim,
        )

    return left_data.sel({dim: right_coord})

dat = xa.DataArray(np.arange(4), dims="letter").assign_coords(letter=list("abcd"))
query = xa.DataArray([["a", "b", "c"], ["d", "e", "f"]])

# KeyError: "not all values found in index 'letter'"
dat.sel(letter=query)

# ValueError: Indexer has dimensions ('dim_0', 'dim_1') that are different from that to be indexed along 'letter'
dat.reindex(dict(letter=query))

# Just-right
sel_fill(dat, "letter", query, fill_value=1663)
<xarray.DataArray (dim_0: 2, dim_1: 3)>
array([[   0,    1,    2],
       [   3, 1663, 1663]])
Coordinates:
    letter   (dim_0, dim_1) object 'a' 'b' 'c' 'd' 'e' 'f'
Dimensions without coordinates: dim_0, dim_1

asford, Mar 26 '24 22:03

Thanks for opening your first issue here at xarray! Be sure to follow the issue template! If you have an idea for a solution, we would really welcome a Pull Request with proposed changes. See the Contributing Guide for more. It may take us a while to respond here, but we really value your contribution. Contributors like you help make xarray better. Thank you!

welcome[bot], Mar 26 '24 22:03

IIUC we call this reindex: https://docs.xarray.dev/en/stable/generated/xarray.Dataset.reindex.html

Do you think the docs for sel can be improved to point this out?

dcherian, Mar 27 '24 17:03

reindex is close to the behavior we want, but mandates that the indexer has the same dims as the target. In this operation, we're indexing along a single dim with a multi-dimensional coord. So we expect the selection + fill to result in a differently-dimensioned structure.

I've put a gist up at: https://gist.github.com/asford/b10a9f62b3657be7a122b3411e90ba2a

I may not be invoking this correctly; happy to take further guidance.

asford, Mar 28 '24 01:03

Clarified in description, perhaps an adaptation of reindex rather than sel? The naïve dat.reindex(letter=query.stack(letter=query.dims)) breaks because reindex ends up replacing the coord index (which we want to preserve) with the indexing values.

asford, Mar 28 '24 01:03

perhaps an adaptation of reindex rather than sel?

I think this would make more sense indeed. Similarly to advanced indexing, advanced reindexing might perhaps be done by passing xarray Variable or DataArray objects as indexer mapping values to .reindex() and checking their dimensions. Not sure about the amount of internal refactoring this would require, though.
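
For reference, the advanced (vectorized) indexing this comparison refers to already works in `.sel` when every label is present: a multi-dimensional DataArray indexer gives the result the indexer's dims. A small sketch, with made-up data and the dim names `row` and `col` chosen only for illustration:

```python
import numpy as np
import xarray as xa

dat = xa.DataArray(np.arange(4), dims="letter", coords={"letter": list("abcd")})

# A 2-D indexer whose labels all exist in the "letter" index.
indexer = xa.DataArray([["a", "b"], ["c", "d"]], dims=("row", "col"))

# The result drops the "letter" dim and takes the indexer's dims instead.
picked = dat.sel(letter=indexer)
```

An "advanced reindexing" along these lines would presumably return the same shape while filling labels that are absent from the index; whether `.reindex()` could accept such indexers is the open question in this thread.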

benbovy, Mar 28 '24 09:03