xarray 32- vs 64-bit coordinates coordinates in where()

trafficstars

What happened?

I'm struggling whether this is a bug or not. At least I faced a very unexpected behaviour.

For two given data arrays a and b with same dimensions and equal coordinates, c for c = a.where(b) should have equal dimensions and coordinates.

However if the coordinates of a have dtype of float32 and those of b are float64, then the dimension sizes of c will always be two. Of course, this way the coordinates of a and b are no longer exactly equal, but from a user perspective they represent the same labels.

The behaviour is likely caused by the fact that the indexes generated for the coordinates are no longer strictly equal, therefore where() picks only the two outer cells of each dimension. Allowing to explicitly pass indexes may help here, see #6392.

What did you expect to happen?

In the case described above, the dimensions and coordinates of c should be equal to a (and b).

Minimal Complete Verifiable Example

import numpy as np
import xarray as xr

c32 = xr.DataArray(np.linspace(0, 1, 10, dtype=np.float32), dims='x')
c64 = xr.DataArray(np.linspace(0, 1, 10, dtype=np.float64), dims='x')

c3 = c32.where(c64 > 0.5)
assert len(c32) == len(c3)

v32 = xr.DataArray(np.random.random(10), dims='x', coords=dict(x=c32))
v64 = xr.DataArray(np.random.random(10), dims='x', coords=dict(x=c64))

v3 = v32.where(v64 > 0.5)
assert len(v32) == len(v3)
# --> Assertion error, Expected :10, Actual :2

MVCE confirmation

[X] Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
[X] Complete example — the example is self-contained, including all data and the text of any traceback.
[X] Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
[X] New issue — a search of GitHub Issues suggests this is not a duplicate.

Relevant log output

No response

Anything else we need to know?

No response

Environment

INSTALLED VERSIONS ------------------ commit: None python: 3.9.12 | packaged by conda-forge | (main, Mar 24 2022, 23:17:03) [MSC v.1929 64 bit (AMD64)] python-bits: 64 OS: Windows OS-release: 10 machine: AMD64 processor: AMD64 Family 25 Model 80 Stepping 0, AuthenticAMD byteorder: little LC_ALL: None LANG: None LOCALE: ('de_DE', 'cp1252') libhdf5: 1.12.1 libnetcdf: 4.8.1 xarray: 2022.3.0 pandas: 1.4.2 numpy: 1.21.6 scipy: 1.8.0 netCDF4: 1.5.8 pydap: None h5netcdf: None h5py: None Nio: None zarr: 2.11.3 cftime: 1.6.0 nc_time_axis: None PseudoNetCDF: None rasterio: 1.2.10 cfgrib: None iris: None bottleneck: None dask: 2022.04.1 distributed: 2022.4.1 matplotlib: 3.5.1 cartopy: None seaborn: None numbagg: None fsspec: 2022.3.0 cupy: None pint: None sparse: None setuptools: 62.1.0 pip: 22.0.4 conda: None pytest: 7.1.2 IPython: 8.2.0 sphinx: None

May 05 '22 06:05 forman

This does seem very odd. Does anyone have any ideas? As per @forman , changing

-c32 = xr.DataArray(np.linspace(0, 1, 10, dtype=np.float32), dims='x')
+c32 = xr.DataArray(np.linspace(0, 1, 10, dtype=np.float64), dims='x')

...causes the assertion to pass.

I'm not sure using floats as indexes is great, but I wouldn't have expected the results to be like this...

May 05 '22 07:05 max-sixty

The behaviour is likely caused by the fact that the indexes generated for the coordinates are no longer strictly equal, therefore where() picks only the two outer cells of each dimension.

Yes that's right:

v32.indexes["x"].intersection(v64.indexes["x"])
# Float64Index([0.0, 1.0], dtype='float64', name='x')

I think the issue is more general than where() and relates to the alignment of Xarray objects with 32- vs 64-bit indexed coordinates:

xr.align(c32, c64)
# (<xarray.DataArray (x: 10)>
#  array([0.        , 0.11111111, 0.22222222, 0.33333334, 0.44444445,
#         0.5555556 , 0.6666667 , 0.7777778 , 0.8888889 , 1.        ],
#        dtype=float32)
#  Dimensions without coordinates: x,
#  <xarray.DataArray (x: 10)>
#  array([0.        , 0.11111111, 0.22222222, 0.33333333, 0.44444444,
#         0.55555556, 0.66666667, 0.77777778, 0.88888889, 1.        ])
#  Dimensions without coordinates: x)

xr.align(v32.x, v64.x)
# (<xarray.DataArray 'x' (x: 2)>
#  array([0., 1.], dtype=float32)
#  Coordinates:
#    * x        (x) float64 0.0 1.0,
#  <xarray.DataArray 'x' (x: 2)>
#  array([0., 1.])
#  Coordinates:
#    * x        (x) float64 0.0 1.0)

A possible solution would be to handle this special case internally by converting one of the index according to the dtype of the coordinate labels of the other index, similarly to what we are currently doing for the labels that are passed to .sel() (#3153). This should be pretty easy to implement in PandasIndex.join() I think.

However, I'm also wondering whether or not we should consider this as a bug. It would make sense to have a behavior that is consistent with .sel(), even though it is not a free nor transparent operation (implicit creation of a temporary pandas index). But how about .equals()? I.e.,

v32.x.equals(v64.x)
# False  -- Should we return True here?

This would be quite weird and wouldn't match the Xarray, Pandas and Numpy behavior below:

v32.indexes["x"].equals(v64.indexes["x"])
# False

c64.equals(c32)
# False

np.all(c32.values == c64.values)
# False

May 05 '22 20:05 benbovy

It could be coherent to have:

v32.x.equals(v64.x) be false — the indexes themselves aren't the same
the join allow some float imprecision (similar to method=nearest), which would conveniently allow cases like this to work

I could also imagine raising an error here and having the user coerce the type. That seems less surprising that the current situation. Other languages don't allow floats to be compared for equality at all...

May 05 '22 21:05 max-sixty

Maybe we should add an explicit join kwarg, so the safe thing to specify is join="exact"

May 05 '22 21:05 dcherian

the join allow some float imprecision (similar to method=nearest), which would conveniently allow cases like this to work

I like that.

May 16 '22 10:05 forman

I also like the idea of alignment with some tolerance. There is an open PR #4489, which needs to be reworked in the context of the explicit index refactor.

Alternatively to a new kwarg we could add an index build option, e.g., ds.set_xindex("x", index_cls=PandasIndex, align_tolerance=1e-6), but then it is not obvious how to handle different tolerance values given for the indexes to compare. Maybe this could depend on the given join method? E.g., pick the smallest tolerance for join=inner, the largest for join=outer, the tolerance of the left index for join=left, etc.

Sep 28 '22 08:09 benbovy

xarray xarray copied to clipboard

32- vs 64-bit coordinates coordinates in where()

What happened?

What did you expect to happen?

Minimal Complete Verifiable Example

MVCE confirmation

Relevant log output

Anything else we need to know?

Environment

xarray
xarray copied to clipboard