xarray
xarray copied to clipboard
32- vs 64-bit coordinates coordinates in where()
What happened?
I'm struggling whether this is a bug or not. At least I faced a very unexpected behaviour.
For two given data arrays a and b with same dimensions and equal coordinates, c for c = a.where(b) should have equal dimensions and coordinates.
However if the coordinates of a have dtype of float32 and those of b are float64, then the dimension sizes of c will always be two. Of course, this way the coordinates of a and b are no longer exactly equal, but from a user perspective they represent the same labels.
The behaviour is likely caused by the fact that the indexes generated for the coordinates are no longer strictly equal, therefore where() picks only the two outer cells of each dimension. Allowing to explicitly pass indexes may help here, see #6392.
What did you expect to happen?
In the case described above, the dimensions and coordinates of c should be equal to a (and b).
Minimal Complete Verifiable Example
import numpy as np
import xarray as xr
c32 = xr.DataArray(np.linspace(0, 1, 10, dtype=np.float32), dims='x')
c64 = xr.DataArray(np.linspace(0, 1, 10, dtype=np.float64), dims='x')
c3 = c32.where(c64 > 0.5)
assert len(c32) == len(c3)
v32 = xr.DataArray(np.random.random(10), dims='x', coords=dict(x=c32))
v64 = xr.DataArray(np.random.random(10), dims='x', coords=dict(x=c64))
v3 = v32.where(v64 > 0.5)
assert len(v32) == len(v3)
# --> Assertion error, Expected :10, Actual :2
MVCE confirmation
- [X] Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
- [X] Complete example — the example is self-contained, including all data and the text of any traceback.
- [X] Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
- [X] New issue — a search of GitHub Issues suggests this is not a duplicate.
Relevant log output
No response
Anything else we need to know?
No response
Environment
This does seem very odd. Does anyone have any ideas? As per @forman , changing
-c32 = xr.DataArray(np.linspace(0, 1, 10, dtype=np.float32), dims='x')
+c32 = xr.DataArray(np.linspace(0, 1, 10, dtype=np.float64), dims='x')
...causes the assertion to pass.
I'm not sure using floats as indexes is great, but I wouldn't have expected the results to be like this...
The behaviour is likely caused by the fact that the indexes generated for the coordinates are no longer strictly equal, therefore where() picks only the two outer cells of each dimension.
Yes that's right:
v32.indexes["x"].intersection(v64.indexes["x"])
# Float64Index([0.0, 1.0], dtype='float64', name='x')
I think the issue is more general than where() and relates to the alignment of Xarray objects with 32- vs 64-bit indexed coordinates:
xr.align(c32, c64)
# (<xarray.DataArray (x: 10)>
# array([0. , 0.11111111, 0.22222222, 0.33333334, 0.44444445,
# 0.5555556 , 0.6666667 , 0.7777778 , 0.8888889 , 1. ],
# dtype=float32)
# Dimensions without coordinates: x,
# <xarray.DataArray (x: 10)>
# array([0. , 0.11111111, 0.22222222, 0.33333333, 0.44444444,
# 0.55555556, 0.66666667, 0.77777778, 0.88888889, 1. ])
# Dimensions without coordinates: x)
xr.align(v32.x, v64.x)
# (<xarray.DataArray 'x' (x: 2)>
# array([0., 1.], dtype=float32)
# Coordinates:
# * x (x) float64 0.0 1.0,
# <xarray.DataArray 'x' (x: 2)>
# array([0., 1.])
# Coordinates:
# * x (x) float64 0.0 1.0)
A possible solution would be to handle this special case internally by converting one of the index according to the dtype of the coordinate labels of the other index, similarly to what we are currently doing for the labels that are passed to .sel() (#3153). This should be pretty easy to implement in PandasIndex.join() I think.
However, I'm also wondering whether or not we should consider this as a bug. It would make sense to have a behavior that is consistent with .sel(), even though it is not a free nor transparent operation (implicit creation of a temporary pandas index). But how about .equals()? I.e.,
v32.x.equals(v64.x)
# False -- Should we return True here?
This would be quite weird and wouldn't match the Xarray, Pandas and Numpy behavior below:
v32.indexes["x"].equals(v64.indexes["x"])
# False
c64.equals(c32)
# False
np.all(c32.values == c64.values)
# False
It could be coherent to have:
v32.x.equals(v64.x)be false — the indexes themselves aren't the same- the join allow some float imprecision (similar to
method=nearest), which would conveniently allow cases like this to work
I could also imagine raising an error here and having the user coerce the type. That seems less surprising that the current situation. Other languages don't allow floats to be compared for equality at all...
Maybe we should add an explicit join kwarg, so the safe thing to specify is join="exact"
the join allow some float imprecision (similar to
method=nearest), which would conveniently allow cases like this to work
I like that.
I also like the idea of alignment with some tolerance. There is an open PR #4489, which needs to be reworked in the context of the explicit index refactor.
Alternatively to a new kwarg we could add an index build option, e.g., ds.set_xindex("x", index_cls=PandasIndex, align_tolerance=1e-6), but then it is not obvious how to handle different tolerance values given for the indexes to compare. Maybe this could depend on the given join method? E.g., pick the smallest tolerance for join=inner, the largest for join=outer, the tolerance of the left index for join=left, etc.