xarray
DataArray.where() can truncate strings with `<U` dtypes
What happened?
I want to replace all "=" occurrences in an xr.DataArray called sign with "<=".
sign_c = sign.where(sign != "=", "<=")
The resulting DataArray contains "<" instead of "<=", though. This only happens if sign contains nothing but "=" entries.
What did you expect to happen?
That all "=" occurrences in sign are replaced with "<=".
Minimal Complete Verifiable Example
import xarray as xr
sign_1 = xr.DataArray(["="])
sign_2 = xr.DataArray(["=","<="])
sign_3 = xr.DataArray(["=","="])
sign_1_c = sign_1.where(sign_1 != "=", "<=")
sign_2_c = sign_2.where(sign_2 != "=", "<=")
sign_3_c = sign_3.where(sign_3 != "=", "<=")
print(sign_1_c)
print(sign_2_c)
print(sign_3_c)
MVCE confirmation
- [X] Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
- [X] Complete example — the example is self-contained, including all data and the text of any traceback.
- [X] Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
- [X] New issue — a search of GitHub Issues suggests this is not a duplicate.
- [X] Recent environment — the issue occurs with the latest version of xarray and its dependencies.
Relevant log output
print(sign_1_c)
<xarray.DataArray (dim_0: 1)> Size: 4B
array(['<'], dtype='<U1')
Dimensions without coordinates: dim_0
print(sign_2_c)
<xarray.DataArray (dim_0: 2)> Size: 16B
array(['<=', '<='], dtype='<U2')
Dimensions without coordinates: dim_0
print(sign_3_c)
<xarray.DataArray (dim_0: 2)> Size: 8B
array(['<', '<'], dtype='<U1')
Dimensions without coordinates: dim_0
Anything else we need to know?
No response
Environment
This is because the data type of the array is <U1, so it's truncating any string longer than that.
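The truncation can be reproduced with plain NumPy, independent of xarray. The snippet below is only a sketch of the fixed-width behaviour described above; the variable names are made up for illustration.

```python
import numpy as np

# An array holding only "=" is inferred as a one-character unicode dtype.
signs = np.array(["=", "="])
print(signs.dtype.kind, signs.dtype.itemsize // 4)  # kind "U", 1 character

# Writing a longer string into the fixed-width array silently cuts it off.
signs_truncated = signs.copy()
signs_truncated[:] = "<="
print(signs_truncated.tolist())  # ['<', '<']

# Widening the dtype first leaves room for the two-character replacement.
signs_wide = signs.astype("U2")
signs_wide[signs_wide == "="] = "<="
print(signs_wide.tolist())  # ['<=', '<=']
```

Casting to a wider dtype before replacing is one way to sidestep the problem when the maximum string length is known in advance.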
I think that's really confusing behavior.
Does anyone know whether this has always been the case? I admittedly don't use strings that much...
@max-sixty thanks a lot for your quick reply!
I can confirm that it worked at least until 2024.3.0. (I didn't update in the meantime, but I could do that)
EDIT: a colleague told me it probably worked until 2024.5.0, but I haven't tried that.
not sure whether this used to work (it could have), but the new string dtype in numpy>=2 completely removes this kind of issue.
OK, if it works on numpy>=2, I guess we deprioritize...
note that at the moment you still get the old character-based string dtypes by default, so you have to explicitly opt into the new string dtype (using np.dtypes.StringDType, if I remember correctly).
Ah OK. So maybe we don't deprioritize :)
I just had a better look at this issue, and I believe it relates to us preferring explicit dtypes over implicit dtypes. What happens within xarray is:
np.result_type(np.dtype("<U1"), type("<=")) # `str` does not have a length, so the explicit dtype is taken
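The difference between promoting with the `str` type and promoting with the dtype of an actual value can be checked directly in NumPy. A small sketch (the variable names are only illustrative):

```python
import numpy as np

array_dtype = np.dtype("U1")

# Promoting with the bare `str` type: it carries no length, so the
# one-character dtype of the array is kept.
via_type = np.result_type(array_dtype, str)

# Promoting with the dtype of a 0d array holding the replacement value:
# the length of "<=" is known, so the result is wide enough for it.
via_value = np.result_type(array_dtype, np.array("<=").dtype)

print(via_type, via_value)
```

The second form is the reason passing a 0d array as the replacement value avoids the truncation.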
To work around that, we can pass a 0d array to where to explicitly dtype the new string:
sign_3.where(sign_3 != "=", np.array("<="))
but I'm not sure how to best fix this in general. In theory, we could special-case pre-numpy=2 string arrays and drop the length:
# instead of `preprocess_scalar_types`
def preprocess_types(t):
    if isinstance(t, str | bytes):
        return type(t)
    elif isinstance(dtype := getattr(t, "dtype", t), np.dtypes.StrDType | np.dtypes.BytesDType):
        return dtype.type
    return t
Edit: though the best way would be to have np.result_type cast <U1 + str to <U automatically (and the same for S)