xarray
Unexpected behavior of where on Dataset
What is your issue?
Hello
First of all, thank you for developing xarray; it has been a big help to me.
Question
I want to apply the where function to a dataset whose DataArrays do not all have the same dimensions.
- My mask has dimensions x and y.
- DataArrays that have both x and y should have the mask applied along each extra dimension d (this works).
- I expect DataArrays that do not have both x and y to be left unfiltered. In the example below, where instead adds an extra dimension y to img2.
Are my expectations strange? How can this be fixed?
Minimal example
import xarray as xr
import numpy as np
ds=xr.Dataset()
ds['mask']=xr.DataArray(np.random.random([15,10]),dims=['x','y'])
ds['img']=xr.DataArray(np.random.random([15,10,2]),dims=['x','y','d'])
ds['img2']=xr.DataArray(np.random.random([15]),dims=['x'])
print(ds.img2)
ds=ds.where(ds.mask<0.5,drop=True)
print(ds.img2)
Output
<xarray.DataArray 'img2' (x: 15)>
array([0.80137073, 0.00117066, 0.68062196, 0.61115256, 0.62556509,
0.4765797 , 0.30742119, 0.5647503 , 0.18911253, 0.79291688,
0.33789015, 0.79486523, 0.46305262, 0.2584704 , 0.4172912 ])
Dimensions without coordinates: x
<xarray.DataArray 'img2' (x: 15, y: 10)>
array([[0.80137073, nan, 0.80137073, nan, 0.80137073,
nan, nan, nan, nan, nan],
[ nan, 0.00117066, 0.00117066, 0.00117066, nan,
0.00117066, 0.00117066, 0.00117066, 0.00117066, 0.00117066],
[ nan, nan, 0.68062196, 0.68062196, nan,
0.68062196, 0.68062196, 0.68062196, 0.68062196, nan],
[0.61115256, 0.61115256, nan, nan, 0.61115256,
nan, nan, 0.61115256, nan, nan],
[0.62556509, nan, nan, nan, nan,
nan, 0.62556509, nan, 0.62556509, nan],
[ nan, 0.4765797 , 0.4765797 , nan, 0.4765797 ,
nan, nan, 0.4765797 , nan, nan],
[0.30742119, nan, nan, 0.30742119, nan,
nan, 0.30742119, nan, nan, nan],
[0.5647503 , nan, nan, nan, 0.5647503 ,
nan, nan, nan, 0.5647503 , nan],
[ nan, nan, 0.18911253, nan, nan,
nan, nan, 0.18911253, 0.18911253, nan],
[ nan, 0.79291688, 0.79291688, 0.79291688, 0.79291688,
...
0.2584704 , nan, 0.2584704 , nan, 0.2584704 ],
[ nan, 0.4172912 , nan, 0.4172912 , nan,
0.4172912 , 0.4172912 , nan, nan, 0.4172912 ]])
Dimensions without coordinates: x, y
Thanks for opening your first issue here at xarray! Be sure to follow the issue template! If you have an idea for a solution, we would really welcome a Pull Request with proposed changes. See the Contributing Guide for more. It may take us a while to respond here, but we really value your contribution. Contributors like you help make xarray better. Thank you!
This behaviour is intended. In xarray, many operations broadcast the arrays first, i.e. xarray tries to bring them to the same shape before applying an operation. If a dimension is missing from one array, the array is repeated along that dimension. Just try ds['img2']+ds['mask'], for example, and you will see similar behaviour. For your problem, you can create a subset with the arrays that contain both x and y and apply where only to this subset:
arrays_with_xy=[name for name in ds if 'x' in ds[name].dims and 'y' in ds[name].dims]
ds[arrays_with_xy].where(ds.mask<0.5,drop=True)
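To make the broadcasting and the subset workaround concrete, here is a minimal sketch built on the dataset from the original example. The names summed, masked and result are only illustrative, and drop=True is left out to keep the array sizes unchanged; putting img2 back with assign is one possible way to recombine the variables, not something the thread prescribes.

```python
import numpy as np
import xarray as xr

ds = xr.Dataset()
ds["mask"] = xr.DataArray(np.random.random([15, 10]), dims=["x", "y"])
ds["img"] = xr.DataArray(np.random.random([15, 10, 2]), dims=["x", "y", "d"])
ds["img2"] = xr.DataArray(np.random.random([15]), dims=["x"])

# Binary arithmetic broadcasts img2 (x) against mask (x, y),
# so the result gains the y dimension, just like where() does.
summed = ds["img2"] + ds["mask"]
print(summed.dims)

# Workaround: apply where() only to variables that have both x and y.
arrays_with_xy = [name for name in ds.data_vars if {"x", "y"} <= set(ds[name].dims)]
masked = ds[arrays_with_xy].where(ds["mask"] < 0.5)

# Put the untouched variable back (illustrative choice; no drop=True here).
result = masked.assign(img2=ds["img2"])
print(result["img2"].dims)
```

With this approach img2 keeps its single x dimension, while img is still masked along x and y for every value of d.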
OK, thanks for your answer. I have a workaround, so it is not a problem. It just looked strange to me. Maybe an option in the where function could be useful to provide a different behavior. But if I am the only one having this "problem", it is not necessary. Best, Thomas
You're not the only one, this has been reported quite a few times before: #1234, #2969, #6879, #7587.
Personally I find the default behavior very annoying. I'm almost always using .where to apply some sort of mask and never want this broadcasting. I have a helper function that implements the suggested workaround and rarely use the native xarray version. Would love to see a PR that adds a kwarg to control this behavior. Changing the default would require more discussion and probably a deprecation cycle.
Hey, yes, an argument to select the behavior could be a nice option, without changing the default behavior, because I think the default is important. For those with the same issue, here is my custom function. It might not be super efficient:
def where_only(ds,ds_condition,**kwargs):
    # Dimensions of the condition
    i_set=set(ds_condition.dims)
    ds_n=ds.where(ds_condition,**kwargs)
    for var_name, data_array in ds.data_vars.items():
        f_set=set(data_array.dims)
        # Restore the original variable if it lacks any dimension of the condition
        if not i_set.issubset(f_set):
            ds_n[var_name]=data_array
    return ds_n
where ds is a Dataset and ds_condition is a boolean DataArray.
Best, Thomas
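A short usage sketch for this helper, applied to the dataset from the original example. The helper is repeated so the snippet runs on its own; the seeded generator and the name out are assumptions made for illustration. drop=True is not passed here: with it, where can shrink x or y, and reassigning a full-length variable may then fail with a size conflict, which this sketch does not handle.

```python
import numpy as np
import xarray as xr


def where_only(ds, ds_condition, **kwargs):
    """Apply ds.where, then restore variables lacking a dimension of the condition."""
    cond_dims = set(ds_condition.dims)
    ds_n = ds.where(ds_condition, **kwargs)
    for var_name, data_array in ds.data_vars.items():
        if not cond_dims.issubset(data_array.dims):
            # This variable was broadcast by where(); put the original back.
            ds_n[var_name] = data_array
    return ds_n


rng = np.random.default_rng(0)  # seeded only to make the sketch repeatable
ds = xr.Dataset()
ds["mask"] = xr.DataArray(rng.random([15, 10]), dims=["x", "y"])
ds["img"] = xr.DataArray(rng.random([15, 10, 2]), dims=["x", "y", "d"])
ds["img2"] = xr.DataArray(rng.random([15]), dims=["x"])

out = where_only(ds, ds["mask"] < 0.5)
print(out["img2"].dims)                 # img2 keeps only its x dimension
print(int(out["img"].isnull().sum()))   # number of masked entries in img
```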
The documentation of Dataset.where only says "This operation follows the normal broadcasting and alignment rules that xarray uses for binary arithmetic." It could explain what this means in more detail or link to a description of those rules. The "indexing and selecting data" introduction page does not cover this topic for datasets. One alternative is to use .sel() and .set_xindex(), which do not appear to suffer from this problem. The use of set_xindex is also not covered on the indexing and selecting data page. The topic of subsetting datasets along coordinate dimensions is insufficiently covered in the docs, IMO, so this is a suggestion to improve that. I think a little would go a long way in this case.
Also +1 for a kwarg to not broadcast in Dataset.where()
Contributions to docs very welcome!
In general, docs are tougher than code for open-source projects, because those who write the code often use the docs less often than average. By being newer to the project, you have a precious resource — you can have an outsize impact on projects by finding docs that are confusing and making them better!