xarray Unexpected behavior of where on Dataset

What is your issue?

Hello,

First of all, thank you for developing xarray; it has been a big help to me.

Question: I want to apply the where function to a Dataset whose DataArrays do not all have the same dimensions.

My mask has dimensions x and y.
DataArrays that share dimensions x and y should have the mask applied along each extra dimension d (this works as expected).
I expect DataArrays that do not share both x and y not to be filtered. In the example below, where instead adds an extra dimension to img2.

Are my expectations strange? How can this be fixed?

Minimal example

import xarray as xr
import numpy as np

ds=xr.Dataset()
ds['mask']=xr.DataArray(np.random.random([15,10]),dims=['x','y'])
ds['img']=xr.DataArray(np.random.random([15,10,2]),dims=['x','y','d'])
ds['img2']=xr.DataArray(np.random.random([15]),dims=['x'])

print(ds.img2)

ds=ds.where(ds.mask<0.5,drop=True)

print(ds.img2)

Output

<xarray.DataArray 'img2' (x: 15)>
array([0.80137073, 0.00117066, 0.68062196, 0.61115256, 0.62556509,
       0.4765797 , 0.30742119, 0.5647503 , 0.18911253, 0.79291688,
       0.33789015, 0.79486523, 0.46305262, 0.2584704 , 0.4172912 ])
Dimensions without coordinates: x
<xarray.DataArray 'img2' (x: 15, y: 10)>
array([[0.80137073,        nan, 0.80137073,        nan, 0.80137073,
               nan,        nan,        nan,        nan,        nan],
       [       nan, 0.00117066, 0.00117066, 0.00117066,        nan,
        0.00117066, 0.00117066, 0.00117066, 0.00117066, 0.00117066],
       [       nan,        nan, 0.68062196, 0.68062196,        nan,
        0.68062196, 0.68062196, 0.68062196, 0.68062196,        nan],
       [0.61115256, 0.61115256,        nan,        nan, 0.61115256,
               nan,        nan, 0.61115256,        nan,        nan],
       [0.62556509,        nan,        nan,        nan,        nan,
               nan, 0.62556509,        nan, 0.62556509,        nan],
       [       nan, 0.4765797 , 0.4765797 ,        nan, 0.4765797 ,
               nan,        nan, 0.4765797 ,        nan,        nan],
       [0.30742119,        nan,        nan, 0.30742119,        nan,
               nan, 0.30742119,        nan,        nan,        nan],
       [0.5647503 ,        nan,        nan,        nan, 0.5647503 ,
               nan,        nan,        nan, 0.5647503 ,        nan],
       [       nan,        nan, 0.18911253,        nan,        nan,
               nan,        nan, 0.18911253, 0.18911253,        nan],
       [       nan, 0.79291688, 0.79291688, 0.79291688, 0.79291688,
...
        0.2584704 ,        nan, 0.2584704 ,        nan, 0.2584704 ],
       [       nan, 0.4172912 ,        nan, 0.4172912 ,        nan,
        0.4172912 , 0.4172912 ,        nan,        nan, 0.4172912 ]])
Dimensions without coordinates: x, y

Feb 29 '24 11:02 ThomasChauve

Thanks for opening your first issue here at xarray! Be sure to follow the issue template! If you have an idea for a solution, we would really welcome a Pull Request with proposed changes. See the Contributing Guide for more. It may take us a while to respond here, but we really value your contribution. Contributors like you help make xarray better. Thank you!

Feb 29 '24 11:02 welcome[bot]

This behaviour is intended. In xarray, many operations broadcast the arrays first, i.e. xarray tries to bring them to the same shape before applying an operation. If one dimension is missing in one array, the array is repeated along this dimension. Just try ds['img2']+ds['mask'] for example, you will see a similar behaviour. For your problem, you can try to create a subset with the arrays which contain x and y and apply where only to this subset:

arrays_with_xy=[name for name in ds if 'x' in ds[name].dims and 'y' in ds[name].dims]
ds[arrays_with_xy].where(ds.mask<0.5,drop=True)
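
To make the broadcasting concrete, here is a minimal sketch that rebuilds the dataset from the question (with a seeded random generator, which is an addition for reproducibility), shows that plain arithmetic between img2 and mask already produces an (x, y) array, and then applies where only to the variables that carry both x and y. It omits drop=True to keep the sizes unchanged; that choice is an assumption of this sketch, not part of the suggestion above.

```python
import numpy as np
import xarray as xr

# Same layout as the dataset in the question, with a seeded generator.
rng = np.random.default_rng(0)
ds = xr.Dataset()
ds["mask"] = xr.DataArray(rng.random((15, 10)), dims=["x", "y"])
ds["img"] = xr.DataArray(rng.random((15, 10, 2)), dims=["x", "y", "d"])
ds["img2"] = xr.DataArray(rng.random(15), dims=["x"])

# Binary arithmetic broadcasts img2 (x) against mask (x, y):
# img2 is repeated along the missing dimension y.
summed = ds["img2"] + ds["mask"]
print(summed.sizes)

# Restrict where to the variables that have both x and y.
arrays_with_xy = [
    name for name in ds.data_vars if {"x", "y"} <= set(ds[name].dims)
]
masked = ds[arrays_with_xy].where(ds["mask"] < 0.5)
print(list(masked.data_vars))  # img2 is not part of the masked subset
```

The masked subset does not contain img2, so it has to be carried over separately if it is still needed.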

Mar 01 '24 12:03 Ockenfuss

OK, thanks for your answer. I have a workaround, so it is not a problem. It just looked strange to me. Maybe an option in the where function could be useful to provide a different behavior. But if I am the only one having this "problem", it is not necessary. Best, Thomas

Mar 01 '24 12:03 ThomasChauve

You're not the only one, this has been reported quite a few times before: #1234, #2969, #6879, #7587.

Personally I find the default behavior very annoying. I'm almost always using .where to apply some sort of mask and never want this broadcasting. I have a helper function that implements the suggested workaround and rarely use the native xarray version. Would love to see a PR that adds a kwarg to control this behavior. Changing the default would require more discussion and probably a deprecation cycle.

Mar 02 '24 18:03 slevang

Hey, yes, an argument to select the behavior would be a nice option, without changing the default behavior, which I think is important. For those with the same issue, here is my custom function. It might not be super efficient:

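def where_only(ds, ds_condition, **kwargs):
    # Dimensions a variable must have for the mask to apply to it.
    i_set = set(ds_condition.dims)
    # Mask everything first; variables lacking a mask dimension get broadcast here.
    ds_n = ds.where(ds_condition, **kwargs)
    for var_name, data_array in ds.data_vars.items():
        f_set = set(data_array.dims)
        # Put back the original, unmasked variable when it lacks a mask dimension.
        if not i_set.issubset(f_set):
            ds_n[var_name] = data_array

    return ds_n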

where ds is a Dataset and ds_condition is a boolean DataArray.
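
A short usage sketch for the function above, run on a dataset shaped like the one in the original question (the seeded generator is an addition for reproducibility). It checks that img2 keeps its single x dimension while img is still masked. The call leaves drop at its default: with drop=True the masked dataset can end up shorter along x or y than the original, and I would expect putting the original variable back to fail in that case, so treat that combination as untested here.

```python
import numpy as np
import xarray as xr


def where_only(ds, ds_condition, **kwargs):
    # Same logic as the function above, repeated so this block runs on its own.
    i_set = set(ds_condition.dims)
    ds_n = ds.where(ds_condition, **kwargs)
    for var_name, data_array in ds.data_vars.items():
        if not i_set.issubset(set(data_array.dims)):
            ds_n[var_name] = data_array
    return ds_n


rng = np.random.default_rng(0)
ds = xr.Dataset()
ds["mask"] = xr.DataArray(rng.random((15, 10)), dims=["x", "y"])
ds["img"] = xr.DataArray(rng.random((15, 10, 2)), dims=["x", "y", "d"])
ds["img2"] = xr.DataArray(rng.random(15), dims=["x"])

out = where_only(ds, ds["mask"] < 0.5)
print(out["img2"].dims)  # ('x',)
print(out["img"].dims)   # ('x', 'y', 'd')
```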

Best Thomas

Mar 03 '24 08:03 ThomasChauve

The documentation of Dataset.where only says "This operation follows the normal broadcasting and alignment rules that xarray uses for binary arithmetic." That could mention what this means in some additional detail or give a link to a description of those rules. The "indexing and selecting data" introduction page does not cover this topic for datasets. One alternative is to use .sel() and .set_xindex(), which doesn't appear to suffer from this problem. The use of set_xindex is also not covered on the indexing and selecting data page. The topic of subsetting datasets along coordinate dimensions is insufficiently covered in the docs, IMO. So this is a suggestion to improve that. I think a little would go a long way in this case.

also +1 for kw to not broadcast in Dataset.where()
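
As a rough illustration of the .sel() and .set_xindex() alternative, here is a minimal sketch. The dataset, the non-dimension coordinate named label, and the variable names are all made up for this example. It selects along one dimension by coordinate value rather than applying a 2-D mask, so it only covers part of the use case discussed in this thread.

```python
import numpy as np
import xarray as xr

rng = np.random.default_rng(0)

# Hypothetical dataset: "label" is a non-dimension coordinate along x,
# and "other" does not depend on x at all.
ds = xr.Dataset(
    {
        "img": (("x", "d"), rng.random((4, 2))),
        "other": (("t",), rng.random(3)),
    },
    coords={"label": ("x", ["a", "b", "c", "d"])},
)

# Build an index on the non-dimension coordinate, then select by label.
indexed = ds.set_xindex("label")
subset = indexed.sel(label=["a", "c"])

print(subset.sizes)
print(subset["other"].dims)  # ('t',)
```

Because the selection is positional along x, variables without an x dimension are carried over unchanged instead of being broadcast.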

Jun 07 '24 14:06 jmccreight

The documentation of Dataset.where only says "This operation follows the normal broadcasting and alignment rules that xarray uses for binary arithmetic." That could mention what this means in some additional detail or give a link to a description of those rules. The "indexing and selecting data" introduction page does not cover this topic for datasets. One alternative is to use .sel() and .set_xindex(), which doesn't appear to suffer from this problem. The use of set_xindex is also not covered on the indexing and selecting data page. The topic of subsetting datasets along coordinate dimensions is insufficiently covered in the docs, IMO. So this is a suggestion to improve that. I think a little would go a long way in this case.

Contributions to docs very welcome!

In general, docs are tougher than code for open-source projects, because those who write the code often use the docs less often than average. By being newer to the project, you have a precious resource — you can have an outsize impact on projects by finding docs that are confusing and making them better!

Jun 07 '24 19:06 max-sixty
