xarray
xarray copied to clipboard
```DataArray.sel``` can silently pick up the nearest point, even if it is far away and the query is out of bounds
What is your issue?
@paulina-t (who found a bug caused by the behavior we report here in a codebase, where it was badly messing things up).
See the example notebook at https://github.com/jerabaul29/public_bug_reports/blob/main/xarray/2023_10_18/interp.ipynb .
Problem
It is always a bit risky to interpolate / find the nearest neighbor to a query or similar, as bad things can happen if querying a value for a point that is outside of the area that is represented. Fortunately, xarray returns NaN if performing interp
outside of the bounds of a dataset:
import xarray as xr
import numpy as np
xr.__version__
'2023.9.0'
data = np.array([[1, 2, 3], [4, 5, 6]])
lat = [10, 20]
lon = [120, 130, 140]
data_xr = xr.DataArray(data, coords={'lat':lat, 'lon':lon}, dims=['lat', 'lon'])
data_xr
<xarray.DataArray (lat: 2, lon: 3)>
array([[1, 2, 3],
[4, 5, 6]])
Coordinates:
* lat (lat) int64 10 20
* lon (lon) int64 120 130 140
# interp is civilized: rather than wildly extrapolating, it returns NaN
data_xr.interp(lat=15, lon=125)
<xarray.DataArray ()>
array(3.)
Coordinates:
lat int64 15
lon int64 125
data_xr.interp(lat=5, lon=125)
<xarray.DataArray ()>
array(nan)
Coordinates:
lat int64 5
lon int64 125
Unfortunately, .sel
will happily find the nearest
neighbor of a point, even if the input point is outside of the dataset range:
# sel is not as civilized: it happily finds the neares neighbor, even if it is "on the one side" of the example data
data_xr.sel(lat=5, lon=125, method='nearest')
<xarray.DataArray ()>
array(2)
Coordinates:
lat int64 10
lon int64 130
This can easily cause tricky bugs.
Discussion
Would it be possible for .sel
to have a behavior that makes the user aware of such issues? I.e. either:
- print a warning on stderr
- return NaN
- raise an exception
when performing a .sel
query that is outside of a dataset range / not in between of 2 dataset points?
I understand that finding the nearest neighbor may still be useful / wanted in some cases even when being outside of the bounds of the dataset, but the fact that this happens silently by default has been causing bugs for us. Could either this default behavior be changed, or maybe enabled with a flag (allow_extrapolate=False
by default for example, so users can consciously opt it in)?
To confirm, this is limited to the case when method="nearest"
is passed; is that correct?
How about adding a method="interp"
to .sel
? Then that's clearly discriminated from method="nearest"
, which would continue to extrapolate.
I think the point of sel is always to select an exact point rather than interpolate? I have not tested the other methods.
(can test tomorrow)
Yes, sorry, my message was unclear. It could be named method="nearest-surrounded"
(or some better name).
My point was that we could offer an option to explicitly select internal points, rather than change the meaning of the existing method, or raise warnings, since I'd guess that some folks do want to allow selecting edges.
That could be fine to me, but this is a bit scary too - that means turning on 'likely bug detection' becomes opt-in? Do you think that having a warning printing out would be so disturbing? Could it make sense to rather turn on a warning by default, with a possible opt-in to actively turn it off through an option flag?
I hadn't thought of this behavior as likely a bug. I also don't use this myself that much, so interested in what others think / whether you're confident here.
I just think it is likely that there are many such uses lurking around with the user unaware that result in silent bugs. At least this was the case for us. For example looking up lat=-120 while lat range was 0...360 and not -180...180 in our case. That resulted in looking up lat 0 instead of expected -120 equivalent to 240.
For example looking up lat=-120 while lat range was 0...360 and not -180...180 in our case.
Yes, it makes sense that this is not intended.
But can we write down a rule that is general enough to handle all cases without much context? Since selecting the nearest value to 8 on (9, 18, 27)
is probably not a bug.
I don't think we can do a lot here (correct me if I'm wrong, though): the method
kwarg is handled by the index implementation (in this case, a pandas.Index
), so we'd need to convince pandas
to implement this.
cc @benbovy
I agree with @keewis. Xarray is just using plain pandas Index.get_indexer method under the hood.
@jerabaul29 To avoid pulling in data which is totally off (lat=-120, with lat-range 0..360) you can utilize tolerance
-kwarg. You'd need to set it to grid_spacing/2 (5 in your case) to select any values using method="nearest"
. This will still select points outside
your data range, but only as far away as grid_spacing/2 and will avoid totally off errors.
This might be solved by something like a PeriodicBoundaryIndex
, where @benbovy is the right person to get involved.
Q: Might be a silly question, where are lat-ranges 0..360 or -180..-180 actually used?
In theory we could implement some logic in the xarray PandasIndex
wrapper such that it departs from pandas.Index
, although I don't think it is always a good idea. In this case I agree with @kmuehlbauer using tolerance
might be good enough?
An alternative index maybe inheriting from PandasIndex
may also be a solution that avoids cluttering the API / behavior of the default index (if you look at the implementation of PandasIndex.sel
it is already quite complicated).
But can we write down a rule that is general enough to handle all cases without much context? Since selecting the nearest value to 8 on (9, 18, 27) is probably not a bug.
This is where I wonder if "just" plotting a warning on stderr, like
"Warning: .sel is used out of bounds; if you are sure this is not a bug, you can turn this warning off by adding the no_out_of_bounds_warning=True
flag (default False)"
would be useful? :) . Then no need to worry about complex logics to find what is likely a bug or not (which may be application dependent), the user just gets warned, and can decide what to do / think about it? But at least ignoring the potential risk is a conscious choice. For us, this would have been enough to get pointed to it and find the bug, instead of needing to spend quite a bit of time debugging :) .
@jerabaul29 To avoid pulling in data which is totally off (lat=-120, with lat-range 0..360) you can utilize tolerance-kwarg. You'd need to set it to grid_spacing/2 (5 in your case) to select any values using method="nearest". This will still select points outside your data range, but only as far away as grid_spacing/2 and will avoid totally off errors.
Thanks for the suggestion of solution :) . The challenge I see is that this is "opt-in", not by default if I understand correctly? - I think there is a quite low probability that all people contributing to a large codebase, think about hardening each line of code, so this looks like a "weak" guarantee to catch such bugs in the future. If possible, I like "conservative / hardened by default" behavior, with opt-in to ignore, rather to turn in, sanity checks - but this is maybe just a bias on my end :) .
Q: Might be a silly question, where are lat-ranges 0..360 or -180..-180 actually used?
Sorry, it was lon not lat :) . Some ERA5 vs satellite images if I remember correctly, with some data using 0..360, and some -180..180, from 2 datasets we had to use together.
Regarding the discussion about pandas index: if pandas is willing to implement something like this on their end, that could be very useful - guess there are other users of pandas that may be bitten by this kind of behavior?
If it is hard / takes time / other issues to have pandas implement this, is there anything blocking from having xarray implement a couple of checks? I guess a couple of lookups with the axis bounds is on average cheap compared to getting a match? Is it correct that the index should be monotonic (we only use monotonic index here, but maybe this is just us?), and if so, would a simple check of the lookup value compared to the min and max be enough to check that the .sel query is within bounds, or is this naive / missing something? :)
I had a similar experience today when testing sel with longitude. I was able to see that sel returned an unexpected result and tracked it down to the mismatched longitude systems. It would have been nice to have a warning message. +1 for that offer.