xarray icon indicating copy to clipboard operation
xarray copied to clipboard

```DataArray.sel``` can silently pick up the nearest point, even if it is far away and the query is out of bounds

Open jerabaul29 opened this issue 1 year ago • 13 comments

What is your issue?

@paulina-t (who found a bug caused by the behavior we report here in a codebase, where it was badly messing things up).

See the example notebook at https://github.com/jerabaul29/public_bug_reports/blob/main/xarray/2023_10_18/interp.ipynb .


Problem

It is always a bit risky to interpolate / find the nearest neighbor to a query or similar, as bad things can happen if querying a value for a point that is outside of the area that is represented. Fortunately, xarray returns NaN if performing interp outside of the bounds of a dataset:

import xarray as xr
import numpy as np

xr.__version__

'2023.9.0'

data = np.array([[1, 2, 3], [4, 5, 6]])
lat = [10, 20]
lon = [120, 130, 140]

data_xr = xr.DataArray(data, coords={'lat':lat, 'lon':lon}, dims=['lat', 'lon'])

data_xr

<xarray.DataArray (lat: 2, lon: 3)>
array([[1, 2, 3],
       [4, 5, 6]])
Coordinates:
  * lat      (lat) int64 10 20
  * lon      (lon) int64 120 130 140

# interp is civilized: rather than wildly extrapolating, it returns NaN

data_xr.interp(lat=15, lon=125)

<xarray.DataArray ()>
array(3.)
Coordinates:
    lat      int64 15
    lon      int64 125

data_xr.interp(lat=5, lon=125)

<xarray.DataArray ()>
array(nan)
Coordinates:
    lat      int64 5
    lon      int64 125

Unfortunately, .sel will happily find the nearest neighbor of a point, even if the input point is outside of the dataset range:

# sel is not as civilized: it happily finds the neares neighbor, even if it is "on the one side" of the example data

data_xr.sel(lat=5, lon=125, method='nearest')

<xarray.DataArray ()>
array(2)
Coordinates:
    lat      int64 10
    lon      int64 130

This can easily cause tricky bugs.


Discussion

Would it be possible for .sel to have a behavior that makes the user aware of such issues? I.e. either:

  • print a warning on stderr
  • return NaN
  • raise an exception

when performing a .sel query that is outside of a dataset range / not in between of 2 dataset points?

I understand that finding the nearest neighbor may still be useful / wanted in some cases even when being outside of the bounds of the dataset, but the fact that this happens silently by default has been causing bugs for us. Could either this default behavior be changed, or maybe enabled with a flag (allow_extrapolate=False by default for example, so users can consciously opt it in)?

jerabaul29 avatar Oct 19 '23 08:10 jerabaul29

To confirm, this is limited to the case when method="nearest" is passed; is that correct?

How about adding a method="interp" to .sel? Then that's clearly discriminated from method="nearest", which would continue to extrapolate.

max-sixty avatar Oct 19 '23 18:10 max-sixty

I think the point of sel is always to select an exact point rather than interpolate? I have not tested the other methods.

jerabaul29 avatar Oct 19 '23 18:10 jerabaul29

(can test tomorrow)

jerabaul29 avatar Oct 19 '23 18:10 jerabaul29

Yes, sorry, my message was unclear. It could be named method="nearest-surrounded" (or some better name).

My point was that we could offer an option to explicitly select internal points, rather than change the meaning of the existing method, or raise warnings, since I'd guess that some folks do want to allow selecting edges.

max-sixty avatar Oct 19 '23 18:10 max-sixty

That could be fine to me, but this is a bit scary too - that means turning on 'likely bug detection' becomes opt-in? Do you think that having a warning printing out would be so disturbing? Could it make sense to rather turn on a warning by default, with a possible opt-in to actively turn it off through an option flag?

jerabaul29 avatar Oct 19 '23 19:10 jerabaul29

I hadn't thought of this behavior as likely a bug. I also don't use this myself that much, so interested in what others think / whether you're confident here.

max-sixty avatar Oct 19 '23 19:10 max-sixty

I just think it is likely that there are many such uses lurking around with the user unaware that result in silent bugs. At least this was the case for us. For example looking up lat=-120 while lat range was 0...360 and not -180...180 in our case. That resulted in looking up lat 0 instead of expected -120 equivalent to 240.

jerabaul29 avatar Oct 19 '23 19:10 jerabaul29

For example looking up lat=-120 while lat range was 0...360 and not -180...180 in our case.

Yes, it makes sense that this is not intended.

But can we write down a rule that is general enough to handle all cases without much context? Since selecting the nearest value to 8 on (9, 18, 27) is probably not a bug.

max-sixty avatar Oct 19 '23 20:10 max-sixty

I don't think we can do a lot here (correct me if I'm wrong, though): the method kwarg is handled by the index implementation (in this case, a pandas.Index), so we'd need to convince pandas to implement this.

cc @benbovy

keewis avatar Oct 19 '23 21:10 keewis

I agree with @keewis. Xarray is just using plain pandas Index.get_indexer method under the hood.

@jerabaul29 To avoid pulling in data which is totally off (lat=-120, with lat-range 0..360) you can utilize tolerance-kwarg. You'd need to set it to grid_spacing/2 (5 in your case) to select any values using method="nearest". This will still select points outside your data range, but only as far away as grid_spacing/2 and will avoid totally off errors.

This might be solved by something like a PeriodicBoundaryIndex, where @benbovy is the right person to get involved.

Q: Might be a silly question, where are lat-ranges 0..360 or -180..-180 actually used?

kmuehlbauer avatar Oct 20 '23 06:10 kmuehlbauer

In theory we could implement some logic in the xarray PandasIndex wrapper such that it departs from pandas.Index, although I don't think it is always a good idea. In this case I agree with @kmuehlbauer using tolerance might be good enough?

An alternative index maybe inheriting from PandasIndex may also be a solution that avoids cluttering the API / behavior of the default index (if you look at the implementation of PandasIndex.sel it is already quite complicated).

benbovy avatar Oct 20 '23 07:10 benbovy

But can we write down a rule that is general enough to handle all cases without much context? Since selecting the nearest value to 8 on (9, 18, 27) is probably not a bug.

This is where I wonder if "just" plotting a warning on stderr, like

"Warning: .sel is used out of bounds; if you are sure this is not a bug, you can turn this warning off by adding the no_out_of_bounds_warning=Trueflag (default False)"

would be useful? :) . Then no need to worry about complex logics to find what is likely a bug or not (which may be application dependent), the user just gets warned, and can decide what to do / think about it? But at least ignoring the potential risk is a conscious choice. For us, this would have been enough to get pointed to it and find the bug, instead of needing to spend quite a bit of time debugging :) .

@jerabaul29 To avoid pulling in data which is totally off (lat=-120, with lat-range 0..360) you can utilize tolerance-kwarg. You'd need to set it to grid_spacing/2 (5 in your case) to select any values using method="nearest". This will still select points outside your data range, but only as far away as grid_spacing/2 and will avoid totally off errors.

Thanks for the suggestion of solution :) . The challenge I see is that this is "opt-in", not by default if I understand correctly? - I think there is a quite low probability that all people contributing to a large codebase, think about hardening each line of code, so this looks like a "weak" guarantee to catch such bugs in the future. If possible, I like "conservative / hardened by default" behavior, with opt-in to ignore, rather to turn in, sanity checks - but this is maybe just a bias on my end :) .

Q: Might be a silly question, where are lat-ranges 0..360 or -180..-180 actually used?

Sorry, it was lon not lat :) . Some ERA5 vs satellite images if I remember correctly, with some data using 0..360, and some -180..180, from 2 datasets we had to use together.


Regarding the discussion about pandas index: if pandas is willing to implement something like this on their end, that could be very useful - guess there are other users of pandas that may be bitten by this kind of behavior?

If it is hard / takes time / other issues to have pandas implement this, is there anything blocking from having xarray implement a couple of checks? I guess a couple of lookups with the axis bounds is on average cheap compared to getting a match? Is it correct that the index should be monotonic (we only use monotonic index here, but maybe this is just us?), and if so, would a simple check of the lookup value compared to the min and max be enough to check that the .sel query is within bounds, or is this naive / missing something? :)

jerabaul29 avatar Oct 20 '23 07:10 jerabaul29

I had a similar experience today when testing sel with longitude. I was able to see that sel returned an unexpected result and tracked it down to the mismatched longitude systems. It would have been nice to have a warning message. +1 for that offer.

jenseva avatar Apr 29 '24 23:04 jenseva