xarray
xarray copied to clipboard
Interpolate_na: Rework 'limit' argument documentation/implementation
What is your issue?
Currently, the 'limit' argument of interpolate_na
shows some counterintuitive/undocumented behaviour.
Take the following example:
import xarray as xr
import numpy as np
n=np.nan
da=xr.DataArray([n, n, n, 4, 5, n ,n ,n], dims=["y"])
da.interpolate_na('y', limit=1, fill_value='extrapolate')
This will produce the following result:
array([ 1., nan, nan, 4., 5., 6., nan, nan])
Two things are surprising, in my opinion:
- The interpolated value
1
at the beginning is far from any of the given values - The filling is done only towards the 'right'. This asymmetric behaviour is not mentioned in the documentation.
Comparison to pandas
Similar behaviour can be created using pandas with the following arguments:
da=xr.DataArray([n, n, n, 4, 5, n ,n ,n], dims=["y"])
dap=da.to_pandas()
dap.interpolate(method='slinear', limit=1, limit_direction='forward', fill_value='extrapolate')
Output
y
0 NaN
1 NaN
2 NaN
3 4.0
4 5.0
5 6.0
6 NaN
7 NaN
dtype: float64
This is equivalent to the current xarray behaviour, except there is no 1
at the beginning.
Cause
Currently, the fill mask in xarray is implemented using a rolling window operation, where values outside the array are assumed to be valid (therefore the 1
). See xarray.core.missing._get_valid_fill_mask
Possible Solutions
Boundary Issue
Concerning the 1
at the beginning: I think this should be considered a bug. It is likely not what you would expect if you specify a limit. As stated, pandas does not create it as well.
Asymmetric Filling
Concerning the asymmetric filling, I see two options:
- No changes to the code, but mention in the documentation that (effectively), a forward-fill is done.
- Make something similar to what pandas is doing. In pandas, there are two additional arguments controlling the limit behaviour:
limit_direction
is controlling the fill direction (left, right or both).limit_area
effectively controls if we only do interpolation or allow for extrapolation as well.
What do you think?
Two comments if the pandas arguments are introduced:
- If the pandas arguments are introduced, one might raise the question, whether the
max_gap
argument of xarray is still necessary, since pandas does not have such an argument. However, if the user does not want any interpolation if a gap is bigger than a specified length, this is currently not possible with pandas (actually, themax_gap
feature is requested by the pandas community since a long time: pandas-dev/pandas#12187 ) - Coordinate handling: Currently,
limit
is operating on indices,max_gap
is operating on coordinates. Is there a deeper reason for this separation? Both operations limit the interpolation 'distance'. If not, similar touse_coordinates
, one could introduce something likelimit_use_coordinates
. If True,limit
as well asmax_gap
refer to numeric coordinates, otherwise to a linearly increasing index ([0,1,2,...]).
Do you have any thoughts on this?
I think with the following signature, the function will be backward compatible:
interpolate_na(dim, method='linear', use_coordinate=True, limit=None, limit_direction='forward', limit_area=None, limit_use_coordinate=False, max_gap=None)
Thanks for raising this issue @Ockenfuss and thoroughly documenting the behavior.
The interpolated value 1 at the beginning is far from any of the given values
Agreed that is surprising and seems like a bug!
The filling is done only towards the 'right'. This asymmetric behaviour is not mentioned in the documentation
A pull request would be most welcome @Ockenfuss - scoped to either just improve the documentation or with your preferred API change. We can discuss the preferred syntax in the pull request.
Thanks for the feedback @scottyhq! I used the time over christmas and tried to come up with an implementation, see #8577! As you suggested, I pointed out specific details to discuss in the PR.
After the Xarray meeting today (with @keewis and @dcherian) we discussed the preferred option for moving forward with cleaning up the limit
argument. The concensus was to introduce a new keyword argument (e.g., cutoff
) with the preferred default handling of limit, and to soft deprecate the old limit
argument.
Some more results from the discussion:
- This affects all filling methods: interpolate_na, fillna, ffill, bfill, ...
- We want two functionalities: limit (or cutoff) to limit filling distance, max_gap (or ?) to avoid any filling
- Both should work on coordinates or index
- The mask is a valuable product by itself
- We prefer a new API to avoid breaking changes
Here are my proposals:
Masking function Signature:
namask('t', limit=3, limit_coords=3h, max_gap=3, max_gap_coords=3h) #in line with pandas/current xarray
namask('t', limit=3, cutoff=3h, gap_limit=3, gap_cutoff=3h) #cutoff for coordinates, limit for index
For direction control:
namask(limit_direction=forward/backward/both, limit_area=outside/inside/None) #regardless of index or coordinate
Possible API:
Making it an object:
da.namask(...) #masking object
da.namask(...).fillna(123) #return dataarray
da.namask(...).mask #get mask only
Making it a method:
mask=da.namask(...)
da.fillna(123).where(~mask) #Invalid=True to match numpy behaviour?
How to deal with existing methods:
-
da.fillna(123)
shall still work - If any exsiting
limit
,max_gap
is used: Internally forward to new API, raise Deprecation Warning (but possibly support for very long/forever)
Anything else:
This almost feels like introducing numpy.ma
in xarray...
This is beyond the scope of this Issue! But we may copy some of the behaviour from numpy, e.g. invalid values are True
there.