xarray icon indicating copy to clipboard operation
xarray copied to clipboard

Fill gaps limited 7665

Open Ockenfuss opened this issue 1 year ago • 7 comments

Improve gap filling

  • Closes #7665
  • Combine current gap limiting in xarray (max_gap) with capabilities of pandas (limit_direction, limit_area) regarding filling of missing values
  • Provide a common function fill_gaps() to control gap length for all filling functions (ffill/bfill/interpolate_na/fillna)

This PR involves a full implementation, documentation and corresponding tests. As discussed in #7665, a new API function fill_gaps was introduced, to avoid breaking changes of existing code and keep the argument inflation of the filling functions to a minimum.

  • [x] Closes #7665 (also #9392 )
  • [x] Tests added

TBD if accepted:

  • [ ] User visible changes (including notable bug fixes) are documented in whats-new.rst
  • [ ] New functions/methods are listed in api.rst
  • [ ] Add Example in Documentation

Deprecations

  • Using limit in interpolate_na now requires numbagg or bottleneck. So far, limit worked without bottleneck, max_gap required it (not documented). Reason: max_gap and limit now share the same codebase using ffill. Since no one ever complained about the missing documentation and the error message is pretty clear and easy to resolve ("No module named 'bottleneck"), I hope this might be acceptable.

Ockenfuss avatar Aug 23 '24 16:08 Ockenfuss

Basically all the new methods are missing return types, please consider adding them.

The new class should be generic, I have added a few review comments here and there but more adoptions are needed.

Thanks a lot for the review of the typing! I`ll work through them asap and try to make all adoptions (probabily still need some help afterwards, though ...)

Ockenfuss avatar Aug 24 '24 20:08 Ockenfuss

@headtr1ck I did my best to add type hints everywhere. Not sure how to resolve the last four remaining mypy errors...

Ockenfuss avatar Aug 26 '24 12:08 Ockenfuss

@headtr1ck I did my best to add type hints everywhere. Not sure how to resolve the last four remaining mypy errors...

I'm currently on holiday and it's difficult to debug these errors on the phone. I can do it in 1-2 weeks or someone else can help out?

headtr1ck avatar Aug 30 '24 16:08 headtr1ck

There are a couple of outstanding questions about the interfaces; in particular:

  • forward vs back vs direction: https://github.com/pydata/xarray/pull/9402#discussion_r1730481171
  • some names: https://github.com/pydata/xarray/pull/9402#discussion_r1742677060
  • (anything else @Ockenfuss ?)

Re typing — are the interfaces typed correctly? I think they are. Assuming those are correct, I would vote that it's fine to add ignores internally (ofc the internal checks are useful tests for the external interfaces, so ideally we'd fix those)

What's the best way of resolving those outstanding questions @pydata/xarray ? I know I've been terrible on attending calls now I'm on PT, but maybe we could do it on one of those? (would be great to minimize the delay on these sorts of PRs — I think they're really valuable and would love to engender more...)

max-sixty avatar Sep 10 '24 19:09 max-sixty

One more naming decision:

  • https://github.com/pydata/xarray/pull/9402#discussion_r1729342319 I am on conference this week, but mostly available the next two weeks, if anyone wants to discuss. Otherwise, I will join the next dev meeting on Sep. 25.

Ockenfuss avatar Sep 11 '24 16:09 Ockenfuss

Did you manage to discuss this this morning?

max-sixty avatar Sep 25 '24 23:09 max-sixty

Did you manage to discuss this this morning?

@max-sixty Yes, @Ockenfuss explained what decisions are to be made and encouraged to have a look here.

kmuehlbauer avatar Sep 26 '24 05:09 kmuehlbauer

Hi @max-sixty, not too much community feedback yet :) What do you think, shall we have another look on the three open naming issues and then push this over the finish line? I still think this is quite useful functionality for all experimentalists, where gaps of all sizes are common in measurement data.

Ockenfuss avatar Jan 19 '25 17:01 Ockenfuss

What do you think, shall we have another look on the three open naming issues and then push this over the finish line?

Yes let's do it! I would bias for action; if folks disagree then dissent welcome.

What's remaining? Personally I think we should simplify this: https://github.com/pydata/xarray/pull/9402#discussion_r1730481171 + resolve any names. The external types should be correct; internally some ignores are OK...

max-sixty avatar Jan 20 '25 20:01 max-sixty

Ok, I think from my side this is ready for merge. I stayed with the fill_gaps and limit_area names for now. Check out the doc build pages for fill_gaps and the GapMask object to get a feeling for it and let me know if it feels right :)

This PR leaves all existing fill function API untouched. If we want to bring some of the new limit arguments to the existing functions (like DataArray.fillna(limit=...)), this could be done later in separate PRs.

TBD:

  • Update Whatsnew

Ockenfuss avatar Jan 22 '25 17:01 Ockenfuss

Looks great, I think this is indeed very close to landing.

The one thing I'm stuck on is the concept of fill_direction="backward" ... .ffill... — can we explain what the purpose of this is? My mental model is that we're filling missing data from an edge in a single direction; I don't understand how what's going on to fill backwards in a forward fill.

(Totally fine if others get this, I'm not the gatekeeper here, but do think it's important the API makes sense...)

max-sixty avatar Jan 23 '25 02:01 max-sixty

See this comment for an explanation, or just try out what happens :) For example:

da=xr.DataArray([1, np.nan, np.nan,  2], dims=['x'])
da.fill_gaps('x', limit=1, limit_direction='backward').ffill()
<xarray.DataArray (x: 4)> Size: 32B
array([ 1., nan,  1.,  2.])
Dimensions without coordinates: x

I agree, this is maybe not obvious. Here, I explained why I left it as is. If you want, I can try to keep track of the limit_direction argument upon mask creation and raise an exception, if the user applies ffill later. But this requires some tweaks in the code.

Edit: I added a sentence to the fill_gaps doc page to explain the behaviour better.

Ockenfuss avatar Jan 23 '25 09:01 Ockenfuss

Yes, I see what it does, but (very very respectfully, with much admiration for the overall PR!), it doesn't make semantic sense to me.

When I think "backward", I mean "look behind you, is there a gap? OK now fill it". I don't see how it can mean "start one step back, and then decide which direction to continue". Why only one step?

I think we should try hard to have xarray's interfaces be understandable without much technical knowledge. (Indeed, I think there's probably a more precise way of describing my objection around edges and continuous vs. discrete, but hopefully this suffices).

Would you object to removing the limit_direction param, and deferring that to the method?

max-sixty avatar Jan 23 '25 19:01 max-sixty

Thank you very much for the feedback and efforts! I am not entirely sure what you mean with decide and continuous vs discrete, I have to admit. I drew you a little picture, to visualize how I see those things :) 1000039597

But I think, I can rework the API. Just to clarify before I start, what you aim for is the following?

da.fill_gaps('x', limit=1).ffill() # no direction argument anymore in the first function

In the case of interpolate_na and fillna, which work in all directions:

da.fill_gaps('x', limit=1).interpolate_na(limit_direction='forward')

Ockenfuss avatar Jan 23 '25 22:01 Ockenfuss

Ha, thanks a lot for the drawing! I appreciate it!

FWIW the step there that didn't make sense to me is "one step backward" for bfill & ffill — my mental model is that filling doesn't require this, and if "one step backward" isn't in others' mental models, they'll also be confused by it...

interpolate_na is an interesting one, since you're correct that can go both ways:

  • Having a kwarg on interpolate_na is reasonable.
  • I could imagine a builder pattern where we just keep running methods .interpolate(...).with_direction(...), but that's not really pythonic and is inconsistent with xarray's API
  • or have limit_direction as it is now, but default to the fill direction and raise an error if it's not the same

max-sixty avatar Jan 24 '25 00:01 max-sixty

I went for option 3: Default to 'forward' in case of ffill and 'backward' in case of bfill. If this is not the case, an error is raised. (This is similar to what pandas is doing, check out the docstring here.) I adapted the code and docs and added some tests to check this behaviour. What do you think?

Ockenfuss avatar Jan 24 '25 11:01 Ockenfuss

I'm late to the discussion (thanks for sticking with us for so long, even though we've been really, really slow!), but to me ffill with limit_direction="backwards" seems to mean "take the value before the gap, but start filling (and counting) from the back" (and the inverse for bfill: "take the values from after the gap and start filling from the front of the gap"). Which I also find odd, because I can't imagine a use case for that (maybe there is one I just can't see?). We can't use pandas as a precedent; that only has limit_area which is a different thing. I do get the value of limit_direction for interpolate, though. So the question I think is: are we confident that this will end up being useful? Or should this rather be postponed until we have a clear use case?

Other than that, the only suggestion I have is to look at the naming of the API (I guess I'm bikeshedding?):

ds.fill_gaps(...).interpolate(...)  # no need to call it `interpolate_na` since it already is in a namespace?

kinda works, but

ds.fill_gaps(...).ffill(...)
ds.fill_gaps(...).bfill(...)

repeats "fill" and feels a bit awkward (it does fit with the limit* kwargs, though). What about just gaps or, if we don't want to assume everyone understands gaps the same way, na? That would make the API:

ds.gaps(...).ffill(...)
ds.gaps(...).interpolate(...)

or

ds.na(...).ffill(...)
ds.na(...).interpolate(...)

The latter may be a bit short and missing the semantic hint of "we're working on gaps", though. (And either way, these are suggestions, not requests, so feel free to reject them if it doesn't make too much sense or if it has already been discussed in a previous version of this PR).

keewis avatar Jan 24 '25 23:01 keewis

Thanks you, no worries, all feedback is welcome everytime!!

ffill+direction backward issue

Your understanding is correct. I have no immediate use case for it, but I thought initially why forbidding something that is possible. However, Max also thought this case may be confusing, so we decided to raise an error if attempting it now.

Names

I am open to anything! I am with you that having "gap" somehow in the name helps for the understanding. Just "gaps" is also fine for me.

Anything else

You might wonder: Why a new fill_gaps API function and not just equip the existing functions with new arguments (e.g. for ffill: limit, limit_area, max_gap, limit_use_coordinate)? Originally, we were mostly concerened about interpolate_na. Stephan and Deepak recommended a new API, since the existing interpolate_na has a very specific legacy behaviour. So for interpolate_na, we need a new API if we want to avoid deprecations. For ffill/bfill, both solutions have something in favour: The current implementation keeps all gap options nicely in one place. Directly adding the arguments instead would allow for some more freedom, e.g. the limit_direction argument could simply be omitted.

Ockenfuss avatar Jan 25 '25 14:01 Ockenfuss

Great!

For the name, a verb is consistent with the existing interface (resample, groupby, etc). fill_gaps seems quite reasonable, and I can't think of anything that's a single word...

...unless fill could work? It contends with np.fill being more general. But then the methods on the object are specific to filling missing values.

...so maybe that could work? If others agree then I would be in favor of fill...


Unless we do decide to change the name, is anything else remaining? Last call for feedback @pydata/xarray !

max-sixty avatar Jan 27 '25 19:01 max-sixty

How about gapfill? Two characters less than fill_gaps, single word and verb.

kmuehlbauer avatar Jan 27 '25 21:01 kmuehlbauer

gapfill sounds good to me! @max-sixty, are you ok as well? Then I change code+docs accordingly.

Ockenfuss avatar Jan 31 '25 18:01 Ockenfuss

How about gapfill? Two characters less than fill_gaps, single word and verb.

Does gapfill go against the python convention of separating words with underscores?

(I don't want to bikeshed, if others are really keen on this then totally OK)

@pydata/xarray any final call for name suggestions?

max-sixty avatar Feb 01 '25 01:02 max-sixty

To summarize:

  • fill: Pro: short+straightforward. Con: Easily confused with ffill, does not highlight the 'gap'
  • fill_gaps: Pro: Highlights the gaps. Con: Long
  • gapfill: Pro: Shorter and easier to type than gapfill. Con: nount first and no underscore separation are not pythonic.

I have a weak tendency towards fill_gaps (that's why I chose it in the beginning). But your view as external reviewer is probably more objective. I think you have to be the benevolent dictator and make a decision :)

Ockenfuss avatar Feb 05 '25 10:02 Ockenfuss