pandas icon indicating copy to clipboard operation
pandas copied to clipboard

Various methods don't call call __finalize__

Open TomAugspurger opened this issue 5 years ago • 44 comments

Improve coverage of NDFrame.__finalize__

Pandas uses NDFrame.__finalize__ to propagate metadata from one NDFrame to another. This ensures that things like self.attrs and self.flags are not lost. In general we would like that any operation that accepts one or more NDFrames and returns an NDFrame should propagate metadata by calling __finalize__.

The test file at https://github.com/pandas-dev/pandas/blob/master/pandas/tests/generic/test_finalize.py attempts to be an exhaustive suite of tests for all these cases. However there are many tests currently xfailing, and there are likely many APIs not covered.

This is a meta-issue to improve the use of __finalize__. Here's a hopefully accurate list of methods that don't currently call finalize.

Some general comments around finalize

  1. We don't have a good sense for what should happen to attrs when there are multiple NDFrames involved with differing attrs (e.g. in concat). The safest approach is to probably drop the attrs when they don't match, but this will need some thought.
  2. We need to be mindful of performance. __finalize__ can be somewhat expensive so we'd like to call it exactly once per user-facing method. This can be tricky for things like DataFrame.apply which is sometimes used internally. We may need to refactor some methods to have a user-facing DataFrame.apply that calls an internal DataFrame._apply. The internal method would not call __finalize__, just the user-facing DataFrame.apply would.

If you're interested in working on this please post a comment indicating which method you're working on. Un-xfail the test, then update the method to pass the test. Some of these will be much more difficult to work on than others (e.g. groupby is going to be difficult). If you're unsure whether a particular method is likely to be difficult, ask first.

  • [x] DataFrame.__getitem__ with a scalar
  • [ ] DataFrame.eval with engine="numexpr"
  • [x] DataFrame.duplicated
  • [ ] DataFrame.add, mul, etc.
  • [ ] DataFrame.combine, DataFrame.combine_first
  • [x] DataFrame.update
  • [x] DataFrame.pivot, pivot_table
  • [x] DataFrame.stack
  • [x] DataFrame.unstack
  • [x] DataFrame.explode https://github.com/pandas-dev/pandas/pull/46629
  • [ ] DataFrame.melt https://github.com/pandas-dev/pandas/pull/46648
  • [x] DataFrame.diff
  • [x] DataFrame.applymap
  • [x] DataFrame.append
  • [ ] DataFrame.merge
  • [ ] DataFrame.cov
  • [ ] DataFrame.corrwith
  • [ ] DataFrame.count
  • [ ] DataFrame.nunique
  • [ ] DataFrame.idxmax, idxmin
  • [ ] DataFrame.mode
  • [x] DataFrame.quantile (scalar and list of quantiles)
  • [ ] DataFrame.isin
  • [ ] DataFrame.pop
  • [ ] DataFrame.squeeze
  • [ ] Series.abs
  • [ ] DataFrame.get
  • [ ] DataFrame.round
  • [ ] DataFrame.convert_dtypes
  • [ ] DataFrame.pct_change
  • [ ] DataFrame.transform
  • [ ] DataFrame.apply
  • [ ] DataFrame.any, sum, std, mean, etdc.
  • [x] Series.str. operations returning a Series / DataFrame
  • [x] Series.dt. operations returning a Series / DataFrame
  • [x] Series.cat. operations returning a Series / DataFrame
  • [ ] All groupby operations
  • [x] .iloc / .loc https://github.com/pandas-dev/pandas/pull/46101

TomAugspurger avatar Sep 04 '19 12:09 TomAugspurger

How __finalize__ works with methods like concat requires a bit of discussion. How do we reconcile metadata from multiple sources?

In #27108, I'm using _metadata to propagate whether the NDFrame allows duplicate labels. In this situation, my ideal reconciliation function would be

def reconcile_concat(others: List[DataFrame]) -> bool:
    """
    Allow duplicates only if all the inputs allowed them.

    If any disallow them, we disallow them.
    """
    return all(x.allows_duplicates for x in others)

However, that reconciliation strategy isn't valid / proper for arbitrary metadata. Which I think argues for some kind of dispatch system for reconciling metadata, where the attribute gets to determine how things are handled.

allows_duplicate_meta = PandasMetadata("allows_duplicates")  # the attribute name

@allows_duplicate_meta.register(pd.concat)  # the method
def reconcile_concat():
    ...

Then we always pass method to NDFrame.__finalize__, which we'll use to look up the function for how to reconcile things. A default reconciliation can be provided.

cc @jbrockmendel, since I think you looked into metadata propagation in another issue.

TomAugspurger avatar Sep 04 '19 21:09 TomAugspurger

IIRC I was looking at _metadata to try to implement units (this predated EAs). One of the biggest problems I had was that metadata on a Series didn't behave well when that Series is inserted into a DataFrame.

Do we have an idea of how often _metadata is used in the wild? i.e. could we deprecate it and make an EA-based implementation?

jbrockmendel avatar Sep 04 '19 22:09 jbrockmendel

It’s essentially undocumented, so I’m OK with being aggressive here.

What would an EA-based implementation look like? For something like units, metadata may not be appropriate. I think an EA dtype makes more sense.

I’ll add this to the agenda for next weeks call.

TomAugspurger avatar Sep 05 '19 00:09 TomAugspurger

What would an EA-based implementation look like?

It's a bit ambiguous what this question is referring to, but big picture something like _metadata but with dispatch logic could be done with something like (for this example I'm thinking of units in physics where (4 meters) / (2 seconds) == 2 m/s):

class MyDtype(ExtensionDtype):
    def __mul__(self, other):
        other = getattr(other, "dtype", other)
        return some_new_dtype


class MyEA(ExtensionArray):
     def __mul__(self, other):
         result = self._data * other
         dtype = self.dtype * other
         return type(self)(result, dtype)

jbrockmendel avatar Sep 05 '19 00:09 jbrockmendel

Right. Does the current EA interface suffice for that use case, or are there additional hooks needed?

TomAugspurger avatar Sep 05 '19 01:09 TomAugspurger

Not a blocker for 1.0.

TomAugspurger avatar Nov 12 '19 17:11 TomAugspurger

Do people think that accessor methods like .str and .dt should call finalize? IMO, yes they should.

TomAugspurger avatar Apr 06 '20 16:04 TomAugspurger

But would you then handle name propagation there?

jorisvandenbossche avatar Apr 06 '20 17:04 jorisvandenbossche

Name propagation isn't (currently) handled in __finalize__. I don't think it should be handled there currently, since the current __finalize__ isn't well suited to resolving the name when there are multiple inputs. In the future it might make sense.

My two motivating use-cases here are

  1. My allow_duplicate_labels PR, for disallowing duplicates
  2. A workload that preserves something like .attrs["source"] = "file.csv" through as many operations as makes sense.

TomAugspurger avatar Apr 06 '20 18:04 TomAugspurger

Name propagation isn't (currently) handled in finalize.

It actually is, not? At least in some cases? (eg for new Series originating from other Series, where other is that Series).

jorisvandenbossche avatar Apr 06 '20 18:04 jorisvandenbossche

Apologies, I forgot that name was in _metadata. So yes, name handling could be handled there.

TomAugspurger avatar Apr 06 '20 18:04 TomAugspurger

When should we call finalize? A high-level list:

Yes

  1. Reshape operations (stack, unstack, reset_index, set_index, to_frame)
  2. Indexing (take, loc, iloc, __getitem__, reindex, drop, assign, select_dtypes, nlargest?)
  3. "transforms" (repeat, explode, shift, diff, round, isin, fillna, isna, dropna, copy, rename, applymap, .transform, sort_values, sort_index)
  4. Accessor methods returning Series .str, .dt, .cat
  5. Binary ops (arithmetic, comparison, combine,
  6. ufuncs
  7. concat / merge / joins, append
  8. cumulative aggregations (cumsum)?

Unsure

  1. Reductions (DataFrame.sum(), etc. count, quantlie, idxmin)
  2. groupby, pivot, pivot_table?
  3. DataFrame.apply?
  4. corr, cov, etc.

These are somewhat arbitrary. I can't really come up with a rule why a reduction like DataFrame.sum() shouldn't call __finalize__ while DataFrame.cumsum does. So perhaps the rule should be "any NDFrame method returning an NDFrame (or Tuple[NDFrame]) will call __finalize__". I like the simplicity of that.

TomAugspurger avatar Apr 07 '20 14:04 TomAugspurger

I've updated the original post. Hopefully we can find some contributors interested in working on getting finalize called in more places.

TomAugspurger avatar Sep 03 '20 13:09 TomAugspurger

@TomAugspurger I would be interested in contributing to pandas and start by helping to tackle some of these methods. Which methods might be good places to start?

Sadin avatar Sep 03 '20 14:09 Sadin

Thanks @Sadin. I think DataFrame.duplicated would be a good one.

TomAugspurger avatar Sep 03 '20 14:09 TomAugspurger

@TomAugspurger I would be interested in contributing, is it all right if I work on DataFrame.unstack?

mzeitlin11 avatar Oct 17 '20 16:10 mzeitlin11

@mzeitlin11 that'd be great. Looks like @arw2019 is handling stack in #37186

TomAugspurger avatar Oct 17 '20 19:10 TomAugspurger

Series attrs are lost when using df.iloc. Dask relies on iloc when creating meta dataframes so the attrs will always be lost for dask series.

Example:

import pandas as pd
df = pd.DataFrame({"A": [1, 2], "B": [3, 4], "C": [5, 6]})
df.attrs = {'date': '2020-10-16'}
df.A.attrs['unit'] = 'kg'
df.iloc[:0].attrs
Out[20]: {'date': '2020-10-16'}
df.iloc[:0].A.attrs
Out[21]: {}

Further reading: https://github.com/dask/dask/issues/6730

Illviljan avatar Oct 18 '20 08:10 Illviljan

@TomAugspurger: I have a potential fix for part of the groupby issues, see c882db554888b141fa95d1891f9cb84ab06b95f4. The code was originally included in PR #35688 but was taken out to allow a separate review of potential performance issues.

Should I just reference this issue in a pull request, or open a more specific issue first?

Japanuspus avatar Oct 20 '20 09:10 Japanuspus

@TomAugspurger I'm interested in taking my first issue. Can I take DataFrame.Append?

abbywh avatar Oct 23 '20 01:10 abbywh

That'd be great.

On Thu, Oct 22, 2020 at 8:57 PM Joel Whittier [email protected] wrote:

@TomAugspurger https://github.com/TomAugspurger I'm interested in taking my first issue. Can I take DataFrame.Append?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/pandas-dev/pandas/issues/28283#issuecomment-714858662, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAKAOITDM2KY2P2YOO2IIVLSMDPIHANCNFSM4ITRD3TQ .

TomAugspurger avatar Oct 23 '20 11:10 TomAugspurger

Finished Append. Also did merge, eval, diff, applymap. Should have a PR coming soon.

abbywh avatar Oct 24 '20 15:10 abbywh

I have a potential fix for part of the groupby issues, see c882db5. The code was originally included in PR #35688 but was taken out to allow a separate review of potential performance issues.

@Japanuspus do you want to open a PR for it?

jorisvandenbossche avatar Oct 27 '20 21:10 jorisvandenbossche

I have a potential fix for part of the groupby issues, see c882db5. The code was originally included in PR #35688 but was taken out to allow a separate review of potential performance issues.

@Japanuspus do you want to open a PR for it?

Sure -- I'll go ahead and open with a reference to this issue.

Japanuspus avatar Oct 27 '20 22:10 Japanuspus

@TomAugspurger I think pivot_table hasn't been done yet, as it is still marked with xfail. Is it okay if I take this one?

liaoaoyuan97 avatar Jan 23 '21 03:01 liaoaoyuan97

Yep, thanks.

On Jan 22, 2021, at 9:22 PM, Eve [email protected] wrote:

@TomAugspurger https://github.com/TomAugspurger I think pivot_table hasn't been done yet, as it is still marked with xfail. Is it okay if I take this one?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/pandas-dev/pandas/issues/28283#issuecomment-765851582, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAKAOIW4UKSD4UH3RCM4VK3S3I6IPANCNFSM4ITRD3TQ.

TomAugspurger avatar Jan 23 '21 14:01 TomAugspurger

@TomAugspurger Do you still remember why ? is necessary in this regular expression? https://github.com/pandas-dev/pandas/blob/0c18cc6b8a76031eeed9380c365b4149994b72df/pandas/tests/generic/test_finalize.py#L497 It seems to duplicate *, if it means the capture group is optional.

liaoaoyuan97 avatar Feb 11 '21 05:02 liaoaoyuan97

I’m not sure.

On Feb 10, 2021, at 23:06, Eve [email protected] wrote:

 @TomAugspurger Do you still remember why ? is necessary in this regular expression? https://github.com/pandas-dev/pandas/blob/0c18cc6b8a76031eeed9380c365b4149994b72df/pandas/tests/generic/test_finalize.py#L497 It seems to duplicate *, if it means the capture group is optional.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or unsubscribe.

TomAugspurger avatar Feb 11 '21 12:02 TomAugspurger

I would like to work on few methods which are not so complicated to fix. Can someone please point me in the right direction? This will be my first contribution.

palash247 avatar Mar 13 '21 09:03 palash247

@TomAugspurger I'd like to work on eval for my first contribution if that is alright

rlkenny98 avatar Mar 16 '21 19:03 rlkenny98