
Reductions

Open datapythonista opened this issue 5 years ago • 4 comments

Below are the reductions over numerical types defined in pandas. These can be applied:

  • To Series
  • To N columns of a DataFrame
  • To group by operations
  • As window functions (window, rolling, expanding or ewm)
  • In resample operations

pandas is not consistent in letting every reduction be applied to all of the above. Each method is independent (Series.sum, GroupBy.sum, Window.sum...), some reductions are not implemented for some of the classes, and the signatures can change (e.g. Series.var(ddof) vs EWM.var(bias)).

I propose to standardize the signatures of the reductions, and to make all reductions available on all of these classes.
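
To see the current mismatch directly (these are real pandas calls):

>>> import pandas as pd
>>> s = pd.Series([1.0, 2.0, 3.0])
>>> s.var(ddof=0)                    # Series.var takes ddof
0.6666666666666666
>>> s.ewm(alpha=0.5).var(bias=True)  # EWM.var takes bias instead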

Reductions for numerical data types and proposed signatures

  • all()
  • any()
  • count()
  • nunique() # maybe the name could be count_unique, count_distinct...?
  • mode() # what to do if there is more than one mode? Ideally we would like all reductions to return a scalar
  • min()
  • max()
  • median()
  • quantile(q, interpolation='linear') # in pandas q defaults to 0.5, but I think it's better to require it; interpolation can be {'linear', 'lower', 'higher', 'midpoint', 'nearest'}
  • sum()
  • prod()
  • mean()
  • var(ddof=1) # delta degrees of freedom (for some classes bias is used)
  • std(ddof=1)
  • skew()
  • kurt() # pandas has also the alias kurtosis
  • sem(ddof=1) # standard error of the mean
  • mad() # mean absolute deviation
  • autocorr(lag=1)
  • is_unique() # in pandas this is a property
  • is_monotonic() # in pandas this is a property
  • is_monotonic_decreasing() # in pandas this is a property
  • is_monotonic_increasing() # in pandas this is a property
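
As a quick illustration of the last four items, in pandas today these are attribute accesses rather than calls:

>>> import pandas as pd
>>> pd.Series([1, 2, 2]).is_unique               # property, no parentheses
False
>>> pd.Series([1, 2, 3]).is_monotonic_increasing
True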

Reductions that may depend on row labels (and could potentially return a list, like mode):

  • idxmax() / argmax()
  • idxmin() / argmin()

These need an extra column, other:

  • cov(other, ddof=1)
  • corr(other, method='pearson') # method can be {'pearson', 'kendall', 'spearman'}
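
For reference, this is how the two look in pandas today:

>>> import pandas as pd
>>> s1 = pd.Series([1.0, 2.0, 3.0])
>>> s2 = pd.Series([2.0, 4.0, 6.0])
>>> s1.cov(s2)   # ddof=1 by default
2.0
>>> s1.corr(s2, method='spearman')
1.0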

Questions

  • Allow reductions over rows, or only over columns?
  • What to do with NA?
  • pandas has parameters (bool_only, numeric_only) to apply the operation only over columns of certain types. Do we want them?
    • I think something like df.select_columns_by_dtype(int).sum() would be preferable to a parameter on all or some reductions (see the sketch after this list)
  • pandas has a level parameter in many reductions, for MultiIndex. If Indexing/MultiIndexing is part of the API, do we want to have it?
  • pandas has a min_count/min_periods parameter in some reductions (e.g. sum, min), to return NA if less than min_count values are present. Do we want to keep it?
  • How should reductions be applied?
    • In the top-level namespace, as pandas (e.g. df[col].sum())
    • Using an accessor (e.g. df[col].reduce.sum())
    • Having a reduce function, and passing the specific functions as a parameter (e.g. df[col].reduce(sum))
    • Other ideas
  • Would it make sense to have a third-party package implementing reductions that can be reused by projects?
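
A sketch of the dtype-selection idea from the list above. select_dtypes is today's pandas spelling; select_columns_by_dtype is the hypothetical proposed name:

>>> import pandas as pd
>>> df = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})
>>> df.select_dtypes(include="number").sum()  # pandas today
a    3
dtype: int64
>>> # proposed spelling (hypothetical): df.select_columns_by_dtype(int).sum()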

Frequency of usage

[image: pandas_reductions — chart of how often each reduction is used]

— datapythonista, Jun 05 '20

Thanks Marc! My responses to some of your questions below:

mode() # what to do if there is more than one mode? Ideally we would like all reductions to return a scalar

I would leave mode out, since it's not always a reduction.

Allow reductions over rows, or only over columns?

I think only over rows, at least initially.

What to do with NA?

Match pandas: skip NA values by default, but provide an option to return NA if there are any missing values.
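
For example, the current pandas behavior:

>>> import pandas as pd
>>> s = pd.Series([1.0, None, 3.0])
>>> s.sum()              # NA values skipped by default
4.0
>>> s.sum(skipna=False)  # opt in to NA propagation
nan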

pandas has parameters to apply the operation only over columns of certain types.

This is worth discussing in detail. I think I'd agree with you that the reduction should apply to all the columns, raising if that reduction isn't implemented for a specific dtype.

pandas has a level parameter in many reductions, for MultiIndex. If Indexing/MultiIndexing is part of the API, do we want to have it?

I think leave it out. AFAIK, df.sum(level="A") is equivalent to df.groupby(level="A").sum().
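
A small demo of that equivalence (the groupby spelling, which would be the one to keep):

>>> import pandas as pd
>>> idx = pd.MultiIndex.from_tuples([("x", 1), ("x", 2), ("y", 1)], names=["A", "B"])
>>> df = pd.DataFrame({"val": [1, 2, 3]}, index=idx)
>>> df.groupby(level="A").sum()
   val
A
x    3
y    3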

Do we want to keep [min_count]?

Yes, I think that's important. And I'd prefer to keep the pandas behavior of defining the sum of an empty column to be 0:

>>> pd.Series([]).sum()
0.0

Similarly, the product of an empty column is 1. If fewer than min_count non-NA observations are present, an NA result is returned.
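
Both cases in today's pandas:

>>> import pandas as pd
>>> pd.Series([], dtype=float).prod()
1.0
>>> pd.Series([], dtype=float).sum(min_count=1)
nan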

We arrived at that behavior after a long (contentious) discussion: https://mail.python.org/pipermail/pandas-dev/2017-November/000657.html.

How should reductions be applied?

Methods like .sum(), etc.

Would it make sense to have a third-party package implementing reductions that can be reused by projects?

I'm not sure how feasible this is, but it's probably out of scope for this project regardless.

— TomAugspurger, Jun 05 '20

Really great writeup @datapythonista and great comments @TomAugspurger.

In the in-person call, I mentioned that we should try to be as explicit as possible while also being extensible. I can talk at a high level about how we do this in Modin.

For explicitness, we have a query compiler layer that handles different types of queries. Let's look at __add__ for example:

df + 1  # Can be applied to each cell individually
df + df2  # There must be an alignment, or join, followed by an add on collisions

These do not have the same API in Modin's query compiler layer, which is the layer that other dataframe systems implement. The pandas API layer of Modin is the "consumer" of the dataframe in this case, and knows that when it is passed an integer for __add__ it should call a different query compiler method than it would for a dataframe.

This is more explicit than having an overloaded API that accepts many different types for other. From an end-user perspective this is not necessarily friendly, but for a library consuming a dataframe, it might be exactly what they want.
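
A minimal sketch of that dispatch, with hypothetical names (not Modin's actual query compiler interface):

class QueryCompiler:
    """Hypothetical backend interface; each dataframe system implements it."""
    def add_scalar(self, value):
        # Elementwise add; no alignment needed.
        raise NotImplementedError
    def add_dataframe(self, other_qc):
        # Align/join the two frames first, then add on collisions.
        raise NotImplementedError

class DataFrame:
    def __init__(self, query_compiler):
        self._qc = query_compiler
    def __add__(self, other):
        # The API layer, not the backend, picks the narrow method to call.
        if isinstance(other, DataFrame):
            return DataFrame(self._qc.add_dataframe(other._qc))
        return DataFrame(self._qc.add_scalar(other))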

For extensibility, in Modin, we have Function objects that can be extended and registered with the query compiler layer. Each implementation can have its own Function object at that layer. So in the Reduction case, we have a ReductionFunction. New functions can be registered with the query compiler at runtime (example). We also internally use this register logic to register a set of functions when the class is created (link).

This is, at a high level, how we try to handle things in Modin. Not everything can be expressed in an API, but it would be better for "power users" or library developers who consume dataframes to have this more expressive Function interface than it is to have a generic apply.
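
To make the extensibility point concrete, here is a toy registry in the spirit of Modin's Function objects; all names are illustrative, not Modin's real API:

class ReductionFunction:
    """Toy registry; reductions are registered once and looked up by name."""
    _registry = {}

    @classmethod
    def register(cls, name, func):
        cls._registry[name] = func

    @classmethod
    def apply(cls, name, column):
        return cls._registry[name](column)

# A library or power user can add its own reduction at runtime:
ReductionFunction.register("range", lambda col: max(col) - min(col))
print(ReductionFunction.apply("range", [3, 1, 7]))  # 6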

— devin-petersohn, Jun 12 '20

Trying to piece things together in my head here, but if we're just concerned with the user-facing API, do we need to worry about things like dispatching on the type of other in DataFrame.__add__? Isn't that solely a backend issue?

IMO, the API decision to be made here is between a single top-level method vs. many methods vs. an accessor:

# single top-level method
>>> df.reduce("sum")

# many methods
>>> df.sum()

# accessor
>>> df.reduce.sum()

Given that NumPy and pandas already implement these as .sum(), .mean(), etc., I'd say we should do the same unless we have a compelling reason not to.


Thinking more, @devin-petersohn does https://github.com/pydata-apis/dataframe-api/issues/11#issuecomment-643382802 come up in the application of opaque UDFs, something like

In [2]: df = pd.DataFrame({"A": ['a', 'a', 'b'], "B": [1, 2, 3]})

In [3]: df.groupby("A").apply(func)  # func is an arbitrary, opaque user-defined function

— TomAugspurger, Jun 15 '20

Thinking more, @devin-petersohn does #11 (comment) come up in the application of opaque UDFs, something like

Yes, sort of. There needs to be some kind of standard for the UDF as well, otherwise consumers will need to be aware of what system they are executing things on.

I agree that we want to focus on the top-level methods first, but we will need a way to let users define their own reductions, for example; otherwise users will convert to some external structure to do their computation and then convert back. My comment probably got a bit ahead of the conversation with its focus on extensibility; it might be a distraction.

— devin-petersohn, Jun 17 '20