
Reductions

Open datapythonista opened this issue 5 years ago • 4 comments

Below are the reductions over numerical types defined in pandas. These can be applied:

  • To Series
  • To N columns of a DataFrame
  • To group by operations
  • As window functions (window, rolling, expanding or ewm)
  • In resample operations

pandas is not consistent in letting every reduction be applied to all of the above. Each method is independent (Series.sum, GroupBy.sum, Window.sum...), some reductions are not implemented for some of the classes, and the signatures can change (e.g. Series.var(ddof) vs EWM.var(bias)).

I propose to standardize the signatures of the reductions, and to make all reductions available on all of these classes.
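
To see the current mismatch directly (these are real pandas calls):

>>> import pandas as pd
>>> s = pd.Series([1.0, 2.0, 3.0])
>>> s.var(ddof=0)                    # Series.var takes ddof
0.6666666666666666
>>> s.ewm(alpha=0.5).var(bias=True)  # EWM.var takes bias instead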

Reductions for numerical data types and proposed signatures

  • all()
  • any()
  • count()
  • nunique() # maybe the name could be count_unique, count_distinct...?
  • mode() # what to do if there is more than one mode? Ideally we would like all reductions to return a scalar
  • min()
  • max()
  • median()
  • quantile(q, interpolation='linear') # in pandas q defaults to 0.5, but I think it's better to require it; interpolation can be {'linear', 'lower', 'higher', 'midpoint', 'nearest'}
  • sum()
  • prod()
  • mean()
  • var(ddof=1) # delta degrees of freedom (for some classes bias is used)
  • std(ddof=1)
  • skew()
  • kurt() # pandas has also the alias kurtosis
  • sem(ddof=1) # standard error of the mean
  • mad() # mean absolute deviation
  • autocorr(lag=1)
  • is_unique() # in pandas this is a property
  • is_monotonic() # in pandas this is a property
  • is_monotonic_decreasing() # in pandas this is a property
  • is_monotonic_increasing() # in pandas this is a property
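
As a quick illustration of the last four items, in pandas today these are attribute accesses rather than calls:

>>> import pandas as pd
>>> pd.Series([1, 2, 2]).is_unique               # property, no parentheses
False
>>> pd.Series([1, 2, 3]).is_monotonic_increasing
True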

Reductions that may depend on row labels (and could potentially return a list, like mode):

  • idxmax() / argmax()
  • idxmin() / argmin()

These need an extra column, other:

  • cov(other, ddof=1)
  • corr(other, method='pearson') # method can be {'pearson', 'kendall', 'spearman'}
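
For reference, this is how the two look in pandas today:

>>> import pandas as pd
>>> s1 = pd.Series([1.0, 2.0, 3.0])
>>> s2 = pd.Series([2.0, 4.0, 6.0])
>>> s1.cov(s2)   # ddof=1 by default
2.0
>>> s1.corr(s2, method='spearman')
1.0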

Questions

  • Allow reductions over rows, or only over columns?
  • What to do with NA?
  • pandas has parameters (bool_only, numeric_only) to apply the operation only over columns of certain types. Do we want them?
    • I think something like df.select_columns_by_dtype(int).sum() would be preferable to a parameter on all or some reductions (see the sketch after this list)
  • pandas has a level parameter in many reductions, for MultiIndex. If Indexing/MultiIndexing is part of the API, do we want to have it?
  • pandas has a min_count/min_periods parameter in some reductions (e.g. sum, min), to return NA if less than min_count values are present. Do we want to keep it?
  • How should reductions be applied?
    • In the top-level namespace, as pandas (e.g. df[col].sum())
    • Using an accessor (e.g. df[col].reduce.sum())
    • Having a reduce function, and passing the specific functions as a parameter (e.g. df[col].reduce(sum))
    • Other ideas
  • Would it make sense to have a third-party package implementing reductions that can be reused by projects?
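
A sketch of the dtype-selection idea from the list above. select_dtypes is today's pandas spelling; select_columns_by_dtype is the hypothetical proposed name:

>>> import pandas as pd
>>> df = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})
>>> df.select_dtypes(include="number").sum()  # pandas today
a    3
dtype: int64
>>> # proposed spelling (hypothetical): df.select_columns_by_dtype(int).sum()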

Frequency of usage

[image: pandas_reductions — chart of how often each reduction is used]

— datapythonista, Jun 05 '20

Thanks Marc! My responses to some of your questions below:

mode() # what to do if there is more than one mode? Ideally we would like all reductions to return a scalar

I would leave mode out, since it's not always a reduction.

Allow reductions over rows, or only over columns?

I think only over rows, at least initially.

What to do with NA?

Match pandas: skip NA values by default, but provide an option to return NA if there are any missing values.
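
For example, the current pandas behavior:

>>> import pandas as pd
>>> s = pd.Series([1.0, None, 3.0])
>>> s.sum()              # NA values skipped by default
4.0
>>> s.sum(skipna=False)  # opt in to NA propagation
nan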

pandas has parameters to apply the operation only over columns of certain types.

This is worth discussing in detail. I think I'd agree with you that the reduction should apply to all the columns, raising if that reduction isn't implemented for a specific dtype.

pandas has a level parameter in many reductions, for MultiIndex. If Indexing/MultiIndexing is part of the API, do we want to have it?

I think leave it out. AFAIK, df.sum(level="A") is equivalent to df.groupby(level="A").sum().
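
A small demo of that equivalence (the groupby spelling, which would be the one to keep):

>>> import pandas as pd
>>> idx = pd.MultiIndex.from_tuples([("x", 1), ("x", 2), ("y", 1)], names=["A", "B"])
>>> df = pd.DataFrame({"val": [1, 2, 3]}, index=idx)
>>> df.groupby(level="A").sum()
   val
A
x    3
y    3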

Do we want to keep [min_count]?

Yes, I think that's important. And I'd prefer to keep the pandas behavior of defining the sum of an empty column to be 0:

>>> pd.Series([]).sum()
0.0

Similarly, the product of an empty column is 1. If fewer than min_count non-NA observations are present, an NA result is returned.
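
Both cases in today's pandas:

>>> import pandas as pd
>>> pd.Series([], dtype=float).prod()
1.0
>>> pd.Series([], dtype=float).sum(min_count=1)
nan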

We arrived at that behavior after a long (contentious) discussion: https://mail.python.org/pipermail/pandas-dev/2017-November/000657.html.

How should reductions be applied?

Methods like .sum(), etc.

Would it make sense to have a third-party package implementing reductions that can be reused by projects?

I'm not sure how feasible this is, but it's probably out of scope for this project regardless.

— TomAugspurger, Jun 05 '20

Really great writeup @datapythonista and great comments @TomAugspurger.

In the in-person call, I mentioned that we should try to be as explicit as possible while also being extensible. I can talk at a high level about how we do this in Modin.

For explicitness, we have a query compiler layer that handles different types of queries. Let's look at __add__ for example:

df + 1  # Can be applied to each cell individually
df + df2  # There must be an alignment, or join, followed by an add on collisions

These do not have the same API in Modin's query compiler layer, which is the layer that other dataframe systems implement. The pandas API layer of Modin is the "consumer" of the dataframe in this case, and knows that when it is passed an integer for __add__ it should call a different query compiler method than it would for a dataframe.

This is more explicit than having an overloaded API that accepts many different types for other. From an end-user perspective this is not necessarily friendly, but for a library consuming a dataframe, it might be exactly what they want.
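
A minimal sketch of that dispatch, with hypothetical names (not Modin's actual query compiler interface):

class QueryCompiler:
    """Hypothetical backend interface; each dataframe system implements it."""
    def add_scalar(self, value):
        # Elementwise add; no alignment needed.
        raise NotImplementedError
    def add_dataframe(self, other_qc):
        # Align/join the two frames first, then add on collisions.
        raise NotImplementedError

class DataFrame:
    def __init__(self, query_compiler):
        self._qc = query_compiler
    def __add__(self, other):
        # The API layer, not the backend, picks the narrow method to call.
        if isinstance(other, DataFrame):
            return DataFrame(self._qc.add_dataframe(other._qc))
        return DataFrame(self._qc.add_scalar(other))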

For extensibility, in Modin, we have Function objects that can be extended and registered with the query compiler layer. Each implementation can have its own Function object at that layer. So in the Reduction case, we have a ReductionFunction. New functions can be registered with the query compiler at runtime (example). We also internally use this register logic to register a set of functions when the class is created (link).

This is, at a high level, how we try to handle things in Modin. Not everything can be expressed in an API, but it would be better for "power users" or library developers who consume dataframes to have this more expressive Function interface than it is to have a generic apply.
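
To make the extensibility point concrete, here is a toy registry in the spirit of Modin's Function objects; all names are illustrative, not Modin's real API:

class ReductionFunction:
    """Toy registry; reductions are registered once and looked up by name."""
    _registry = {}

    @classmethod
    def register(cls, name, func):
        cls._registry[name] = func

    @classmethod
    def apply(cls, name, column):
        return cls._registry[name](column)

# A library or power user can add its own reduction at runtime:
ReductionFunction.register("range", lambda col: max(col) - min(col))
print(ReductionFunction.apply("range", [3, 1, 7]))  # 6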

— devin-petersohn, Jun 12 '20

Trying to piece things together in my head here, but if we're just concerned with the user-facing API, do we need to worry about things like dispatching on the type of other in DataFrame.__add__? Isn't that solely a backend issue?

IMO, the API decision to be made here is between a single top-level method vs. many methods vs. an accessor:

# single top-level method
>>> df.reduce("sum")

# many methods
>>> df.sum()

# accessor
>>> df.reduce.sum()

Given that NumPy and pandas already implement these as .sum(), .mean(), etc., I'd say we should do the same unless we have a compelling reason not to.


Thinking more, @devin-petersohn does https://github.com/pydata-apis/dataframe-api/issues/11#issuecomment-643382802 come up in the application of opaque UDFs, something like

In [2]: df = pd.DataFrame({"A": ['a', 'a', 'b'], "B": [1, 2, 3]})

In [3]: df.groupby("A").apply(func)  # func is an arbitrary, opaque user-defined function

— TomAugspurger, Jun 15 '20

Thinking more, @devin-petersohn does #11 (comment) come up in the application of opaque UDFs, something like

Yes, sort of. There needs to be some kind of standard for the UDF as well, otherwise consumers will need to be aware of what system they are executing things on.

I agree that we want to focus on the top-level methods first, but we will need a way to let users define their own reductions, for example; otherwise users will convert to some external structure to do their computation and then convert back. My comment probably got a bit ahead of the conversation with its focus on extensibility; it might be a distraction.

— devin-petersohn, Jun 17 '20