Super columns as nested dataframes

Open jack-pappas opened this issue 5 years ago • 1 comments

One optional feature to consider including in our specification is nested "DataFrames"/Tables (or whatever name we decide to use there).

riptide does not currently support this concept, but I've been thinking recently that maybe it ought to since it provides a cleaner, more elegant solution for supporting "super columns".

Such "super columns" arise (for example) when performing multiple aggregates on a DataFrame, especially in the common case of grouping by the values/keys in one or more columns then calculating per-group reductions over some subset of the remaining columns.

pandas currently handles this scenario using the concepts of index "levels" and row labeling. This solves the problem but adds a good bit of complexity to the API, including having a stateful DataFrame class. (One could argue the statefulness of DataFrame is a pandas implementation detail of the approach and not inherently a drawback of the approach itself.)

riptide currently handles this scenario by having a Multiset class derived from our Dataset class, where Multiset is basically just a dictionary of named Datasets. This works ok and isn't that far removed from the notion of having nested Dataset instances -- and if you squint just right when you look at Multiset, it's not really that different from the pandas system of index levels + row labels. However, Multiset has it's own drawbacks. Most notably, deriving from our Dataset class means any function that knows how to operate on a Dataset also (in many cases) needs to know how to work with a Multiset in order to produce the "correct" output (in terms of the type, dimensions, and data) -- what's the expected output if one calls a 'merge'-type function with a Multiset and a Dataset, or with two Multisets?

pyarrow's documentation for pandas interoperability says it's Table class already supports nested DataFrames / column groups, although that's the only mention of this behavior I can find.

Nested DataFrames provide a clean solution for representing the "super columns" resulting from these multi-aggregation operations; specifically, they:

Remove the complexity of index levels / row labels, or at least the need to support them. This means the spec will be simpler and easier to implement.
Fix the messy OO / inheritance situation I described above with riptide's Multiset class. Each "column" in a DataFrame is either a 1D array (or maybe an array scalar?) or another DataFrame of the same length.
Allow for (some) unification of the array and DataFrame APIs. For example, implementing a function like .sum() on a DataFrame can distribute that function call over it's columns by just iterating over them and calling .sum() on each of them.
(maybe) Could be used as a way to interoperate with other DataFrame implementations which do allow multiple occurrences of a column with the same name. When converted to the nested representation, those columns might (e.g.) get names like "1" and "2", "x" and "y", or use the name of the DataFrame they came from, then they're added to a new DataFrame containing just the two of columns, then that's added as a (nested) column to the resulting/outer DataFrame.

One way the nesting makes things more complicated (maybe) is what to return from a property like DataFrame.num_cols -- should it be the number of columns as seen by that DataFrame instance, or should it be a flattened value (so counting the columns from any nested DataFrames as well). I think this could be disambiguated by having two properties like DataFrame.num_cols and DataFrame.num_cols_flat.

Jun 17 '20 21:06 jack-pappas

Can you clarify the difference between nested data and the multiple aggregations use-case?

To me, nested data typically means the values in the dataframe are some sort of collection themselves (lists, dicts, dataframes, etc). For example, a DataFrame with one column and two rows, where each row is a DataFrame with two columns ['a', 'b']

In [10]: df = pd.DataFrame({"A": [pd.DataFrame({"a": [1, 2], "b": [3, 4]}), pd.DataFrame({"A": [1], "B": [2]})]})

In [12]: df.iloc[0, 0]  # slice to get the "scalar" dataframe
Out[12]:
   a  b
0  1  3
1  2  4

This (I think) is different from what pandas calls hierarchical labels, which can arise when performing multiple aggregations on multiple columns.

In [13]: df = pd.DataFrame({"A": ['a', 'a', 'b', 'b'], "B": [1, 2, 3, 4], "C": [1, 1, 1, 1]})

In [14]: df
Out[14]:
   A  B  C
0  a  1  1
1  a  2  1
2  b  3  1
3  b  4  1

In [15]: df.groupby("A").agg(['sum', 'mean'])
Out[15]:
    B        C
  sum mean sum mean
A
a   3  1.5   2    1
b   7  3.5   2    1

Is there an equivalence between DataFrames storing DataFrames and what pandas calls hierarchical labels (which I think count as "super columns"), or are these two distinct concepts?

As an aside, I think many pandas maintainers & users are unhappy with the complexity induced by hierarchical column labels. So we'd probably recommend that aggregation be written as something like.

In [17]: df.groupby("A").agg(B_sum=("B", "sum"), B_mean=("B", "mean"), C_sum=("C", "sum"), C_mean=("C", "mean"))
Out[17]:
   B_sum  B_mean  C_sum  C_mean
A
a      3     1.5      2       1
b      7     3.5      2       1

Jun 18 '20 13:06 TomAugspurger