koalas Add 'list' to agg functions

Given:

import databricks.koalas as ks
df = ks.DataFrame({'A': [1, 1, 2, 2], 'B': ['x', 'x', 'x', 'y']}, columns=['A', 'B'])
aggregated = df.groupby('A').agg({'B': 'list'})

Should return:

aggregated.B
>>> [['x', 'x'],
['x', 'y']]

So that I may operate on a list of items from a groupBy operation.
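
To make the requested semantics concrete, here is a minimal sketch in plain Python. It is not Koalas code, and the helper name collect_lists is hypothetical; it only illustrates "collect each group's values into a list".

```python
from collections import defaultdict

def collect_lists(rows, key, value):
    """Group rows by `key` and collect each group's `value` entries into a list.

    Hypothetical helper for illustration only; not part of Koalas or pandas.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row[value])
    return dict(groups)

# Same data as the DataFrame in the request, written as plain dicts.
rows = [
    {"A": 1, "B": "x"},
    {"A": 1, "B": "x"},
    {"A": 2, "B": "x"},
    {"A": 2, "B": "y"},
]
collected = collect_lists(rows, "A", "B")
print(collected)
```

Each key of the result corresponds to one value of A, and each value is the list of B entries for that group.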

Oct 04 '19 15:10 DataDave-datajoi

It seems pandas doesn't support it either:

>>> import pandas as pd
>>> pdf = pd.DataFrame({'A': [1, 1, 2, 2], 'B': ['x', 'x', 'x', 'y']}, columns=['A', 'B'])
>>> pdf.groupby('A').agg({'B': 'list'})
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/ueshin/workspace/databricks-koalas/miniconda/envs/databricks-koalas_3.6/lib/python3.6/site-packages/pandas/core/groupby/generic.py", line 1315, in aggregate
    return super(DataFrameGroupBy, self).aggregate(arg, *args, **kwargs)
  File "/Users/ueshin/workspace/databricks-koalas/miniconda/envs/databricks-koalas_3.6/lib/python3.6/site-packages/pandas/core/groupby/generic.py", line 186, in aggregate
    result, how = self._aggregate(arg, _level=_level, *args, **kwargs)
  File "/Users/ueshin/workspace/databricks-koalas/miniconda/envs/databricks-koalas_3.6/lib/python3.6/site-packages/pandas/core/base.py", line 498, in _aggregate
    result = _agg(arg, _agg_1dim)
  File "/Users/ueshin/workspace/databricks-koalas/miniconda/envs/databricks-koalas_3.6/lib/python3.6/site-packages/pandas/core/base.py", line 449, in _agg
    result[fname] = func(fname, agg_how)
  File "/Users/ueshin/workspace/databricks-koalas/miniconda/envs/databricks-koalas_3.6/lib/python3.6/site-packages/pandas/core/base.py", line 432, in _agg_1dim
    return colg.aggregate(how, _level=(_level or 0) + 1)
  File "/Users/ueshin/workspace/databricks-koalas/miniconda/envs/databricks-koalas_3.6/lib/python3.6/site-packages/pandas/core/groupby/generic.py", line 760, in aggregate
    return getattr(self, func_or_funcs)(*args, **kwargs)
  File "/Users/ueshin/workspace/databricks-koalas/miniconda/envs/databricks-koalas_3.6/lib/python3.6/site-packages/pandas/core/groupby/groupby.py", line 536, in __getattr__
    (type(self).__name__, attr))
AttributeError: 'SeriesGroupBy' object has no attribute 'list'

By the way, there is a workaround: we can call Spark's aggregate functions by name:

>>> kdf = ks.DataFrame({'A': [1, 1, 2, 2], 'B': ['x', 'x', 'x', 'y']}, columns=['A', 'B'])
>>> aggregated = kdf.groupby('A').agg({'B': 'collect_list'})
>>> aggregated
        B
A
1  [x, x]
2  [x, y]
>>> aggregated.B
A
1    [x, x]
2    [x, y]
Name: B, dtype: object

Oct 04 '19 18:10 ueshin

Hmm, but pandas does support something like this when the builtin list is passed instead of the string 'list'. Not sure if this is what he meant:

pdf = pd.DataFrame({'A': [1, 1, 2, 2], 'B': ['x', 'x', 'x', 'y']}, columns=['A', 'B'])
pdf.groupby('A').agg({'B': list})

Not sure if Koalas wants to support this, but I do love the workaround! 😃
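
For reference, a minimal runnable sketch of the pandas call above, assuming pandas is installed. It shows that the builtin list (not the string 'list') yields one list per group.

```python
import pandas as pd

pdf = pd.DataFrame(
    {'A': [1, 1, 2, 2], 'B': ['x', 'x', 'x', 'y']},
    columns=['A', 'B'],
)

# Pass the builtin `list` callable; pandas applies it to each group's values.
aggregated = pdf.groupby('A').agg({'B': list})

# One row per distinct value of A, each holding the collected B values.
print(aggregated)
print(aggregated['B'].tolist())
```

The result is indexed by A, so aggregated.B gives a Series of lists, much like the collect_list workaround does in Koalas.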

Oct 04 '19 19:10 charlesdong1991

Ah, I see. We might want to support it. cc @HyukjinKwon

Oct 04 '19 19:10 ueshin

@charlesdong1991 Correct, I meant the builtin list, not the string 'list'.

Oct 04 '19 22:10 DataDave-datajoi

It seems #726 is related to this issue.

Oct 04 '19 23:10 ueshin

@ueshin But pandas doesn't support 'collect_list' as an aggregate function, so code using the workaround is not portable. It would be awesome if pandas code could run in Koalas without changes. I think the best solution would be for Koalas to support lambda functions as aggregate functions.
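
As an illustration of the kind of pandas code meant here, a minimal sketch assuming pandas is installed. It is pandas only and says nothing about how Koalas would implement this; the sorted-unique aggregation is just a hypothetical example of something a lambda can express.

```python
import pandas as pd

pdf = pd.DataFrame(
    {'A': [1, 1, 2, 2], 'B': ['x', 'x', 'x', 'y']},
    columns=['A', 'B'],
)

# A lambda as the aggregate function: collect the distinct B values
# of each group into a sorted list (hypothetical example aggregation).
unique_per_group = pdf.groupby('A').agg({'B': lambda values: sorted(set(values))})

print(unique_per_group['B'].tolist())
```

Supporting this form would let the same line run unchanged on either library, which a Spark-specific name such as 'collect_list' cannot do.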

Oct 28 '21 14:10 amorimds