Add 'list' to agg functions
Given:
import databricks.koalas as ks
df = ks.DataFrame({'A': [1, 1, 2, 2], 'B': ['x', 'x', 'x', 'y']}, columns=['A', 'B'])
aggregated = df.groupby('A').agg({'B': 'list'})
Should return:
aggregated.B
>>> [['x', 'x'],
['x', 'y']]
So that I may operate on a list of items from a groupBy operation.
It seems pandas doesn't support it either:
>>> import pandas as pd
>>> pdf = pd.DataFrame({'A': [1, 1, 2, 2], 'B': ['x', 'x', 'x', 'y']}, columns=['A', 'B'])
>>> pdf.groupby('A').agg({'B': 'list'})
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/ueshin/workspace/databricks-koalas/miniconda/envs/databricks-koalas_3.6/lib/python3.6/site-packages/pandas/core/groupby/generic.py", line 1315, in aggregate
return super(DataFrameGroupBy, self).aggregate(arg, *args, **kwargs)
File "/Users/ueshin/workspace/databricks-koalas/miniconda/envs/databricks-koalas_3.6/lib/python3.6/site-packages/pandas/core/groupby/generic.py", line 186, in aggregate
result, how = self._aggregate(arg, _level=_level, *args, **kwargs)
File "/Users/ueshin/workspace/databricks-koalas/miniconda/envs/databricks-koalas_3.6/lib/python3.6/site-packages/pandas/core/base.py", line 498, in _aggregate
result = _agg(arg, _agg_1dim)
File "/Users/ueshin/workspace/databricks-koalas/miniconda/envs/databricks-koalas_3.6/lib/python3.6/site-packages/pandas/core/base.py", line 449, in _agg
result[fname] = func(fname, agg_how)
File "/Users/ueshin/workspace/databricks-koalas/miniconda/envs/databricks-koalas_3.6/lib/python3.6/site-packages/pandas/core/base.py", line 432, in _agg_1dim
return colg.aggregate(how, _level=(_level or 0) + 1)
File "/Users/ueshin/workspace/databricks-koalas/miniconda/envs/databricks-koalas_3.6/lib/python3.6/site-packages/pandas/core/groupby/generic.py", line 760, in aggregate
return getattr(self, func_or_funcs)(*args, **kwargs)
File "/Users/ueshin/workspace/databricks-koalas/miniconda/envs/databricks-koalas_3.6/lib/python3.6/site-packages/pandas/core/groupby/groupby.py", line 536, in __getattr__
(type(self).__name__, attr))
AttributeError: 'SeriesGroupBy' object has no attribute 'list'
By the way, there is a workaround: we can call Spark's aggregate functions by name:
>>> kdf = ks.DataFrame({'A': [1, 1, 2, 2], 'B': ['x', 'x', 'x', 'y']}, columns=['A', 'B'])
>>> aggregated = kdf.groupby('A').agg({'B': 'collect_list'})
>>> aggregated
        B
A
1  [x, x]
2  [x, y]
>>> aggregated.B
A
1 [x, x]
2 [x, y]
Name: B, dtype: object
Hmm, but pandas does support something like this when the builtin list is passed instead of the string 'list'. Not sure if this is what the author meant:
pdf = pd.DataFrame({'A': [1, 1, 2, 2], 'B': ['x', 'x', 'x', 'y']}, columns=['A', 'B'])
pdf.groupby('A').agg({'B': list})
Not sure if Koalas wants to support this, but I do love this workaround! 😃
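For reference, the pandas behaviour described in this comment can be checked with a short snippet. This is only a sketch against plain pandas (not Koalas), reusing the data from the issue. The variable name `aggregated` follows the original report.

```python
import pandas as pd

pdf = pd.DataFrame({'A': [1, 1, 2, 2], 'B': ['x', 'x', 'x', 'y']}, columns=['A', 'B'])

# Pass the builtin `list` (a callable), not the string 'list'.
aggregated = pdf.groupby('A').agg({'B': list})

# Each group's values of B are collected into one Python list per key of A.
print(aggregated['B'].tolist())  # [['x', 'x'], ['x', 'y']]
```

The string form fails because pandas resolves string names to groupby methods, and there is no method called list on the groupby object, as the traceback earlier in the thread shows.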
Ah, I see. We might want to support it. cc @HyukjinKwon
@charlesdong1991 Correct, I meant the builtin list, not the string 'list'.
Seems #726 is related to this issue.
@ueshin But pandas doesn't support 'collect_list' as an aggregate function. It would be awesome if pandas code could run in Koalas without changes. I think the best solution would be for Koalas to support lambda functions as aggregate functions.
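To make the request concrete, here is a small sketch of what lambda aggregation looks like in plain pandas today. Whether and how Koalas would accept the same calls is not established in this thread, so treat this purely as the pandas side of the comparison. The names `as_list` and `n_unique` are made up for the illustration.

```python
import pandas as pd

pdf = pd.DataFrame({'A': [1, 1, 2, 2], 'B': ['x', 'x', 'x', 'y']}, columns=['A', 'B'])

# A lambda receives each group's column as a Series and may return any object.
as_list = pdf.groupby('A').agg({'B': lambda s: list(s)})

# The same mechanism covers arbitrary reductions, e.g. counting distinct values.
n_unique = pdf.groupby('A').agg({'B': lambda s: len(set(s))})

print(as_list['B'].tolist())   # [['x', 'x'], ['x', 'y']]
print(n_unique['B'].tolist())  # [1, 2]
```

Supporting callables in general would cover the list case from this issue without adding a special 'list' string.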