dataframe `GroupBy.take()` and other missing functions

Currently, we can't do:

groupedDf
    .take(10)
    .concat()

to concatenate only the rows of the first 10 groups. Instead, we have to convert to a normal DataFrame first and then convert back:

groupedDf
    .toDataFrame().take(10).asGroupBy()
    .concat()

The only row-based function that's available is filter(GroupedRowFilter), which lets you write .filter { it.index() < 10 }, but that seems a bit odd.
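For comparison, here is roughly what the same intent looks like on plain Kotlin collections. This is only an analogy built on the standard library's groupBy, not the DataFrame API; Person, samplePeople and both helper functions are made up for illustration. It also shows that an index-based filter has to use `< n` to match take(n), since indices start at 0.

```kotlin
// Analogy using only the Kotlin standard library; no DataFrame types involved.
// Person, samplePeople and the helpers below are hypothetical illustration names.
data class Person(val city: String, val name: String)

val samplePeople = listOf(
    Person("Amsterdam", "Ann"),
    Person("Berlin", "Bob"),
    Person("Amsterdam", "Cas"),
    Person("Copenhagen", "Dee"),
    Person("Berlin", "Eve"),
)

// Keep the first n groups (groupBy preserves first-occurrence order of keys)
// and flatten them back into a single list: "take(n), then concat".
fun concatFirstGroups(people: List<Person>, n: Int): List<Person> =
    people.groupBy { it.city }.entries.take(n).flatMap { it.value }

// Same result through an index-based filter. Note `< n`, not `<= n`.
fun concatFirstGroupsByIndex(people: List<Person>, n: Int): List<Person> =
    people.groupBy { it.city }.entries
        .filterIndexed { index, _ -> index < n }
        .flatMap { it.value }

fun main() {
    println(concatFirstGroups(samplePeople, 2).map { it.name }) // [Ann, Cas, Bob, Eve]
    println(concatFirstGroupsByIndex(samplePeople, 2) == concatFirstGroups(samplePeople, 2)) // true
}
```

On collections the "take the first n groups" step is a one-liner on the entries, which is the ergonomics the issue asks for on GroupBy.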

May 02 '24 14:05 Jolanrensen

Other missing functions include size(), drop(), first(), etc. Maybe we could make it an AnyFrame or a DataColumn/BaseColumn<GroupedDataRow>.

May 02 '24 14:05 Jolanrensen

Interestingly, .filter {} runs on a GroupedRowFilter<T, G>, where T is the original DF type. This allows type-safe access to all key columns, but also to all non-key columns, which don't exist in the GroupBy object, so accessing them throws exceptions. This might need a slight redesign.

May 06 '24 13:05 Jolanrensen

> interestingly .filter {} runs on a GroupedRowFilter<T, G>, where T is the original DF type. This allows type-safe access to all key columns, but also to all non-key columns which don't exist in the GroupBy object, causing Exceptions... This might need a slight redesign.

https://github.com/Kotlin/dataframe/pull/663

May 06 '24 14:05 koperagen

GroupBy has some other curiosities I found while digging into it. It has a bunch of (extension) functions, which can be divided into three categories:

1. Functions that run on the individual groups, for example:
   - Those that transform the groups, like add {}, cumSum {}, etc.
   - Reducing functions like minBy {}, first(), last {}
   - aggregate {} belongs a bit in the first two categories, but the functions inside its DSL run on individual groups

2. Functions that run on the entire GroupBy by columns or do something special:
   - Like aggregate {}, pivot {}, concat()/concatWithKeys(), into()/toDataFrame()
   - xs(), which filters columns both from the keys and groups
   - sortBy {}

3. Functions that run on the entire GroupBy, treating it per row of keys + group:
   - forEach {}, map {}/mapToFrames {}/mapToRows {}, filter {}
   - They are not mentioned on https://kotlin.github.io/dataframe/groupby.html
   - They have 3 different ways of representing a "row" of keys + group

The functions I envisioned for this issue, like take {}, dropWhile {}, and rowsCount(), belong in this final category. However, it's apparent this needs some extra steps:

1. We need a single way to represent an entry of keys + group.
2. We need a way to distinguish functions that run on these entries from functions that run on the groups. For instance, say we want to introduce GroupBy.shuffled(): does it shuffle the rows of keys + groups, or just the rows inside each group? Or if you create GroupBy.take(5), does it take the first 5 keys + groups or the first 5 rows within each group? To me, this is already unclear for filter {} and map {}.
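To make the two readings of take(n) concrete, here is a minimal stdlib-only sketch. Entry, takeFirstEntries and takeWithinEachGroup are hypothetical names chosen for this illustration; none of them exist in the DataFrame library, and a GroupBy is modelled simply as a list of entries.

```kotlin
// Hypothetical stand-in for one "keys + group" entry; not a DataFrame type.
data class Entry<K, R>(val key: K, val group: List<R>)

// Reading 1: operate on the entries, i.e. keep the first n keys + groups.
fun <K, R> List<Entry<K, R>>.takeFirstEntries(n: Int): List<Entry<K, R>> =
    take(n)

// Reading 2: operate inside each group, i.e. keep the first n rows of every group.
fun <K, R> List<Entry<K, R>>.takeWithinEachGroup(n: Int): List<Entry<K, R>> =
    map { it.copy(group = it.group.take(n)) }

// Made-up sample data: three groups of different sizes.
val sampleEntries = listOf(
    Entry("a", listOf(1, 2, 3)),
    Entry("b", listOf(4, 5)),
    Entry("c", listOf(6)),
)

fun main() {
    // Two entries survive, each with its full group.
    println(sampleEntries.takeFirstEntries(2).map { it.key to it.group })
    // All three entries survive, each group cut to at most two rows.
    println(sampleEntries.takeWithinEachGroup(2).map { it.key to it.group })
}
```

Both readings are reasonable for the same call, and they give different results on the same input, which is why the name alone has to tell them apart.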

My suggestion would be to deprecate the existing 3 entry classes and introduce a new GroupByEntry. Then deprecate map, forEach, and filter in favor of mapEntries, forEachEntry, and filterEntries respectively, which would use that new type.
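A rough idea of what a single entry type could look like is sketched below, again using only the standard library. Only the names GroupByEntry, mapEntries, forEachEntry and filterEntries come from the suggestion above; their shapes here, plus Row, GroupByModel, groupRowsBy and the sample data, are assumptions made for illustration and not the library's design.

```kotlin
// Hypothetical model: a row is a column-name-to-value map.
typealias Row = Map<String, Any?>

// One shared representation of "keys + group" for all entry-level functions.
// It exposes only the key columns and the group, not the original row schema.
data class GroupByEntry(val keys: Row, val group: List<Row>)

// Minimal stand-in for GroupBy that only carries its entries.
class GroupByModel(val entries: List<GroupByEntry>) {
    fun filterEntries(predicate: (GroupByEntry) -> Boolean): GroupByModel =
        GroupByModel(entries.filter(predicate))

    fun <R> mapEntries(transform: (GroupByEntry) -> R): List<R> =
        entries.map(transform)

    fun forEachEntry(action: (GroupByEntry) -> Unit) =
        entries.forEach(action)
}

// Builds the model by grouping rows on a single key column.
fun groupRowsBy(rows: List<Row>, keyColumn: String): GroupByModel =
    GroupByModel(
        rows.groupBy { it[keyColumn] }
            .map { (key, group) -> GroupByEntry(mapOf(keyColumn to key), group) }
    )

// Made-up sample rows.
val sampleRows: List<Row> = listOf(
    mapOf("city" to "Amsterdam", "age" to 30),
    mapOf("city" to "Berlin", "age" to 41),
    mapOf("city" to "Amsterdam", "age" to 25),
)

fun main() {
    val grouped = groupRowsBy(sampleRows, "city")
    val bigGroups = grouped.filterEntries { it.group.size > 1 }
    println(bigGroups.mapEntries { it.keys["city"] }) // [Amsterdam]
    grouped.forEachEntry { println("${it.keys} has ${it.group.size} rows") }
}
```

Because every entry-level function receives the same GroupByEntry, the *Entries suffix signals that the lambda sees keys + group rather than the rows inside a group.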

We could try out this new mindset when creating, for instance, takeEntries(10) (as a shortcut for .toDataFrame().take(10).asGroupBy()) and maybe take(10) (as a shortcut for .updateGroups { take(10) }).

Aug 18 '25 11:08 Jolanrensen