`GroupBy.take()` and other missing functions
Currently, we can't do:
groupedDf
.take(10)
.concat()
to only concatenate the values of the first 10 groups. Instead, we'll have to convert to a normal DF first and convert back:
groupedDf
.toDataFrame().take(10).asGroupBy()
.concat()
The only row-based function that's available is `filter(GroupedRowFilter)`, which lets you write `.filter { it.index() < 10 }` (the index is zero-based), but that seems a bit odd.
Other missing functions include size(), drop(), first() etc.
Maybe we could make GroupBy an AnyFrame or a DataColumn/BaseColumn<GroupedDataRow>.
Interestingly, `.filter {}` runs on a `GroupedRowFilter<T, G>`, where `T` is the original DF type. This allows type-safe access to all key columns, but also to all non-key columns, which don't exist in the GroupBy object, causing exceptions. This might need a slight redesign.
https://github.com/Kotlin/dataframe/pull/663
GroupBy has some other curiosities I found while digging into it. It has a bunch of (extension) functions, which can be divided into three categories:
- Functions that run on the individual groups, for example:
  - Those that transform the groups, like `add {}`, `cumSum {}`, etc.
  - Reducing functions like `minBy {}`, `first()`, `last {}`
  - `aggregate {}` belongs a bit in the first two categories, but the functions inside its DSL run on individual groups
- Functions that run on the entire `GroupBy` by columns or do something special
  - Like `aggregate {}`, `pivot {}`, `concat()`/`concatWithKeys()`, `into()`/`toDataFrame()`
  - `xs()`, filters columns both from the keys and groups
  - `sortBy {}`
- Functions that run on the entire `GroupBy`, treating it per row of keys+group
  - `forEach {}`, `map {}`/`mapToFrames {}`/`mapToRows {}`, `filter {}`
  - They are not mentioned on https://kotlin.github.io/dataframe/groupby.html
  - They have 3 different ways of representing a "row" of keys + group
The functions I envisioned for this issue belong in this final category, like `take {}`, `dropWhile {}`, and `rowsCount()`. However, it's apparent this needs some extra steps:
- We need a single way to represent an entry of keys + group
- We need a way to distinguish between functions that run on these entries instead of on the groups. For instance, let's say we want to introduce the function `GroupBy.shuffled()`: does this shuffle the rows of keys+groups, or just the rows inside each group? Or if you create `GroupBy.take(5)`, does this take the first 5 keys+groups or the first 5 rows within each group? This is already unclear for `filter {}` and `map {}` to me.
My suggestion would be to deprecate the existing 3 entry classes and introduce a new `GroupByEntry`. We would then deprecate `map`, `forEach`, and `filter` in favor of `mapEntries`, `forEachEntry`, and `filterEntries` respectively, all using that new type.
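To make the idea a bit more concrete, here is a rough sketch of what a single entry type could look like. It is a plain-Kotlin model, not the real DataFrame API: `Row`, `SimpleGroupBy`, and `simpleGroupBy` are hypothetical stand-ins, and the shape of `GroupByEntry` (a `keys` row plus a `group` of rows) is only an assumption about how the proposed class might be laid out.

```kotlin
// Hypothetical model: none of these types are the real Kotlin DataFrame API.
typealias Row = Map<String, Any?>

// One uniform representation of a "row" of a GroupBy: the key values plus the group.
data class GroupByEntry(val keys: Row, val group: List<Row>)

// Stand-in for GroupBy: an ordered list of entries.
class SimpleGroupBy(val entries: List<GroupByEntry>) {
    // Entry-level filter: keeps or drops whole keys+group entries.
    fun filterEntries(predicate: (GroupByEntry) -> Boolean): SimpleGroupBy =
        SimpleGroupBy(entries.filter(predicate))

    // Entry-level map: one result per keys+group entry.
    fun <R> mapEntries(transform: (GroupByEntry) -> R): List<R> = entries.map(transform)

    // Entry-level iteration.
    fun forEachEntry(action: (GroupByEntry) -> Unit) = entries.forEach(action)
}

// Builds the model by grouping plain rows on a single key column.
fun List<Row>.simpleGroupBy(key: String): SimpleGroupBy =
    SimpleGroupBy(groupBy { it[key] }.map { (k, rows) -> GroupByEntry(mapOf(key to k), rows) })

fun main() {
    val rows: List<Row> = listOf(
        mapOf("city" to "Oslo", "age" to 31),
        mapOf("city" to "Rome", "age" to 25),
        mapOf("city" to "Oslo", "age" to 47),
    )
    val grouped = rows.simpleGroupBy("city")

    // Only the key columns and the group are reachable from an entry.
    val big = grouped.filterEntries { it.group.size > 1 }
    println(big.mapEntries { "${it.keys["city"]}: ${it.group.size}" }) // [Oslo: 2]
}
```

Because an entry exposes only `keys` and `group`, the lambda cannot reach non-key columns of the original frame directly, which is the kind of access that currently causes exceptions in `filter {}`.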
We could try out this new mindset when creating, for instance, `takeEntries(10)` (as a shortcut for `.toDataFrame().take(10).asGroupBy()`) and maybe `take(10)` (as a shortcut for `.updateGroups { take(10) }`).
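The difference between the two readings can be illustrated with a small stand-alone model. This is only a sketch in plain Kotlin: `Entry` and `GroupedModel` are hypothetical names, not DataFrame classes, and `concat()` here simply flattens the groups back into one list.

```kotlin
// Hypothetical model, not the real Kotlin DataFrame API:
// a GroupBy as an ordered list of (key, group) pairs.
data class Entry<K, V>(val key: K, val group: List<V>)

class GroupedModel<K, V>(val entries: List<Entry<K, V>>) {
    // Entry-level: keep only the first n keys+groups.
    fun takeEntries(n: Int) = GroupedModel(entries.take(n))

    // Group-level: keep every key, but only the first n rows inside each group.
    fun take(n: Int) = GroupedModel(entries.map { it.copy(group = it.group.take(n)) })

    // Flattens all groups back into one list.
    fun concat(): List<V> = entries.flatMap { it.group }
}

fun main() {
    val grouped = GroupedModel(
        listOf(
            Entry("a", listOf(1, 2, 3)),
            Entry("b", listOf(4, 5)),
            Entry("c", listOf(6)),
        )
    )
    println(grouped.takeEntries(2).concat()) // [1, 2, 3, 4, 5]
    println(grouped.take(2).concat())        // [1, 2, 4, 5, 6]
}
```

With the `Entries` suffix on one of them, the call site says which level is being operated on, so neither reading has to be guessed.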