`GroupBy.take()` and other missing functions

Open Jolanrensen opened this issue 1 year ago • 3 comments

Currently, we can't do:

groupedDf
    .take(10)
    .concat()

to only concatenate the values of the first 10 groups. Instead, we have to convert to a normal DataFrame first and then convert back:

groupedDf
    .toDataFrame().take(10).asGroupBy()
    .concat()

The only row-based function that's available is filter(GroupedRowFilter), which lets you write .filter { it.index() < 10 }, but that seems a bit odd.

Jolanrensen, May 02 '24 14:05

Other missing functions include size(), drop(), first(), etc. Maybe we could make GroupBy an AnyFrame or a DataColumn/BaseColumn<GroupedDataRow>.

Jolanrensen, May 02 '24 14:05

Interestingly, .filter {} runs on a GroupedRowFilter<T, G>, where T is the original DF type. This allows type-safe access to all key columns, but also to all non-key columns, which don't exist in the GroupBy object, causing exceptions. This might need a slight redesign.

Jolanrensen, May 06 '24 13:05

> Interestingly, .filter {} runs on a GroupedRowFilter<T, G>, where T is the original DF type. This allows type-safe access to all key columns, but also to all non-key columns, which don't exist in the GroupBy object, causing exceptions. This might need a slight redesign.

https://github.com/Kotlin/dataframe/pull/663

koperagen, May 06 '24 14:05

GroupBy has some other curiosities I found while digging into it. It has a bunch of (extension) functions, which can be divided into three categories:

  1. Functions that run on the individual groups, for example:
     • Those that transform the groups, like add {}, cumSum {}, etc.
     • Reducing functions, like minBy {}, first(), last {}
     • aggregate {} belongs partly in both of the first two categories, but the functions inside its DSL run on individual groups
  2. Functions that run on the entire GroupBy by columns, or do something special:
     • aggregate {}, pivot {}, concat()/concatWithKeys(), into()/toDataFrame()
     • xs(), which filters columns from both the keys and the groups
     • sortBy {}
  3. Functions that run on the entire GroupBy, treating it per row of keys + group:
     • forEach {}, map {}/mapToFrames {}/mapToRows {}, filter {}
     • They are not mentioned on https://kotlin.github.io/dataframe/groupby.html
     • They use three different ways of representing a "row" of keys + group
The functions I envisioned for this issue belong in this final category, like take {}, dropWhile {}, and rowsCount(). However, it's apparent that this needs some extra steps:

  • We need a single way to represent an entry of keys + group
  • We need a way to distinguish functions that run on these entries from functions that run on the groups. For instance, say we introduce GroupBy.shuffled(): does it shuffle the rows of keys + groups, or just the rows inside each group? Likewise, does GroupBy.take(5) take the first 5 keys + groups, or the first 5 rows within each group? To me, this is already unclear for filter {} and map {}.
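
To make the ambiguity concrete, here is a minimal sketch using plain Kotlin collections instead of the real GroupBy type. The Entry class and the name takeInEachGroup are made up for illustration and are not part of the Kotlin DataFrame API; the point is only that "take" has two reasonable readings that give different results.

```kotlin
// Toy stand-in for a GroupBy: each entry pairs a key with the rows of its group.
// This is NOT the Kotlin DataFrame API; plain collections only.
data class Entry<K, R>(val key: K, val group: List<R>)

// Reading 1: keep the first n entries (keys + groups).
fun <K, R> List<Entry<K, R>>.takeEntries(n: Int): List<Entry<K, R>> = take(n)

// Reading 2: keep every entry, but only the first n rows inside each group.
fun <K, R> List<Entry<K, R>>.takeInEachGroup(n: Int): List<Entry<K, R>> =
    map { it.copy(group = it.group.take(n)) }

fun main() {
    val grouped = listOf(
        Entry("a", listOf(1, 2, 3)),
        Entry("b", listOf(4, 5)),
        Entry("c", listOf(6)),
    )
    println(grouped.takeEntries(2).map { it.key })        // [a, b]
    println(grouped.takeInEachGroup(2).map { it.group })  // [[1, 2], [4, 5], [6]]
}
```

Both readings are valid, which is why a bare take(n) on a GroupBy would need a naming convention to say which one it means.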

My suggestion would be to deprecate the existing 3 entry classes and introduce a new GroupByEntry. map, forEach, and filter would then be deprecated in favor of mapEntries, forEachEntry, and filterEntries respectively, all using that new type.

We could try out this new mindset when creating, for instance, takeEntries(10) (as a shortcut for .toDataFrame().take(10).asGroupBy()) and maybe take(10) (as a shortcut for .updateGroups { take(10) }).
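
A rough sketch of how that naming split could look, again on a toy model rather than the real library. GroupByEntry, mapEntries, forEachEntry, filterEntries, and takeEntries are the names proposed above; ToyGroupBy, the fields of GroupByEntry, and every signature here are assumptions made purely for illustration.

```kotlin
// Hypothetical model of the proposal; none of this is existing Kotlin DataFrame API.
// One single representation of "a row of keys + group".
data class GroupByEntry<K, R>(val index: Int, val key: K, val group: List<R>)

class ToyGroupBy<K, R>(private val groups: List<Pair<K, List<R>>>) {
    val entries: List<GroupByEntry<K, R>>
        get() = groups.mapIndexed { i, (k, g) -> GroupByEntry(i, k, g) }

    // Entry-level functions: the "Entries"/"Entry" suffix signals keys + group semantics.
    fun filterEntries(predicate: (GroupByEntry<K, R>) -> Boolean): ToyGroupBy<K, R> =
        ToyGroupBy(entries.filter(predicate).map { it.key to it.group })

    fun <T> mapEntries(transform: (GroupByEntry<K, R>) -> T): List<T> = entries.map(transform)

    fun forEachEntry(action: (GroupByEntry<K, R>) -> Unit) = entries.forEach(action)

    fun takeEntries(n: Int): ToyGroupBy<K, R> = ToyGroupBy(groups.take(n))

    // Group-level function: no suffix, applies to the rows inside every group.
    fun take(n: Int): ToyGroupBy<K, R> = ToyGroupBy(groups.map { (k, g) -> k to g.take(n) })

    // Toy counterpart of concat(): all remaining rows, in group order.
    fun concat(): List<R> = groups.flatMap { it.second }
}

fun main() {
    val gb = ToyGroupBy(listOf("a" to listOf(1, 2, 3), "b" to listOf(4, 5), "c" to listOf(6)))
    println(gb.takeEntries(2).concat())                 // [1, 2, 3, 4, 5]
    println(gb.take(1).concat())                        // [1, 4, 6]
    println(gb.filterEntries { it.index < 2 }.concat()) // [1, 2, 3, 4, 5]
}
```

In this model, takeEntries(n) and an index-based filterEntries give the same result, while take(n) deliberately means something different.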

Jolanrensen, Aug 18 '25 11:08