dataframe icon indicating copy to clipboard operation
dataframe copied to clipboard

Fix concat

Open AndreiKingsley opened this issue 1 year ago • 11 comments

concat removes key column entirely (name and values)

image image

AndreiKingsley avatar Apr 23 '24 14:04 AndreiKingsley

Maybe it's useful to add GroupBy.origin: DataFrame? that returns original dataframe if it was created via DataFrame.groupBy()

AndreiKingsley avatar Apr 23 '24 14:04 AndreiKingsley

I think this is the intended behavior. The key of the group is something temporary and usually consists of columns already in the DF. We are working on a way to access the group keys from aggregate though (https://github.com/Kotlin/dataframe/pull/662), maybe that can be a nice alternative.

The original DataFrame can be retrieved using concat (albeit with a different order perhaps).

Jolanrensen avatar Apr 23 '24 18:04 Jolanrensen

Ok, anyway new concat is needed for the purpose I described.

AndreiKingsley avatar Apr 23 '24 18:04 AndreiKingsley

This's my current solution https://github.com/Kotlin/kandy/blob/main/kandy-api/src/main/kotlin/org/jetbrains/kotlinx/kandy/dsl/internal/concatFixed.kt

AndreiKingsley avatar Apr 23 '24 18:04 AndreiKingsley

maybe a concatWithKeys() would be a nice addition?

Jolanrensen avatar Apr 24 '24 13:04 Jolanrensen

I think it won't hurt to make do it by default. One might say that df.groupBy { expr { } } is a shortcut for df.add() { }.groupBy {}

koperagen avatar Apr 24 '24 14:04 koperagen

if we do it by default, then we would get duplicate columns, because the key columns are often in the groups as well

Jolanrensen avatar Apr 24 '24 14:04 Jolanrensen

Andrey's implementation only adds "new" columns (or so i understood)

koperagen avatar Apr 24 '24 14:04 koperagen

But then, what qualifies as "new"?

  • groupBy { expr { myCol } }, yes
  • but groupBy { myCol + 1 }?
  • or groupBy { myCol named "other" }

I think we should be careful here

Jolanrensen avatar Apr 25 '24 10:04 Jolanrensen

There's also the case where a user creates a new expr column with a duplicate name that should still be kept, so my suggestion is the following: Create a concatWithKeys() that will add all key columns to the front of the groups regardless of whether they were in the DF already. Avoid naming clashes by using the ColumnNameGenerator, for instance with DynamicDataFrameBuilder.

Something like:

internal fun GroupBy<*, *>.concatWithKeys(): DataFrame<*> =
    mapToFrames {
        DynamicDataFrameBuilder()
            .apply {
                for (column in group.columns()) {
                    add(column)
                }
                val rowsCount = group.rowsCount()
                for ((name, value) in key.toMap()) {
                    add(List(rowsCount) { value }.toColumn(name))
                }
            }
            .toDataFrame()
            .moveToLeft { takeLast(key.count()) }
    }.concat()

Jolanrensen avatar May 06 '24 19:05 Jolanrensen

Alternatively, what's arguably a lot simpler, we could just explode the groups column. Like:

internal fun GroupBy<*, *>.concatWithKeys(): DataFrame<*> =
    toDataFrame().explode { groups }

This will generate extra key values where necessary and keep the grouped columns in a column group, avoiding any potential name clashes :).

Jolanrensen avatar May 07 '24 10:05 Jolanrensen

@AndreiKingsley could you please finish the story

zaleslaw avatar Mar 11 '25 18:03 zaleslaw