Efficient KotlinNotebookPluginUtils.sortByColumns

Open koperagen opened this issue 3 months ago • 2 comments

Take a dataframe with 10_000_000 rows, open it in a table widget in notebooks, and sort by a column. Sorting and loading take a lot of time. I don't have a profile, so I can't say for sure where the actual bottleneck is; it needs to be investigated. However, this method sorts the entire dataframe, all 10 million rows, even when only 20 or 100 are going to be displayed. There are more efficient algorithms for such situations, for example `least` or `greatest` from Guava: https://guava.dev/releases/snapshot-jre/api/docs/com/google/common/collect/Comparators.html#least(int,java.util.Comparator)
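To illustrate the idea, here is a minimal sketch of top-k selection with a bounded heap. It is not Guava's implementation and not existing DataFrame code: `leastK` is a hypothetical helper name, and a `java.util.PriorityQueue` stands in for whatever selection algorithm ends up being used. It keeps at most `k` elements in memory and does O(n log k) comparisons instead of sorting all n rows.

```kotlin
import java.util.PriorityQueue

// Hypothetical helper (not part of the DataFrame or Guava API).
// Returns the k smallest elements of `items`, in ascending order by `comparator`.
fun <T : Any> leastK(items: Iterable<T>, k: Int, comparator: Comparator<T>): List<T> {
    require(k >= 0) { "k must be non-negative" }
    if (k == 0) return emptyList()

    // Reversed comparator turns the queue into a max-heap:
    // its head is the largest of the elements retained so far.
    val heap = PriorityQueue<T>(k, comparator.reversed())

    for (item in items) {
        if (heap.size < k) {
            // Still filling up to k elements.
            heap.add(item)
        } else if (comparator.compare(item, heap.peek()) < 0) {
            // `item` is smaller than the current largest: swap it in.
            heap.poll()
            heap.add(item)
        }
    }

    // Only the k survivors get fully sorted.
    return heap.sortedWith(comparator)
}
```

For example, `leastK(listOf(42, 7, 19, 3, 88, 3, 61), 3, naturalOrder<Int>())` returns `[3, 3, 7]`. Note that this sketch is not stable for equal keys, so a real implementation for the table widget would probably need a tie-breaker such as the original row index.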

koperagen avatar Sep 09 '25 11:09 koperagen

Good idea!

We use a similar algorithm, quickselect, in our percentile/median/quantile implementation: https://github.com/Kotlin/dataframe/blob/b46524691922c1c49c5258b2f74d7ac8aa817c85/core/src/main/kotlin/org/jetbrains/kotlinx/dataframe/math/quantile.kt#L281 though that only returns a single element.

According to the TopKSelector source, their solution uses less memory than quickselect :)

Jolanrensen avatar Sep 09 '25 11:09 Jolanrensen

It's worth looking at serialization/deserialization too. Just sorting 10 million rows shouldn't really take long, so I suspect something else is affecting the performance.

koperagen avatar Sep 09 '25 11:09 koperagen