opteryx ✨ rewrite GROUP BY

✨ rewrite GROUP BY

Open joocer opened this issue 1 year ago • 2 comments

trafficstars

the current GROUP BY functionality is the one from pyarrow, whilst this is performant, it has limitations which aren't ideal such as not being able to GROUP BY on relations with ARRAY or STRUCT columns.

It also requires the entire dataset to be loaded into memory to GROUP BY, which isn't ideal.

Aug 07 '24 10:08 joocer

We can probably get reasonable performance for an initial removal of pyarrow using a hash map and in place aggregation in a morsel-by-morsel execution.

We would need two maps

The map from the GROUP BY tuple to the group key and the group key to the value of the agg.

Jan 06 '25 18:01 joocer

Our initial Python and pyarrow rewrite of the group by isn't very performant, but we weren't expecting it to be.

We're using arrow to do incremental group by and then combining using python dicts.

This is about 7x slower for our 10m test dataset but about 30% faster for our 100m row dataset. Cardinality will be a huge factor as well as size.

Enhancements to try include.

using group by to combine partials
replace group by with cython

Jan 27 '25 07:01 joocer

opteryx opteryx copied to clipboard

✨ rewrite GROUP BY

opteryx
opteryx copied to clipboard