opteryx
opteryx copied to clipboard
✨ rewrite GROUP BY
the current GROUP BY functionality is the one from pyarrow, whilst this is performant, it has limitations which aren't ideal such as not being able to GROUP BY on relations with ARRAY or STRUCT columns.
It also requires the entire dataset to be loaded into memory to GROUP BY, which isn't ideal.
We can probably get reasonable performance for an initial removal of pyarrow using a hash map and in place aggregation in a morsel-by-morsel execution.
We would need two maps
The map from the GROUP BY tuple to the group key and the group key to the value of the agg.
Our initial Python and pyarrow rewrite of the group by isn't very performant, but we weren't expecting it to be.
We're using arrow to do incremental group by and then combining using python dicts.
This is about 7x slower for our 10m test dataset but about 30% faster for our 100m row dataset. Cardinality will be a huge factor as well as size.
Enhancements to try include.
- using group by to combine partials
- replace group by with cython