Hash-based aggregation support for GROUP BY

Open livius2 opened this issue 8 months ago • 1 comments

Currently, Firebird processes GROUP BY clauses using sort-based aggregation, which is reliable and integrates well with ORDER BY, but can be memory and I/O-intensive for large datasets.

An alternative is hash-based aggregation strategy, which could significantly improve performance and memory usage, especially in queries involving large datasets with relatively few unique grouping keys.

Hash-based aggregation – Details

For each input record, a hash of the grouping key is calculated.
The hash table is checked for the presence of that key.
If it exists – the corresponding aggregate is updated (e.g. summing, counting).
If it doesn’t exist – a new entry is created with the initial aggregate value.
The current record is immediately discarded after processing – there’s no need to keep it in memory.

Memory usage

Memory is consumed only by: Aggregated data per group (hash_key → aggregates: sum, count, max, etc.). Internal structures of the hash table (e.g. buckets, collision chains). There is no need to buffer the full dataset or multiple rows at once. Memory usage scales linearly with the number of unique groups, not with the number of total records.

Example If a query processes 10 million records but only 1,000 unique groups, the hash table needs to hold just 1,000 entries — not all 10 million records.

It would be nice if the team implemented it :)

Apr 05 '25 07:04 livius2

It's already in the plans.

Apr 05 '25 09:04 dyemanov