ArcticDB
Performance 18254756429: Improve hash grouping aggregation parallelism
Reference Issues/PRs
What does this implement or fix?
Poor-quality hash implementations for integral types, including at least some implementations of std::hash, are effectively a static cast, e.g. std::hash<int64_t>{}(100) == 100. This is fast, but leads to poor distributions in our bucketing, where we take the hash modulo the number of buckets. In particular, performing a hash grouping on a timeseries whose time points are dates results in all of the rows being partitioned into bucket zero, which in turn means there is no parallelism in the aggregation clause.
Switch to a single hash function with improved uniformity, used consistently across all supported platforms.
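
The effect described above can be reproduced with a small standalone sketch. It is not the code from this PR. It assumes timestamps are stored as nanoseconds since the epoch and that there are 16 buckets (both assumptions for illustration), models the identity-style hash explicitly rather than relying on any particular std::hash implementation, and uses the splitmix64 mixer as a hypothetical stand-in for a better-distributed hash. All function names here are made up for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Models an identity-style hash, i.e. what a static cast of the key gives.
inline std::uint64_t identity_hash(std::int64_t v) {
    return static_cast<std::uint64_t>(v);
}

// splitmix64 mixer: a hypothetical stand-in for a hash with better
// uniformity. The function actually chosen in the PR may differ.
inline std::uint64_t mixing_hash(std::int64_t v) {
    std::uint64_t x = static_cast<std::uint64_t>(v) + 0x9e3779b97f4a7c15ULL;
    x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ULL;
    x = (x ^ (x >> 27)) * 0x94d049bb133111ebULL;
    return x ^ (x >> 31);
}

// Midnight timestamps for consecutive days, in nanoseconds since the epoch
// (assumed representation).
inline std::vector<std::int64_t> daily_timestamps_ns(std::size_t days) {
    const std::int64_t ns_per_day = 86400LL * 1000000000LL;
    std::vector<std::int64_t> keys;
    keys.reserve(days);
    for (std::size_t d = 0; d < days; ++d) {
        keys.push_back(static_cast<std::int64_t>(d) * ns_per_day);
    }
    return keys;
}

// Counts how many buckets receive at least one key when bucketing by
// hash(key) % num_buckets.
template <typename Hash>
std::size_t count_nonempty_buckets(const std::vector<std::int64_t>& keys,
                                   std::size_t num_buckets, Hash hash) {
    assert(num_buckets > 0);
    std::vector<bool> used(num_buckets, false);
    for (std::int64_t k : keys) {
        used[hash(k) % num_buckets] = true;
    }
    std::size_t nonempty = 0;
    for (bool u : used) {
        nonempty += u ? 1 : 0;
    }
    return nonempty;
}
```

One day is 86400 * 10^9 ns, which is divisible by 2^16, so under these assumptions every daily key is a multiple of 16 and `count_nonempty_buckets(daily_timestamps_ns(365), 16, identity_hash)` returns 1: all rows land in bucket zero. Passing `mixing_hash` instead spreads the same keys over multiple buckets, which is what lets the aggregation run in parallel.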