DRILL-6825: Apply different hash algorithms to different data types
This PR introduces these main changes:
- Add a hash32(int index) method to the ValueVector interface.
- For integer data types, introduce the multiply-shift algorithm (https://gist.github.com/badboy/6267743), which the paper https://15721.courses.cs.cmu.edu/spring2019/papers/17-hashjoins/richter-vldb2015.pdf showed to perform well (see the sketch after this list). Other data types such as double, float, and string still use the Murmur3 hash method.
- Since HashJoin needs to handle implicit casting, it treats all numeric data types as double, so this PR only applies to the HashAgg operator.
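For reference, a minimal sketch of the multiply-shift scheme as defined in the paper above, h(x) = (a * x) >> (w - l) with a random odd w-bit multiplier a; the class and method names below are illustrative only and are not the actual IntegerHashing code in this PR:

```java
import java.util.concurrent.ThreadLocalRandom;

// Illustrative multiply-shift hash: multiply by a random odd 64-bit constant
// and keep the high 32 bits of the product.
public final class MultiplyShift {
  private final long a;

  public MultiplyShift(long seed) {
    this.a = seed | 1L;   // the multiplier must be odd
  }

  /** Returns a 32-bit hash of a long key: (a * x) >>> (64 - 32). */
  public int hash32(long key) {
    return (int) ((a * key) >>> 32);
  }

  public static void main(String[] args) {
    MultiplyShift h = new MultiplyShift(ThreadLocalRandom.current().nextLong());
    System.out.println(h.hash32(12345L));
  }
}
```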
@Ben-Zvi could you review this PR ?
@weijietong can you add a detailed description of your work - the hash functions are a sensitive part of the execution, and possibly more people would like to look into the changes.
@Ben-Zvi I have added some description.
Adding the hash32() method to the ValueVector is useful; however, picking algorithms just because they appear in a paper or are well known may not be good enough. At my previous employer I evaluated many hash functions by actually running standalone performance and distribution tests. One clear result back then was that Murmur performed well on long strings, but less well on shorter data.
How do the new hash functions in IntegerHashing compare with the existing one in HashHelper ?
The Boost implementation of hash_combine looks "fishy" (e.g., some bits are used more than others) -- see some more critique at https://stackoverflow.com/questions/35985960/c-why-is-boosthash-combine-the-best-way-to-combine-hash-values
Why can't the seed be given directly to the hash function instead of being "combined" later ?
Another good hash function used in the past (I don't recall its name) worked with a table of 256 prime numbers: starting with the seed, the code used each input byte as an index into the table - rotate the old value, XOR it with the newly mapped value, and continue ....
The above experience was in C++; maybe things perform differently in Java.
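A rough sketch of that 256-prime-table hash, purely for illustration: the prime table, the rotation amount, and the class name below are all assumptions, since the original function's details are not recalled.

```java
import java.nio.charset.StandardCharsets;

// Illustrative table-based hash: rotate the running value and XOR it with a
// prime selected by the current input byte.
public final class PrimeTableHash {
  private static final int[] PRIMES = new int[256];

  static {
    // Fill the table with the first 256 primes (the original table is unknown).
    int candidate = 2;
    for (int i = 0; i < 256; i++) {
      while (!isPrime(candidate)) {
        candidate++;
      }
      PRIMES[i] = candidate++;
    }
  }

  private static boolean isPrime(int n) {
    for (int d = 2; (long) d * d <= n; d++) {
      if (n % d == 0) {
        return false;
      }
    }
    return n >= 2;
  }

  public static int hash32(byte[] data, int seed) {
    int h = seed;
    for (byte b : data) {
      h = Integer.rotateLeft(h, 5);   // rotate the old value (5 is an arbitrary choice)
      h ^= PRIMES[b & 0xFF];          // XOR with the mapped prime
    }
    return h;
  }

  public static void main(String[] args) {
    byte[] key = "drill".getBytes(StandardCharsets.UTF_8);
    System.out.println(hash32(key, 0x12345678));
  }
}
```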
Also - do you know of any open source hash functions we can just import instead of writing the code in Drill ?
The IntegerHashing method is also used in ClickHouse for integer types (see the intHash32 method in https://github.com/yandex/ClickHouse/blob/master/dbms/src/Common/HashTable/Hash.h). ClickHouse chooses its hashing method according to the data type and key width, which is valuable for us to learn from. As you mentioned, Murmur3 does not perform well in the shorter-integer case, so it is better to use IntegerHash for integer keys.
I had already read the discussion of the Boost implementation that you mentioned. But I think it is reasonable that Boost, as a base library, still keeps its current implementation.
The reason for keeping the seed out of the hash32 function and involving Boost's hash_combine method is that I want to change the current hashing strategy later. For the multi-key case, I plan to change the row-iterate model hash32(hash32(hash32(...))) to a column-combine model hash32() hash_combine hash32() hash_combine hash32(). The row-iterate model has a data dependency between keys and hurts CPU pipeline performance.
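To make the two strategies concrete, here is a minimal sketch under my own assumptions: hash32() below is only a stand-in for the per-column hash in this PR, the combine step follows Boost's hash_combine formula, and the class and method names are illustrative.

```java
// Contrast of the two multi-key hashing strategies discussed above.
public final class MultiKeyHashSketch {

  // Stand-in per-column hash; the real code dispatches per value vector type.
  static int hash32(int value, int seed) {
    int h = (value ^ seed) * 0x9E3779B1;
    return h ^ (h >>> 16);
  }

  /** Row-iterate model: each key's hash seeds the next one, a serial chain per row. */
  static int rowIterate(int[] rowKeys) {
    int h = 0;
    for (int k : rowKeys) {
      h = hash32(k, h);   // hash32(hash32(hash32(...)))
    }
    return h;
  }

  /** Column-combine model: hash whole columns independently, then fold per row. */
  static int[] columnCombine(int[][] columns, int rowCount) {
    int[][] colHashes = new int[columns.length][rowCount];
    for (int c = 0; c < columns.length; c++) {
      for (int r = 0; r < rowCount; r++) {
        colHashes[c][r] = hash32(columns[c][r], 0);   // no cross-column data dependency
      }
    }
    int[] out = new int[rowCount];
    for (int r = 0; r < rowCount; r++) {
      int seed = 0;
      for (int c = 0; c < columns.length; c++) {
        int h = colHashes[c][r];
        seed ^= h + 0x9E3779B9 + (seed << 6) + (seed >>> 2);   // Boost-style hash_combine
      }
      out[r] = seed;
    }
    return out;
  }

  public static void main(String[] args) {
    int[][] cols = { {1, 2, 3}, {10, 20, 30} };
    System.out.println(rowIterate(new int[] {1, 10}) + " vs " + columnCombine(cols, 3)[0]);
  }
}
```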
Other hashing methods I know of can be found here: https://github.com/benalexau/hash-bench. It is a collection of Java hashing methods. The benchmark I ran showed that the city_1_1 function from https://github.com/OpenHFT/Zero-Allocation-Hashing/blob/master/src/main/java/net/openhft/hashing/LongHashFunction.java has the best performance at 32- and 64-byte key widths.
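For anyone who wants to try it, a small usage sketch of that library (assuming the city_1_1() factory and hashLong() method of net.openhft.hashing.LongHashFunction; the fold to 32 bits is my own addition, not part of the library):

```java
import net.openhft.hashing.LongHashFunction;

public class CityHashExample {
  public static void main(String[] args) {
    LongHashFunction city = LongHashFunction.city_1_1();
    long h64 = city.hashLong(123456789L);    // 64-bit CityHash of a long key
    int h32 = (int) (h64 ^ (h64 >>> 32));    // fold to 32 bits if a hash32 is needed
    System.out.println(h32);
  }
}
```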
I also wonder whether we could later do the join keys' implicit casting at the Project node, so that the HashJoin and Exchange operators could also benefit from this PR.
@Ben-Zvi I have run a benchmark comparing IntegerHash and MurmurHash3_32. The result shows that IntegerHash has nearly twice the performance of MurmurHash3_32.
| Benchmark | Hash | Mode | Cnt | Score |
| --- | --- | --- | --- | --- |
| HashBenchInteger.hashInteger | IntegerHash | avgt | 5 | 3.987 ± 0.162 ns/op |
| HashBenchInteger.hashInteger | Murmur3_32 | avgt | 5 | 6.626 ± 0.085 ns/op |
| HashBenchInteger.hashLong | IntegerHash | avgt | 5 | 4.903 ± 0.320 ns/op |
| HashBenchInteger.hashLong | Murmur3_32 | avgt | 5 | 8.525 ± 0.649 ns/op |
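For context, a JMH harness of roughly this shape could produce such a table; this is a hedged reconstruction, not the author's actual HashBenchInteger class, and it uses a simple multiplicative mixer as a stand-in for IntegerHash and Guava's Murmur3_32 as a stand-in for Drill's HashHelper implementation.

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Param;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Thread)
public class HashBenchInteger {

  @Param({"IntegerHash", "Murmur3_32"})
  String hash;

  int intKey;
  long longKey;

  @Setup
  public void setup() {
    intKey = ThreadLocalRandom.current().nextInt();
    longKey = ThreadLocalRandom.current().nextLong();
  }

  @Benchmark
  public int hashInteger() {
    if ("IntegerHash".equals(hash)) {
      int h = intKey * 0x9E3779B1;   // stand-in integer mixer
      return h ^ (h >>> 16);
    }
    return com.google.common.hash.Hashing.murmur3_32().hashInt(intKey).asInt();
  }

  @Benchmark
  public int hashLong() {
    if ("IntegerHash".equals(hash)) {
      long h = longKey * 0x9E3779B97F4A7C15L;
      return (int) (h ^ (h >>> 32));
    }
    return com.google.common.hash.Hashing.murmur3_32().hashLong(longKey).asInt();
  }
}
```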
The city_1_1 hash algorithm mentioned above performs better than Murmur3_32, but not as well as IntegerHash.
I also ran a comparison of two SQL queries, using IntegerHash and MurmurHash3_32 respectively for the long data type. Hashing is the hotspot of these queries. Q1:
select c.c_custkey, count(*)
from dfs.`/tpch100/customer` c
group by c.c_custkey
Q2:
select c.c_custkey,c.c_nationkey, count(*)
from dfs.`/tpch100/customer` c
group by c.c_custkey,c.c_nationkey
The dataset is TPC-H scale factor 100. The query results show that for Q1, IntegerHash gives a 5% query performance improvement over MurmurHash3_32 (IntegerHash: 14.029 sec, MurmurHash3_32: 14.870 sec), and for Q2, a 16% improvement (IntegerHash: 20.499 sec, MurmurHash3_32: 23.921 sec).
It is clear that the more integer-type columns appear in the group-by, the greater the query performance gain.
@Ben-Zvi The hash code computation of HashAgg has been moved to HashAggTemplate. Your review suggestions have been applied. Please take a look, thanks!
@weijietong - how about postponing this work to 1.17.0 ? It is a change in a critical part of the engine, and we need to get a better understanding of the impact, and 1.16.0 is too close. (Also I am very busy with some 1.16 work now, and have no time to test myself). In 1.17.0 the work can be extended to Hash-Join and exchanges to be more complete.
@Ben-Zvi I agree with you. I have changed the issue's target to 1.17.0.
@Ben-Zvi Do you have time to continue this work?
@weijietong what's the status of this PR?