HAWQ-1660. Optimize parquet scan when bloom filter enabled.
This is a good optimization. When many columns are projected, we can fetch only the join key first and run the bloom filter check; if the key doesn't match, there is no need to fetch the other columns.
But in this PR, when the bloom filter is not enabled, the scan still fetches the join key in a first loop and the other columns in a second loop, which needs some further refinement.
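To make the suggestion concrete, here is a rough sketch of the control flow I have in mind. It is Python, not HAWQ code: `ColumnReader`, `scan` and `might_contain` are hypothetical names, and a plain set stands in for the bloom filter (so there are no false positives here, unlike a real one). The point is only that the two-step fetch should happen when a filter is present, and a single pass should be used otherwise.

```python
from typing import Callable, Dict, Iterator, List, Optional, Sequence

Row = Dict[str, object]


class ColumnReader:
    """Hypothetical stand-in for a parquet column reader (one value list per column)."""

    def __init__(self, columns: Dict[str, List[object]]):
        self.columns = columns
        # Count how many values were read from each column.
        self.fetches = {name: 0 for name in columns}

    def num_rows(self) -> int:
        return len(next(iter(self.columns.values())))

    def fetch(self, column: str, row: int) -> object:
        self.fetches[column] += 1
        return self.columns[column][row]


def scan(
    reader: ColumnReader,
    projected: Sequence[str],
    join_key: str,
    might_contain: Optional[Callable[[object], bool]] = None,
) -> Iterator[Row]:
    """Yield projected rows; might_contain=None means the bloom filter is disabled."""
    for row in range(reader.num_rows()):
        if might_contain is None:
            # Bloom filter disabled: fetch every projected column in one pass.
            yield {c: reader.fetch(c, row) for c in projected}
            continue
        # Bloom filter enabled: fetch only the join key first.
        key = reader.fetch(join_key, row)
        if not might_contain(key):
            continue  # no possible match, skip the remaining columns
        # Possible match: fetch the remaining columns, reusing the key already read.
        yield {c: key if c == join_key else reader.fetch(c, row) for c in projected}
```

With a filter that accepts 2 of 4 join keys, the non-key columns are read only twice each, while the disabled path reads each column once per row in a single loop.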
@kuien I ran a perf test on your PR and found two issues:
- wrong query result
- performance regression
See the details below and please check the code, thanks.
TPC-H 1G data on my Mac, master code:
tpch=# select count (*) from part, lineitem where p_partkey = l_partkey and p_brand = 'Brand#23' and p_container = 'MED BOX';
count
-------
6088
(1 row)
Time: 3150.873 ms
tpch=# set hawq_hashjoin_bloomfilter to on;
SET
Time: 2.903 ms
tpch=# select count (*) from part, lineitem where p_partkey = l_partkey and p_brand = 'Brand#23' and p_container = 'MED BOX';
count
-------
6088
(1 row)
Time: 1512.782 ms
Your code:
tpch=# select count (*) from part, lineitem where p_partkey = l_partkey and p_brand = 'Brand#23' and p_container = 'MED BOX';
count
-------
6088
(1 row)
Time: 49466.999 ms #<-- result ok, but bad performance
tpch=# set hawq_hashjoin_bloomfilter to on;
SET
Time: 13.106 ms
tpch=# select count (*) from part, lineitem where p_partkey = l_partkey and p_brand = 'Brand#23' and p_container = 'MED BOX';
count
-------
0 #<-- result error
(1 row)
Time: 1888.176 ms
@kuien Btw: if you also test on a Mac, you can generate the TPC-H data with my dbgen tools: https://github.com/interma/misc/tree/master/hawq/tpch_mac