hawq HAWQ-1660. Optimize parquet scan when bloom filter enabled.

Open kuien opened this issue 6 years ago • 3 comments

Sep 20 '18 10:09 kuien

This is a good optimization. When many columns are projected, we can fetch only the join key first and check it against the bloom filter; if it doesn't match, there is no need to fetch the other columns.

However, in this PR, when the bloom filter is not enabled, the scan still fetches the join key in a first loop and the other columns in a second loop. That path needs some further refinement.
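
To make the two cases concrete, here is a minimal Python sketch of the idea, not the HAWQ implementation. Every name in it (`ColumnStore`, `scan`, `key_filter`) is hypothetical, and `key_filter` merely stands in for the bloom filter probe. It assumes the join key is one of the projected columns. With a filter, the join key is read first and the remaining columns only for rows that pass; without a filter, all columns are read in a single pass instead of two loops.

```python
# Hypothetical sketch: none of these names come from the HAWQ source.

class ColumnStore:
    """Toy columnar table that counts how many column values are fetched."""

    def __init__(self, columns):
        self.columns = columns  # dict: column name -> list of values
        self.fetches = 0        # number of individual values read so far

    def num_rows(self):
        return len(next(iter(self.columns.values())))

    def fetch(self, name, row):
        self.fetches += 1
        return self.columns[name][row]


def scan(store, projected, join_key, key_filter=None):
    """Yield projected rows; key_filter stands in for the bloom filter probe.

    Assumes join_key is one of the projected columns.
    """
    for row in range(store.num_rows()):
        if key_filter is None:
            # No bloom filter: read every projected column in a single pass.
            yield {name: store.fetch(name, row) for name in projected}
            continue
        # Bloom filter enabled: read only the join key first.
        key = store.fetch(join_key, row)
        if not key_filter(key):
            continue  # definitely no match, so skip the other columns
        # Possible match: now fetch the remaining projected columns.
        yield {
            name: key if name == join_key else store.fetch(name, row)
            for name in projected
        }


store = ColumnStore({
    "l_partkey": [1, 2, 3, 4],
    "l_quantity": [10, 20, 30, 40],
    "l_comment": ["a", "b", "c", "d"],
})
cols = ["l_partkey", "l_quantity", "l_comment"]
rows = list(scan(store, cols, "l_partkey", key_filter=lambda k: k == 2))
print(rows, store.fetches)
```

In this toy run the filtered scan reads four join keys plus two extra values for the one passing row, where an unfiltered scan would read all twelve values.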

Sep 21 '18 03:09 linwen

@kuien I ran a performance test on your PR and found two issues:

wrong query result (with the bloom filter on)
performance regression (with the bloom filter off)

See the details below. Please check the code, thanks.

TPC-H 1 GB data on my Mac, master code:

tpch=# select count (*) from part, lineitem where p_partkey = l_partkey and p_brand = 'Brand#23' and p_container = 'MED BOX';
 count
-------
  6088
(1 row)

Time: 3150.873 ms
tpch=# set hawq_hashjoin_bloomfilter to on;
SET
Time: 2.903 ms
tpch=# select count (*) from part, lineitem where p_partkey = l_partkey and p_brand = 'Brand#23' and p_container = 'MED BOX';
 count
-------
  6088
(1 row)

Time: 1512.782 ms

Your code:

tpch=# select count (*) from part, lineitem where p_partkey = l_partkey and p_brand = 'Brand#23' and p_container = 'MED BOX';
 count
-------
  6088
(1 row)

Time: 49466.999 ms #<-- result ok, but bad performance
tpch=# set hawq_hashjoin_bloomfilter to on;
SET
Time: 13.106 ms
tpch=# select count (*) from part, lineitem where p_partkey = l_partkey and p_brand = 'Brand#23' and p_container = 'MED BOX';
 count
-------
     0 #<-- result error
(1 row)

Time: 1888.176 ms
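
The count of 0 points at rows being dropped that should have matched. A bloom filter may report false positives but never false negatives, so turning it on must not change the join result. The following is a minimal Python sketch of that invariant, not HAWQ code. `BloomFilter` and `join_count` are hypothetical names, the hashing scheme is an arbitrary choice for illustration, and the build keys are assumed unique (as `p_partkey` is in TPC-H).

```python
import hashlib


class BloomFilter:
    """Tiny illustrative bloom filter (hypothetical, not HAWQ's)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for simplicity

    def _positions(self, key):
        # Derive num_hashes bit positions from seeded SHA-256 digests.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(key))


def join_count(build_keys, probe_keys, use_bloom):
    """Count probe rows that match a build key, optionally pre-filtered."""
    hash_table = set(build_keys)
    bloom = None
    if use_bloom:
        bloom = BloomFilter()
        for key in build_keys:
            bloom.add(key)
    count = 0
    for key in probe_keys:
        if bloom is not None and not bloom.might_contain(key):
            continue  # safe to skip: an added key is never reported absent
        if key in hash_table:
            count += 1
    return count


build = list(range(0, 200, 7))
probe = list(range(200)) * 2
# The filter is only a shortcut, so both settings must agree.
assert join_count(build, probe, True) == join_count(build, probe, False)
```

A regression check of the same shape, comparing the query result with `hawq_hashjoin_bloomfilter` off and on, would have caught the mismatch above.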

Oct 02 '18 12:10 interma

@kuien By the way, if you also test on a Mac, you can generate the TPC-H data with my dbgen tools: https://github.com/interma/misc/tree/master/hawq/tpch_mac

Oct 02 '18 13:10 interma
