HAWQ-1660. Optimize parquet scan when bloom filter enabled.

Open kuien opened this issue 6 years ago • 3 comments

kuien, Sep 20 '18 10:09

It is a good optimization point. If many columns are projected, we can fetch only the join key first and do a bloom filter check; if the key doesn't match, there is no need to fetch the other columns.

But in this PR, if the bloom filter is not enabled, the scan still fetches the join key in the first loop and the other columns in the second loop, which needs a little further refinement.
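
To make the idea concrete, here is a minimal Python sketch of the scan described above. It is not the HAWQ implementation (which is C code in the parquet reader); `SimpleBloomFilter`, `ColumnStore`, and `scan` are hypothetical names used only for illustration. It shows the two-phase fetch when a bloom filter is present, and a plain single pass over all projected columns when it is not, which is roughly the refinement suggested above.

```python
from typing import Dict, Iterator, List, Optional


class SimpleBloomFilter:
    """Toy bloom filter for illustration; not HAWQ's implementation."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)

    def _positions(self, key: int) -> Iterator[int]:
        # Derive several bit positions from one key by mixing in a seed.
        for seed in range(self.num_hashes):
            yield hash((seed, key)) % self.num_bits

    def add(self, key: int) -> None:
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key: int) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(key))


class ColumnStore:
    """Hypothetical column-oriented table that counts value fetches."""

    def __init__(self, columns: Dict[str, list]) -> None:
        self.columns = columns
        self.reads = 0

    def num_rows(self) -> int:
        return len(next(iter(self.columns.values())))

    def read(self, column: str, row: int):
        self.reads += 1
        return self.columns[column][row]


def scan(store: ColumnStore, projected: List[str], join_key: str,
         bloom: Optional[SimpleBloomFilter] = None) -> Iterator[dict]:
    for row in range(store.num_rows()):
        if bloom is None:
            # Filter disabled: fetch every projected column in one pass.
            yield {col: store.read(col, row) for col in projected}
            continue
        # Filter enabled: fetch only the join key first.
        key = store.read(join_key, row)
        if not bloom.might_contain(key):
            continue  # row cannot join, so skip the remaining columns
        yield {col: key if col == join_key else store.read(col, row)
               for col in projected}


if __name__ == "__main__":
    table = {
        "l_partkey": [1, 2, 3, 4],
        "l_quantity": [10, 20, 30, 40],
        "l_extendedprice": [1.5, 2.5, 3.5, 4.5],
    }
    bloom = SimpleBloomFilter()
    bloom.add(2)  # pretend the hash-join build side contains only key 2
    store = ColumnStore(table)
    rows = list(scan(store, list(table), "l_partkey", bloom))
    print(rows, "values fetched:", store.reads)
```

Because a bloom filter can return false positives but never false negatives, skipping rows this way should not change the join result; it only saves fetches for rows that cannot match.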

linwen, Sep 21 '18 03:09

@kuien I ran a perf test on your PR and found two issues:

  1. wrong query result
  2. performance regression

See the details below; please check the code, thanks.

TPC-H 1 GB data on my Mac, master code

tpch=# select count (*) from part, lineitem where p_partkey = l_partkey and p_brand = 'Brand#23' and p_container = 'MED BOX';
count
-------
  6088
(1 row)

Time: 3150.873 ms
tpch=# set hawq_hashjoin_bloomfilter to on;
SET
Time: 2.903 ms
tpch=# select count (*) from part, lineitem where p_partkey = l_partkey and p_brand = 'Brand#23' and p_container = 'MED BOX';
count
-------
  6088
(1 row)

Time: 1512.782 ms

your code

tpch=# select count (*) from part, lineitem where p_partkey = l_partkey and p_brand = 'Brand#23' and p_container = 'MED BOX';
 count
-------
  6088
(1 row)

Time: 49466.999 ms #<-- result ok, but bad performance
tpch=# set hawq_hashjoin_bloomfilter to on;
SET
Time: 13.106 ms
tpch=# select count (*) from part, lineitem where p_partkey = l_partkey and p_brand = 'Brand#23' and p_container = 'MED BOX';
 count
-------
     0 #<-- result error
(1 row)

Time: 1888.176 ms 

interma, Oct 02 '18 12:10

@kuien BTW, if you also test on a Mac, you can generate the TPC-H data with my dbgen tools: https://github.com/interma/misc/tree/master/hawq/tpch_mac

interma, Oct 02 '18 13:10