spark-rapids [FEA] Support bloom filter joins on Databricks

Relates to #7803. Databricks does not use BloomFilterMightContain and BloomFilterAggregate when implementing Bloom-filter-assisted joins. Instead, it uses BlockBloomFilterMightContain and BlockBloomFilterAggregate.

Aug 03 '23 13:08 jlowe

Initial scope is to understand the impact of the fallback by running benchmarks on Databricks 11.3/12.2/13.3.

Feb 13 '24 21:02 mattahrens

NDS SF3K benchmark experiment results with spark.sql.optimizer.runtime.bloomFilter.enabled set to true (test) and false (baseline):

--------------------------------------------------------------------
Name = query16
Means = 4938.2, 2899.4
Time diff = 2038.7999999999997
Speedup = 1.703179968269297
T-Test (test statistic, p value, df) = 30.182426342750624, 1.5758462055072497e-09, 8.0
T-Test Confidence Interval = 1883.0311702473427, 2194.5688297526567
ALERT: significant change has been detected (p-value < 0.05)
ALERT: improvement in performance has been observed
--------------------------------------------------------------------
Name = query64
Means = 18665.8, 12114.6
Time diff = 6551.199999999999
Speedup = 1.5407689894837633
T-Test (test statistic, p value, df) = 48.586627207936644, 3.562824507885676e-11, 8.0
T-Test Confidence Interval = 6240.268882580042, 6862.131117419955
ALERT: significant change has been detected (p-value < 0.05)
ALERT: improvement in performance has been observed
--------------------------------------------------------------------
Name = query93
Means = 12480.4, 11776.6
Time diff = 703.7999999999993
Speedup = 1.059762580031588
T-Test (test statistic, p value, df) = 2.7663393196104678, 0.024434393189381686, 8.0
T-Test Confidence Interval = 117.11647251971294, 1290.4835274802856
ALERT: significant change has been detected (p-value < 0.05)
ALERT: improvement in performance has been observed
--------------------------------------------------------------------
Name = query94
Means = 5247.2, 3190.4
Time diff = 2056.7999999999997
Speedup = 1.6446840521564694
T-Test (test statistic, p value, df) = 36.89898739574463, 3.1911147346807857e-10, 8.0
T-Test Confidence Interval = 1928.2601771027562, 2185.3398228972433
ALERT: significant change has been detected (p-value < 0.05)
ALERT: improvement in performance has been observed
--------------------------------------------------------------------
Name = query95
Means = 8282.2, 6988.6
Time diff = 1293.6000000000004
Speedup = 1.185101450934379
T-Test (test statistic, p value, df) = 5.730056177701596, 0.0004389295644865668, 8.0
T-Test Confidence Interval = 773.0035422047629, 1814.1964577952378
ALERT: significant change has been detected (p-value < 0.05)
ALERT: improvement in performance has been observed
--------------------------------------------------------------------
Name = benchmark
Means = 441800.0, 429000.0
Time diff = 12800.0
Speedup = 1.02983682983683
T-Test (test statistic, p value, df) = 11.494739329713592, 2.973528840991338e-06, 8.0
T-Test Confidence Interval = 10232.142471284506, 15367.857528715494
ALERT: significant change has been detected (p-value < 0.05)
ALERT: improvement in performance has been observed

Overall, about a 3% improvement with the setting fully working, and significant differences in queries 16, 64, 94, and 95.
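
For anyone reading the tool output above, the reported numbers look internally consistent: Speedup matches the ratio of the two means, Time diff matches their difference, and the confidence interval matches a two-sided 95% interval around the difference with df = 8. Below is a minimal sketch that re-derives the query16 and overall figures. The function name `summarize`, the hard-coded critical value, and the 95% level are my assumptions, not something taken from the benchmark tool itself.

```python
# Student's t critical value for a two-sided 95% interval with df = 8.
# Assumption: the tool reports a 95% interval; the value is hard-coded
# so the sketch needs nothing beyond the standard library.
T_CRIT_95_DF8 = 2.306004


def summarize(mean_a, mean_b, t_stat):
    """Re-derive Time diff, Speedup, and the CI from two means and a t statistic."""
    diff = mean_a - mean_b            # "Time diff" in the output
    speedup = mean_a / mean_b         # "Speedup" in the output
    std_err = diff / t_stat           # standard error implied by the t statistic
    half_width = T_CRIT_95_DF8 * std_err
    return {
        "diff": diff,
        "speedup": speedup,
        "ci": (diff - half_width, diff + half_width),
    }


# Values copied from the query16 and benchmark entries above.
q16 = summarize(4938.2, 2899.4, 30.182426342750624)
overall = summarize(441800.0, 429000.0, 11.494739329713592)

print(q16)
print(overall)
```

The overall speedup of roughly 1.03 is where the 3% figure comes from.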

Mar 15 '24 20:03 mattahrens

@mattahrens A significant speedup is expected with just that setting, since it compares non-Bloom filter joins vs. Bloom filter joins, with no GPU fallbacks in either case.

For the purposes of estimating the fallback cost of not implementing the new Bloom filter type, we should compare

spark.sql.optimizer.runtime.bloomFilter.enabled=true

as a baseline vs.

spark.sql.optimizer.runtime.bloomFilter.enabled=true
spark.rapids.sql.expression.BloomFilterMightContain=false
spark.rapids.sql.expression.BloomFilterAggregate=false

This will compare Bloom filter joins with GPU acceleration vs. Bloom filter joins where the build and probe of the Bloom filter fall back to the CPU.
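
To make the two runs concrete, here is a minimal sketch that expresses both configurations as dictionaries and renders them as `--conf key=value` arguments. It assumes the benchmark is launched through spark-submit (or something that accepts the same flags); the names `BASELINE`, `FALLBACK`, and `to_conf_args` are only for illustration. The config keys are the ones listed above.

```python
# Baseline: Bloom filter joins enabled, Bloom filter expressions left on the GPU.
BASELINE = {
    "spark.sql.optimizer.runtime.bloomFilter.enabled": "true",
}

# Test: same as baseline, but the Bloom filter build and probe
# expressions are disabled on the GPU so they fall back to the CPU.
FALLBACK = {
    **BASELINE,
    "spark.rapids.sql.expression.BloomFilterMightContain": "false",
    "spark.rapids.sql.expression.BloomFilterAggregate": "false",
}


def to_conf_args(conf):
    """Render a config dict as a flat list of --conf key=value arguments."""
    args = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


print(" ".join(to_conf_args(BASELINE)))
print(" ".join(to_conf_args(FALLBACK)))
```

Deriving the second dictionary from the first keeps the Bloom filter setting as the only thing the two runs share, so any difference in timing should come from the two fallback settings.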

Mar 15 '24 20:03 jlowe

Ran that benchmark and it had a bigger impact (~6%):

Name = benchmark
Means = 454600.0, 427800.0
Time diff = 26800.0
Speedup = 1.0626460963066853
T-Test (test statistic, p value, df) = 23.50515491742838, 1.1413900109260772e-08, 8.0
T-Test Confidence Interval = 24170.750755057958, 29429.249244942042
ALERT: significant change has been detected (p-value < 0.05)
ALERT: improvement in performance has been observed

Regressions noted in queries 13, 16, 64, 80, 94, 95.

Mar 15 '24 23:03 mattahrens