feat: Implement bloom_filter_agg

Open mbutrovich opened this issue 4 months ago • 7 comments

Which issue does this PR close?

Closes #846.

Rationale for this change

What changes are included in this PR?

  • Native implementation (bloom_filter_agg.rs) built on DataFusion's Accumulator trait. There is no GroupsAccumulator implementation yet; it is left as a possible future optimization.
  • Serde logic (planner.rs, QueryPlanSerde.scala)
  • Serialization and merging logic for underlying data structures (spark_bloom_filter.rs, spark_bit_array.rs)
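
To illustrate the merging logic mentioned above, here is a minimal sketch of how two partial Bloom filters might be combined by OR-ing their bit arrays. It is not the code in spark_bloom_filter.rs or spark_bit_array.rs: the type names `BitArray` and `BloomFilter`, the method names, and the use of the standard library's `DefaultHasher` are all assumptions made for illustration, and the sketch does not reproduce Spark's hashing or serialization format.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical bit array backed by 64-bit words (illustrative only).
#[derive(Clone, Debug, PartialEq)]
struct BitArray {
    words: Vec<u64>,
}

impl BitArray {
    /// Allocates enough 64-bit words to hold at least `num_bits` bits.
    fn new(num_bits: usize) -> Self {
        BitArray { words: vec![0; (num_bits + 63) / 64] }
    }

    /// Total number of addressable bits.
    fn bit_size(&self) -> usize {
        self.words.len() * 64
    }

    fn set(&mut self, index: usize) {
        self.words[index / 64] |= 1u64 << (index % 64);
    }

    fn get(&self, index: usize) -> bool {
        self.words[index / 64] & (1u64 << (index % 64)) != 0
    }

    /// Merging is a word-by-word bitwise OR; both arrays must be the same size.
    fn merge(&mut self, other: &BitArray) {
        assert_eq!(self.words.len(), other.words.len(), "bit arrays must have equal size");
        for (a, b) in self.words.iter_mut().zip(&other.words) {
            *a |= *b;
        }
    }
}

/// Hypothetical Bloom filter over i64 values (illustrative only).
struct BloomFilter {
    bits: BitArray,
    num_hashes: u32,
}

impl BloomFilter {
    fn new(num_bits: usize, num_hashes: u32) -> Self {
        BloomFilter { bits: BitArray::new(num_bits), num_hashes }
    }

    /// Derives one bit index per hash function by hashing (seed, item).
    /// Assumption: DefaultHasher is a stand-in, not a Spark-compatible hash.
    fn indexes(&self, item: i64) -> Vec<usize> {
        (0..self.num_hashes)
            .map(|seed| {
                let mut hasher = DefaultHasher::new();
                (seed, item).hash(&mut hasher);
                (hasher.finish() % self.bits.bit_size() as u64) as usize
            })
            .collect()
    }

    fn put_long(&mut self, item: i64) {
        for index in self.indexes(item) {
            self.bits.set(index);
        }
    }

    /// Returns false only if the item was definitely never inserted.
    fn might_contain_long(&self, item: i64) -> bool {
        self.indexes(item).into_iter().all(|index| self.bits.get(index))
    }

    /// Combines another filter built with the same parameters into this one.
    fn merge(&mut self, other: &BloomFilter) {
        assert_eq!(self.num_hashes, other.num_hashes, "hash counts must match");
        self.bits.merge(&other.bits);
    }
}
```

In an aggregation, each partial stage would build its own filter with `put_long`, and the final stage would call `merge` on the partial results. Because merging only sets bits, every value inserted into either input is still reported by `might_contain_long` afterwards.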

How are these changes tested?

  • New test in CometExecSuite
  • Spark tests in CI exercise this aggregation
  • Scala benchmark to compare against the Spark code path
  • Native benchmark for partial and final aggregation modes
  • Native tests for the new bit array logic (spark_bit_array.rs)

mbutrovich, Sep 30 '24 19:09