spark icon indicating copy to clipboard operation
spark copied to clipboard

[WIP][SPARK-47353][SQL] Enable collation support for the Mode expression using a scala TreeMap (RB Tree)

Open GideonPotok opened this issue 9 months ago • 7 comments

What changes were proposed in this pull request?

SPARK-47353

Pull requests

Scala TreeMap (RB Tree) GroupMapReduce <- Most performant Comparing Experimental Approaches

Central Change to Mode eval Algorithm:

  • Update to eval Method: The eval method now checks if the column being looked at is string with non-default collation and if so, uses a TreeMap to manage keys and counts. - TreeMap takes a comparator to determine equality, so supports an efficient combination of counts by keys equal under that comparator but that were not by the OpenHashMap, which uses the default physical equality.

Minor Change to Mode:

  • Introduction of collationId: A new lazy value collationId is computed from the dataType of the child expression, used to fetch the appropriate collation comparator when collationEnabled is true.

Unit Test Enhancements: Significant additions to CollationStringExpressionsSuite to test new functionality including:

  • Tests for the Mode function when handling strings with different collation settings.

Benchmark Updates:

  • Enhanced the CollationBenchmark classes to include benchmarks for the new mode functionality with and without collation settings, as well as numerical types.

Why are the changes needed?

  1. Ensures consistency in handling string comparisons under various collation settings.
  2. Improves global usability by enabling compatibility with different collation standards.

Does this PR introduce any user-facing change?

Yes, this PR introduces the following user-facing changes:

  1. Adds a new collationEnabled property to the Mode expression.
  2. Users can now specify collation settings for the Mode expression to customize its behavior.

How was this patch tested?

This patch was tested through a combination of new and existing unit and end-to-end SQL tests.

  1. Unit Tests:

    • CollationStringExpressionsSuite:
      • TODO/WIP: Make the newly added tests more in the same design pattern as the existing tests
    • Added multiple test cases to verify that the Mode function correctly handles strings with different collation settings.
    • Specifically, tests include:
      • UTF8_BINARY_LCASE Collation: Verified that the function behaves as expected when processing lower-case-sensitive data.
      • Null Handling: Included tests for correct handling of null values with various collation settings.
  2. Benchmark Tests:

    • TODO DISCUSS PERFORMANCE
  3. Manual Testing:

 ./build/mvn -DskipTests clean package 
export SPARK_HOME=/Users/gideon/repos/spark
$SPARK_HOME/bin/spark-shell
   spark.sqlContext.setConf("spark.sql.collation.enabled", "true")
    import org.apache.spark.sql.types.StringType
    import org.apache.spark.sql.functions
    import spark.implicits._
    val data = Seq(("Def"), ("def"), ("DEF"), ("abc"), ("abc"))
    val df = data.toDF("word")
    val dfLC = df.withColumn("word",
      col("word").cast(StringType("UTF8_BINARY_LCASE")))
    val dfLCA = dfLC.agg(org.apache.spark.sql.functions.mode(functions.col("word")).as("count"))
    dfLCA.show()
/*
BEFORE:
-----+
|count|
+-----+
|  abc|
+-----+

AFTER:
+-----+
|count|
+-----+
|  Def|
+-----+

*/
  1. Continuous Integration (CI):
    • The patch passed all relevant Continuous Integration (CI) checks, including:
      • Unit test suite
      • Benchmark suite [TODO RUN AND ATTACH RESULTS]

Was this patch authored or co-authored using generative AI tooling?

No, but the PR Description was co-authored with Chat-GPT.

What is left to do before taking off the WIP tag?

  • [ ] Discuss/analyze performance in PR Description
  • [ ] Make sure benchmarks invoke the function enough times to have non-zero values in the benchmark results
  • [ ] Run benchmarks in GHA and add resultant benchmark reports to Pull Request
  • [x] Make the newly added tests more in the same design pattern as the existing tests
  • [x] Make the newly added tests cover all collations
  • [ ] Figure out within group return value (EG if lower case collation, between Seq(("Def"), ("def"), ("def"), ("DEF"), ("abc"), ("abc"), ("abc")), is Def, DEF, or def the answer?
  • Choices are
  1. First in data frame (Def)
  2. Within group Mode:
  3. Arbitrary/ Undefined
  4. Physical ordering Need to check but I think it is arbitrary in mode

GideonPotok avatar May 06 '24 15:05 GideonPotok

@uros-db This is ready for the initial round of review. Let me know what you think!

GideonPotok avatar May 06 '24 15:05 GideonPotok

Can you fill the PR description please

HyukjinKwon avatar May 07 '24 02:05 HyukjinKwon

Can you fill the PR description please

@HyukjinKwon done.

GideonPotok avatar May 07 '24 13:05 GideonPotok

@cloud-fan @MaxGekk @dbatomic just letting y'all know this is ready for first round of review. Thanks!

GideonPotok avatar May 07 '24 17:05 GideonPotok

just another quick note based on your previous PRs - no need to force push and disrupt the commit history, in general I think it's easier to follow changes in this PR as we comment & resolve conversations, and look at commits as they come

uros-db avatar May 07 '24 18:05 uros-db

kudos for your effort on this ticket! I left some comments

also, please try to ensure that your code doesn't introduce any breaking changes (for example, I see that SQLQueryTestSuite has a failing test for Mode) before requesting thorough review

however, it's also good to verify the approach as soon as possible - so I would advise making these fixes and committing the benchmark results

@uros-db Thanks for the review! I hear you that I might have waited before I brought in the other reviewers.

Thanks for initial feedback!

GideonPotok avatar May 07 '24 18:05 GideonPotok

just another quick note based on your previous PRs - no need to force push and disrupt the commit history, in general I think it's easier to follow changes in this PR as we comment & resolve conversations, and look at commits as they come

@uros-db The issue I run into is then I can't rebase with master realistically, as there will be a large number of rounds of conflict resolution, instead of one round before and after the squashed commit. Do you mind if I go back to continuously squashing all my commits, for this reason?

GideonPotok avatar May 09 '24 17:05 GideonPotok