spark icon indicating copy to clipboard operation
spark copied to clipboard

[WIP][SPARK-47353][SQL] Enable collation support for the Mode expression using GroupMapReduce

Open GideonPotok opened this issue 9 months ago • 0 comments

What changes were proposed in this pull request?

SPARK-47353

Pull requests

Scala TreeMap (RB Tree) GroupMapReduce <- Most performant Comparing Experimental Approaches

Central Change to Mode eval Algorithm:

  • Update to eval Method: The eval method now checks if the column being looked at is string with non-default collation and if so, uses a grouping
buff.toSeq.groupMapReduce {
        case (key: String, _) =>
          CollationFactory.getCollationKey(UTF8String.fromString(key), collationId)
        case (key: UTF8String, _) =>
          CollationFactory.getCollationKey(key, collationId)
        case (key, _) => key
      }(x => x)((x, y) => (x._1, x._2 + y._2)).values

Minor Change to Mode:

  • Introduction of collationId: A new lazy value collationId is computed from the dataType of the child expression, used to fetch the appropriate collation comparator when collationEnabled is true.

Unit Test Enhancements: Significant additions to CollationStringExpressionsSuite to test new functionality including:

  • Tests for the Mode function when handling strings with different collation settings.

Benchmark Updates:

  • Enhanced the CollationBenchmark classes to include benchmarks for the new mode functionality with and without collation settings, as well as numerical types.

Why are the changes needed?

  1. Ensures consistency in handling string comparisons under various collation settings.
  2. Improves global usability by enabling compatibility with different collation standards.

Does this PR introduce any user-facing change?

Yes, this PR introduces the following user-facing changes:

  1. Adds a new collationEnabled property to the Mode expression.
  2. Users can now specify collation settings for the Mode expression to customize its behavior.

How was this patch tested?

This patch was tested through a combination of new and existing unit and end-to-end SQL tests.

  1. Unit Tests:

    • CollationStringExpressionsSuite:
      • TODO/WIP: Make the newly added tests more in the same design pattern as the existing tests
    • Added multiple test cases to verify that the Mode function correctly handles strings with different collation settings.
    • Specifically, tests include:
      • UTF8_BINARY_LCASE Collation: Verified that the function behaves as expected when processing lower-case-sensitive data.
      • Null Handling: Included tests for correct handling of null values with various collation settings.
  2. Benchmark Tests:

    • TODO DISCUSS PERFORMANCE
  3. Manual Testing:

 ./build/mvn -DskipTests clean package 
export SPARK_HOME=/Users/gideon/repos/spark
$SPARK_HOME/bin/spark-shell
   spark.sqlContext.setConf("spark.sql.collation.enabled", "true")
    import org.apache.spark.sql.types.StringType
    import org.apache.spark.sql.functions
    import spark.implicits._
    val data = Seq(("Def"), ("def"), ("DEF"), ("abc"), ("abc"))
    val df = data.toDF("word")
    val dfLC = df.withColumn("word",
      col("word").cast(StringType("UTF8_BINARY_LCASE")))
    val dfLCA = dfLC.agg(org.apache.spark.sql.functions.mode(functions.col("word")).as("count"))
    dfLCA.show()
/*
BEFORE:
-----+
|count|
+-----+
|  abc|
+-----+

AFTER:
+-----+
|count|
+-----+
|  Def|
+-----+

*/
  1. Continuous Integration (CI):
    • The patch passed all relevant Continuous Integration (CI) checks, including:
      • Unit test suite
      • Benchmark suite [TODO RUN AND ATTACH RESULTS]

Was this patch authored or co-authored using generative AI tooling?

No, but the PR Description was co-authored with Chat-GPT.

What is left to do before taking off the WIP tag?

  • [ ] Discuss/analyze performance in PR Description
  • [ ] Make sure benchmarks invoke the function enough times to have non-zero values in the benchmark results
  • [ ] Run benchmarks in GHA and add resultant benchmark reports to Pull Request
  • [x] Make the newly added tests more in the same design pattern as the existing tests
  • [x] Make the newly added tests cover all collations
  • [ ] Figure out within group return value (EG if lower case collation, between Seq(("Def"), ("def"), ("def"), ("DEF"), ("abc"), ("abc"), ("abc")), is Def, DEF, or def the answer?
  • Choices are
  1. First in data frame (Def)
  2. Within group Mode:
  3. Arbitrary/ Undefined
  4. Physical ordering Need to check but I think it is arbitrary in mode

GideonPotok avatar May 10 '24 13:05 GideonPotok