spark
spark copied to clipboard
[WIP][SPARK-47353][SQL] Enable collation support for the Mode expression using GroupMapReduce
What changes were proposed in this pull request?
Pull requests
Scala TreeMap (RB Tree) GroupMapReduce <- Most performant Comparing Experimental Approaches
Central Change to Mode eval
Algorithm:
- Update to
eval
Method: Theeval
method now checks if the column being looked at is string with non-default collation and if so, uses a grouping
buff.toSeq.groupMapReduce {
case (key: String, _) =>
CollationFactory.getCollationKey(UTF8String.fromString(key), collationId)
case (key: UTF8String, _) =>
CollationFactory.getCollationKey(key, collationId)
case (key, _) => key
}(x => x)((x, y) => (x._1, x._2 + y._2)).values
Minor Change to Mode:
- Introduction of
collationId
: A new lazy valuecollationId
is computed from thedataType
of thechild
expression, used to fetch the appropriate collation comparator whencollationEnabled
is true.
Unit Test Enhancements: Significant additions to CollationStringExpressionsSuite
to test new functionality including:
- Tests for the
Mode
function when handling strings with different collation settings.
Benchmark Updates:
- Enhanced the
CollationBenchmark
classes to include benchmarks for the new mode functionality with and without collation settings, as well as numerical types.
Why are the changes needed?
- Ensures consistency in handling string comparisons under various collation settings.
- Improves global usability by enabling compatibility with different collation standards.
Does this PR introduce any user-facing change?
Yes, this PR introduces the following user-facing changes:
- Adds a new
collationEnabled
property to theMode
expression. - Users can now specify collation settings for the
Mode
expression to customize its behavior.
How was this patch tested?
This patch was tested through a combination of new and existing unit and end-to-end SQL tests.
-
Unit Tests:
-
CollationStringExpressionsSuite:
- TODO/WIP: Make the newly added tests more in the same design pattern as the existing tests
- Added multiple test cases to verify that the
Mode
function correctly handles strings with different collation settings. - Specifically, tests include:
- UTF8_BINARY_LCASE Collation: Verified that the function behaves as expected when processing lower-case-sensitive data.
- Null Handling: Included tests for correct handling of null values with various collation settings.
-
CollationStringExpressionsSuite:
-
Benchmark Tests:
- TODO DISCUSS PERFORMANCE
-
Manual Testing:
./build/mvn -DskipTests clean package
export SPARK_HOME=/Users/gideon/repos/spark
$SPARK_HOME/bin/spark-shell
spark.sqlContext.setConf("spark.sql.collation.enabled", "true")
import org.apache.spark.sql.types.StringType
import org.apache.spark.sql.functions
import spark.implicits._
val data = Seq(("Def"), ("def"), ("DEF"), ("abc"), ("abc"))
val df = data.toDF("word")
val dfLC = df.withColumn("word",
col("word").cast(StringType("UTF8_BINARY_LCASE")))
val dfLCA = dfLC.agg(org.apache.spark.sql.functions.mode(functions.col("word")).as("count"))
dfLCA.show()
/*
BEFORE:
-----+
|count|
+-----+
| abc|
+-----+
AFTER:
+-----+
|count|
+-----+
| Def|
+-----+
*/
-
Continuous Integration (CI):
- The patch passed all relevant Continuous Integration (CI) checks, including:
- Unit test suite
- Benchmark suite [TODO RUN AND ATTACH RESULTS]
- The patch passed all relevant Continuous Integration (CI) checks, including:
Was this patch authored or co-authored using generative AI tooling?
No, but the PR Description was co-authored with Chat-GPT.
What is left to do before taking off the WIP tag?
- [ ] Discuss/analyze performance in PR Description
- [ ] Make sure benchmarks invoke the function enough times to have non-zero values in the benchmark results
- [ ] Run benchmarks in GHA and add resultant benchmark reports to Pull Request
- [x] Make the newly added tests more in the same design pattern as the existing tests
- [x] Make the newly added tests cover all collations
- [ ] Figure out within group return value (EG if lower case collation, between Seq(("Def"), ("def"), ("def"), ("DEF"), ("abc"), ("abc"), ("abc")), is Def, DEF, or def the answer?
- Choices are
- First in data frame (Def)
- Within group Mode:
- Arbitrary/ Undefined
- Physical ordering Need to check but I think it is arbitrary in mode