spark
spark copied to clipboard
[WIP][SPARK-47353][SQL] Enable collation support for the Mode expression using a scala TreeMap (RB Tree)
What changes were proposed in this pull request?
Pull requests
Scala TreeMap (RB Tree) GroupMapReduce <- Most performant Comparing Experimental Approaches
Central Change to Mode eval
Algorithm:
- Update to
eval
Method: Theeval
method now checks if the column being looked at is string with non-default collation and if so, uses aTreeMap
to manage keys and counts. - TreeMap takes a comparator to determine equality, so supports an efficient combination of counts by keys equal under that comparator but that were not by the OpenHashMap, which uses the default physical equality.
Minor Change to Mode:
- Introduction of
collationId
: A new lazy valuecollationId
is computed from thedataType
of thechild
expression, used to fetch the appropriate collation comparator whencollationEnabled
is true.
Unit Test Enhancements: Significant additions to CollationStringExpressionsSuite
to test new functionality including:
- Tests for the
Mode
function when handling strings with different collation settings.
Benchmark Updates:
- Enhanced the
CollationBenchmark
classes to include benchmarks for the new mode functionality with and without collation settings, as well as numerical types.
Why are the changes needed?
- Ensures consistency in handling string comparisons under various collation settings.
- Improves global usability by enabling compatibility with different collation standards.
Does this PR introduce any user-facing change?
Yes, this PR introduces the following user-facing changes:
- Adds a new
collationEnabled
property to theMode
expression. - Users can now specify collation settings for the
Mode
expression to customize its behavior.
How was this patch tested?
This patch was tested through a combination of new and existing unit and end-to-end SQL tests.
-
Unit Tests:
-
CollationStringExpressionsSuite:
- TODO/WIP: Make the newly added tests more in the same design pattern as the existing tests
- Added multiple test cases to verify that the
Mode
function correctly handles strings with different collation settings. - Specifically, tests include:
- UTF8_BINARY_LCASE Collation: Verified that the function behaves as expected when processing lower-case-sensitive data.
- Null Handling: Included tests for correct handling of null values with various collation settings.
-
CollationStringExpressionsSuite:
-
Benchmark Tests:
- TODO DISCUSS PERFORMANCE
-
Manual Testing:
./build/mvn -DskipTests clean package
export SPARK_HOME=/Users/gideon/repos/spark
$SPARK_HOME/bin/spark-shell
spark.sqlContext.setConf("spark.sql.collation.enabled", "true")
import org.apache.spark.sql.types.StringType
import org.apache.spark.sql.functions
import spark.implicits._
val data = Seq(("Def"), ("def"), ("DEF"), ("abc"), ("abc"))
val df = data.toDF("word")
val dfLC = df.withColumn("word",
col("word").cast(StringType("UTF8_BINARY_LCASE")))
val dfLCA = dfLC.agg(org.apache.spark.sql.functions.mode(functions.col("word")).as("count"))
dfLCA.show()
/*
BEFORE:
-----+
|count|
+-----+
| abc|
+-----+
AFTER:
+-----+
|count|
+-----+
| Def|
+-----+
*/
-
Continuous Integration (CI):
- The patch passed all relevant Continuous Integration (CI) checks, including:
- Unit test suite
- Benchmark suite [TODO RUN AND ATTACH RESULTS]
- The patch passed all relevant Continuous Integration (CI) checks, including:
Was this patch authored or co-authored using generative AI tooling?
No, but the PR Description was co-authored with Chat-GPT.
What is left to do before taking off the WIP tag?
- [ ] Discuss/analyze performance in PR Description
- [ ] Make sure benchmarks invoke the function enough times to have non-zero values in the benchmark results
- [ ] Run benchmarks in GHA and add resultant benchmark reports to Pull Request
- [x] Make the newly added tests more in the same design pattern as the existing tests
- [x] Make the newly added tests cover all collations
- [ ] Figure out within group return value (EG if lower case collation, between Seq(("Def"), ("def"), ("def"), ("DEF"), ("abc"), ("abc"), ("abc")), is Def, DEF, or def the answer?
- Choices are
- First in data frame (Def)
- Within group Mode:
- Arbitrary/ Undefined
- Physical ordering Need to check but I think it is arbitrary in mode
@uros-db This is ready for the initial round of review. Let me know what you think!
Can you fill the PR description please
Can you fill the PR description please
@HyukjinKwon done.
@cloud-fan @MaxGekk @dbatomic just letting y'all know this is ready for first round of review. Thanks!
just another quick note based on your previous PRs - no need to force push and disrupt the commit history, in general I think it's easier to follow changes in this PR as we comment & resolve conversations, and look at commits as they come
kudos for your effort on this ticket! I left some comments
also, please try to ensure that your code doesn't introduce any breaking changes (for example, I see that
SQLQueryTestSuite
has a failing test forMode
) before requesting thorough reviewhowever, it's also good to verify the approach as soon as possible - so I would advise making these fixes and committing the benchmark results
@uros-db Thanks for the review! I hear you that I might have waited before I brought in the other reviewers.
Thanks for initial feedback!
just another quick note based on your previous PRs - no need to force push and disrupt the commit history, in general I think it's easier to follow changes in this PR as we comment & resolve conversations, and look at commits as they come
@uros-db The issue I run into is then I can't rebase with master realistically, as there will be a large number of rounds of conflict resolution, instead of one round before and after the squashed commit. Do you mind if I go back to continuously squashing all my commits, for this reason?