Add support for collated strings in OpConverter
Summary
Since the 4.0 release, Spark supports parametrizing `StringType` with different collations, which define how string data is compared. This PR adds backwards-compatible support for passing collation information in Delta Sharing predicate expressions, with version-specific implementations for Spark 3.5 (Scala 2.12) and Spark 4.0 (Scala 2.13).
Changes
Core Implementation
- Added an `ExprContext` case class to hold expression metadata, specifically a `collationIdentifier` for string comparisons (see the sketch after this list)
- Extended comparison operations to accept an optional `exprCtx` parameter:
  - `EqualOp`, `LessThanOp`, `LessThanOrEqualOp`, `GreaterThanOp`, `GreaterThanOrEqualOp`
  - Also applies to `In` expressions, which are converted to `EqualOp` chains
- Created version-specific `CollationExtractor` implementations:
  - Scala 2.13 (Spark 4.0): Extracts collation information from Spark's `StringType` and populates `collationIdentifier` in the format `provider.collationName.icuVersion` (e.g., `icu.UNICODE_CI.75.1`, `spark.UTF8_LCASE.75.1`)
  - Scala 2.12 (Spark 3.5): Does not create a `collationIdentifier` and instead defaults to `UTF8_BINARY` comparisons, as collations are just a writer feature in Delta
- Updated `OpConverter` to call `CollationExtractor.extractCollationIdentifier()` to extract collation information
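A minimal sketch of the shapes described above. `BaseOp`, `ColumnOp`, and `LiteralOp` are simplified stand-ins for the existing op hierarchy; only `ExprContext`, `collationIdentifier`, and the optional `exprCtx` parameter follow the PR text, and the real signatures may differ:

```scala
// Expression metadata attached to comparison ops; for now it only
// carries the collation used for string comparisons.
case class ExprContext(collationIdentifier: Option[String] = None)

// Simplified stand-ins for the predicate op hierarchy.
sealed trait BaseOp
case class ColumnOp(name: String, valueType: String) extends BaseOp
case class LiteralOp(value: String, valueType: String) extends BaseOp

// The new parameter defaults to None, so existing call sites and
// previously serialized predicates keep working unchanged.
case class EqualOp(
    children: Seq[BaseOp],
    exprCtx: Option[ExprContext] = None) extends BaseOp
```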
Backwards Compatibility
- The `exprCtx` parameter is optional (`Option[ExprContext] = None`), ensuring existing code continues to work (see the example after this list)
- The `valueType` field remains plain `"string"` (not `"string collate <collation>"`), maintaining compatibility with older clients
- Collation information is stored separately in `ExprContext`, allowing non-collation-aware servers to ignore it
- Default `UTF8_BINARY` collations (non-collated strings) work on both Spark 3.5 and 4.0
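Using the sketched types above, a hedged illustration of how old and new call sites coexist (the column and literal values here are made up):

```scala
// Pre-collation call site: no exprCtx, identical to today's behavior.
val legacy = EqualOp(Seq(
  ColumnOp(name = "city", valueType = "string"),
  LiteralOp(value = "Berlin", valueType = "string")))

// Collation-aware call site: valueType stays plain "string"; the
// collation travels separately, so older servers can simply ignore it.
val collated = EqualOp(
  children = Seq(
    ColumnOp(name = "city", valueType = "string"),
    LiteralOp(value = "Berlin", valueType = "string")),
  exprCtx = Some(ExprContext(
    collationIdentifier = Some("icu.UNICODE_CI.75.1"))))
```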
Validation
Added safety checks to prevent invalid comparisons:
- Throws `IllegalArgumentException` when comparing strings with different collations (sketched below)
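One way such a check could look; this is a sketch only, and the helper name and signature are assumptions rather than the converter's actual code:

```scala
// Hypothetical guard: refuse to build a comparison op when the two
// sides resolve to different collations. None stands for the default
// UTF8_BINARY collation.
def requireMatchingCollations(
    left: Option[String], right: Option[String]): Unit = {
  if (left != right) {
    throw new IllegalArgumentException(
      "Cannot compare strings with different collations: " +
        s"${left.getOrElse("UTF8_BINARY")} vs ${right.getOrElse("UTF8_BINARY")}")
  }
}
```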
Protocol Documentation
Updated PROTOCOL.md to document the new `exprCtx` field and `collationIdentifier` format with examples.
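For illustration, a predicate carrying the new field might serialize along these lines; the `exprCtx` placement is hedged from the PR text, and PROTOCOL.md remains the authoritative reference:

```scala
// Hypothetical JSON predicate; only exprCtx and collationIdentifier
// are new, the rest follows the existing JSON predicate format.
val predicateJson: String =
  """{
    |  "op": "equal",
    |  "children": [
    |    {"op": "column", "name": "city", "valueType": "string"},
    |    {"op": "literal", "value": "Berlin", "valueType": "string"}
    |  ],
    |  "exprCtx": {"collationIdentifier": "icu.UNICODE_CI.75.1"}
    |}""".stripMargin
```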
Is this ready for review?
Yes, it should be. Although it is still not clear how we can get this change to pass the 2.12 checks, given that it uses features added in Spark 4.0.
cc @littlegrasscao
If the change only applies to 2.13 and not 2.12, you would need to make two copies of the file: one in the 2.13 folder with the new change applied, and one in the 2.12 folder that keeps the old code.
Check out examples like: client/src/main/scala-2.13/org/apache/spark/sql/DeltaSharingScanUtils.scala
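Concretely, the suggested pattern is two copies of the same object under version-specific source roots, which sbt picks up automatically based on the Scala binary version. A sketch with illustrative file paths and a hypothetical helper standing in for the Spark 4.0 collation lookup:

```scala
// client/src/main/scala-2.12/.../CollationExtractor.scala (path illustrative)
// Spark 3.5 has no collated StringType, so the 2.12 copy reports no
// collation and comparisons default to UTF8_BINARY semantics.
object CollationExtractor {
  import org.apache.spark.sql.types.DataType
  def extractCollationIdentifier(dataType: DataType): Option[String] = None
}
```

```scala
// client/src/main/scala-2.13/.../CollationExtractor.scala (path illustrative)
// Spark 4.0: collated StringType instances carry their collation, which
// the PR renders as "provider.collationName.icuVersion".
object CollationExtractor {
  import org.apache.spark.sql.types.{DataType, StringType}

  def extractCollationIdentifier(dataType: DataType): Option[String] =
    dataType match {
      case st: StringType => Some(collationIdentifierOf(st))
      case _              => None
    }

  // Hypothetical stand-in for the actual Spark 4.0 collation lookup;
  // the real code may also skip the default UTF8_BINARY collation.
  private def collationIdentifierOf(st: StringType): String = ???
}
```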