
Add support for collated strings in OpConverter

Open stefankandic opened this issue 1 month ago • 4 comments

Summary

Since its 4.0 release, Spark supports parameterizing StringType with collations, which define how string data is compared. This PR adds backwards-compatible support for passing collation information in Delta Sharing predicate expressions, with version-specific implementations for Spark 3.5 (Scala 2.12) and Spark 4.0 (Scala 2.13).

Changes

Core Implementation

  1. Added ExprContext case class to hold expression metadata, specifically collationIdentifier for string comparisons
  2. Extended comparison operations to accept an optional exprCtx parameter:
       - EqualOp, LessThanOp, LessThanOrEqualOp, GreaterThanOp, GreaterThanOrEqualOp
       - Also applies to In expressions, which are converted to EqualOp chains
  3. Created version-specific CollationExtractor implementations:
       - Scala 2.13 (Spark 4.0): Extracts collation information from Spark's StringType and populates collationIdentifier in the format provider.collationName.icuVersion (e.g., icu.UNICODE_CI.75.1, spark.UTF8_LCASE.75.1)
       - Scala 2.12 (Spark 3.5): Does not create a collationIdentifier and instead defaults to UTF8_BINARY comparisons, as collations are just a writer feature in Delta.
  4. Updated OpConverter to call CollationExtractor.extractCollationIdentifier() to extract collation information
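
The data model in the list above can be sketched roughly as follows. This is a simplified, self-contained model and not the PR's actual code: only the names ExprContext, collationIdentifier, EqualOp and exprCtx come from the description, while BaseOp, ColumnOp, LiteralOp, OrOp and the convertIn helper are hypothetical stand-ins used for illustration.

```scala
// Hypothetical, simplified stand-ins for the client's predicate op classes.
final case class ExprContext(collationIdentifier: Option[String] = None)

sealed trait BaseOp
final case class ColumnOp(name: String, valueType: String) extends BaseOp
final case class LiteralOp(value: String, valueType: String) extends BaseOp

// exprCtx defaults to None, so existing EqualOp(children) call sites still compile.
final case class EqualOp(
    children: Seq[BaseOp],
    exprCtx: Option[ExprContext] = None) extends BaseOp

final case class OrOp(children: Seq[BaseOp]) extends BaseOp

object InConversionSketch {
  // Rewrites `column IN (v1, v2, ...)` as a chain of equality checks,
  // attaching the same optional context to every EqualOp.
  def convertIn(
      column: ColumnOp,
      values: Seq[String],
      exprCtx: Option[ExprContext] = None): BaseOp = {
    require(values.nonEmpty, "IN list must not be empty")
    val equalities: Seq[BaseOp] = values.map { v =>
      EqualOp(Seq(column, LiteralOp(v, column.valueType)), exprCtx)
    }
    // A single value needs no OR wrapper.
    if (equalities.size == 1) equalities.head else OrOp(equalities)
  }
}
```

Calling convertIn with Some(ExprContext(Some("icu.UNICODE_CI.75.1"))) would tag every generated EqualOp with that identifier, while omitting the argument yields the same ops as before the change.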

Backwards Compatibility

  • The exprCtx parameter is optional (Option[ExprContext] = None), ensuring existing code continues to work
  • The valueType field remains as plain "string" (not "string collate "), maintaining compatibility with older clients
  • Collation information is stored separately in ExprContext, allowing non-collation-aware servers to ignore it
  • Default UTF8_BINARY collations (non-collated strings) work on both Spark 3.5 and 4.0
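
To illustrate why the points above keep older readers working, here is a small sketch that models a parsed predicate as a generic map and reads it the way a non-collation-aware consumer might. The keys op, children, name, value and valueType follow the existing predicate hint format; the placement of exprCtx on the comparison node, the column name c1, and the helper names are assumptions made for this example.

```scala
object LegacyReaderSketch {
  // A predicate node as a generic key/value map, standing in for parsed JSON.
  type Node = Map[String, Any]

  // Assumed shape: exprCtx sits next to "op" and "children"; valueType stays "string".
  val collatedPredicate: Node = Map(
    "op" -> "equal",
    "exprCtx" -> Map("collationIdentifier" -> "icu.UNICODE_CI.75.1"),
    "children" -> Seq(
      Map("op" -> "column", "name" -> "c1", "valueType" -> "string"),
      Map("op" -> "literal", "value" -> "abc", "valueType" -> "string")))

  // An older reader only looks at the keys it knows and never touches exprCtx.
  def opName(node: Node): String = node("op").toString

  def childValueTypes(node: Node): Seq[String] =
    node.get("children") match {
      case Some(children: Seq[_]) =>
        children.collect { case m: Map[_, _] =>
          m.asInstanceOf[Map[String, Any]].get("valueType").map(_.toString)
        }.flatten
      case _ => Seq.empty
    }
}
```

Because the value types are still plain "string" and the extra key is simply never read, such a reader would see the same predicate it saw before the change.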

Validation

Added safety checks to prevent invalid comparisons:

  • Throws IllegalArgumentException when comparing strings with different collations
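
A check of this kind could look roughly like the sketch below. The object and method names are hypothetical, and treating a missing identifier as UTF8_BINARY (and therefore as different from any explicit collation) is an assumption, not something the description states.

```scala
object CollationValidationSketch {
  // Returns the collation shared by both operands.
  // None is assumed to mean the default UTF8_BINARY collation.
  def checkSameCollation(
      left: Option[String],
      right: Option[String]): Option[String] = {
    if (left != right) {
      throw new IllegalArgumentException(
        "Cannot compare strings with different collations: " +
          s"${left.getOrElse("UTF8_BINARY")} vs ${right.getOrElse("UTF8_BINARY")}")
    }
    left
  }
}
```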

Protocol Documentation

Updated PROTOCOL.md to document the new exprCtx field and collationIdentifier format with examples.

stefankandic, Nov 19 '25 14:11

Is this ready for review?

linzhou-db, Nov 21 '25 06:11

Yes, it should be, although it is still not clear how this change can pass the 2.12 checks, given that it relies on changes added in Spark 4.0.

stefankandic, Nov 21 '25 16:11

cc @littlegrasscao

linzhou-db, Nov 21 '25 17:11

If the change applies only to 2.13 and not to 2.12, you would need to make two copies of the files: one in the 2.13 folder with the new change applied, and one in the 2.12 folder that keeps the old code.

Check out examples like: client/src/main/scala-2.13/org/apache/spark/sql/DeltaSharingScanUtils.scala
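
A rough sketch of what the two copies might contain for CollationExtractor is shown below. In the actual layout both variants would presumably share one object name and live in separate files under client/src/main/scala-2.12/ and client/src/main/scala-2.13/, so that only one is compiled per Scala version; here they carry distinct, hypothetical names so the block compiles as a single unit. The three-string signature is also an assumption, since the real method likely takes a Spark data type.

```scala
// Shared shape of the two variants (hypothetical trait, for illustration only).
trait CollationExtractorLike {
  def extractCollationIdentifier(
      provider: String,
      collationName: String,
      icuVersion: String): Option[String]
}

// Copy for the scala-2.12 folder (Spark 3.5): no collation support,
// so no identifier is produced and comparisons stay UTF8_BINARY.
object CollationExtractor212 extends CollationExtractorLike {
  def extractCollationIdentifier(
      provider: String,
      collationName: String,
      icuVersion: String): Option[String] = None
}

// Copy for the scala-2.13 folder (Spark 4.0): builds the identifier
// in the provider.collationName.icuVersion format.
object CollationExtractor213 extends CollationExtractorLike {
  def extractCollationIdentifier(
      provider: String,
      collationName: String,
      icuVersion: String): Option[String] =
    Some(s"$provider.$collationName.$icuVersion")
}
```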

littlegrasscao, Nov 21 '25 21:11