spark icon indicating copy to clipboard operation
spark copied to clipboard

[SPARK-47415][SQL] Collation support: Levenshtein

Open nikolamand-db opened this issue 10 months ago • 2 comments

What changes were proposed in this pull request?

Levenshtein distance requires modifications to accommodate collated inputs properly:

  • Allow collated inputs in inputTypes
  • Require same collations for input types in CollationTypeCasts
  • Comparison of byte substrings of UTF8String for equality is not enough anymore; we split checks of substrings (character) equality into classes for binary equality and semantic equality, latter is used with collations which don't support binary equality checks
  • Each collation gets substring equality checks statically in collation factory to reduce object allocation on each Levenshtein distance evaluation

Why are the changes needed?

To support Levenshtein distance expression with collations.

Does this PR introduce any user-facing change?

Yes, it changes behavior of Levenshtein distance expression when evaluated on collated strings.

How was this patch tested?

Added checks to CollationStringExpressionsSuite in addition to handful of existing Levenshtein checks.

Was this patch authored or co-authored using generative AI tooling?

No.

nikolamand-db avatar Apr 09 '24 18:04 nikolamand-db

Please review collation team @dbatomic @stefankandic @uros-db @mihailom-db @stevomitric.

nikolamand-db avatar Apr 10 '24 10:04 nikolamand-db

Please review latest iteration @dbatomic @stefankandic @uros-db @mihailom-db @stevomitric

nebojsa-db avatar May 10 '24 14:05 nebojsa-db

After a lengthy discussion we decided to stop Levenshtein collations effort. The well defined definition of Levenshtein as the minimum number of single-character edits to make the strings compare equal would be interpreted under collations as the minimum number of single-character edits to make the strings equal under given collation is very hard to achieve under non-exponential time.

Reason for this is that Levenshtein distance algorithm assumes we can compare single character in both strings for equality whereas under collations we cannot safely assume it is enough to compare single character. For example, sequences İ and i\u0307 compare equal under UTF8_BINARY_LCASE collation but have different number of characters (1 vs 2 in order).

Thanks @nebojsa-db and reviewers for your effort.

nikolamand-db avatar May 29 '24 09:05 nikolamand-db