spark
spark copied to clipboard
[SPARK-47415][SQL] Collation support: Levenshtein
What changes were proposed in this pull request?
Levenshtein distance requires modifications to accommodate collated inputs properly:
- Allow collated inputs in
inputTypes
- Require same collations for input types in
CollationTypeCasts
- Comparison of byte substrings of
UTF8String
for equality is not enough anymore; we split checks of substrings (character) equality into classes for binary equality and semantic equality, latter is used with collations which don't support binary equality checks - Each collation gets substring equality checks statically in collation factory to reduce object allocation on each Levenshtein distance evaluation
Why are the changes needed?
To support Levenshtein distance expression with collations.
Does this PR introduce any user-facing change?
Yes, it changes behavior of Levenshtein distance expression when evaluated on collated strings.
How was this patch tested?
Added checks to CollationStringExpressionsSuite
in addition to handful of existing Levenshtein checks.
Was this patch authored or co-authored using generative AI tooling?
No.
Please review collation team @dbatomic @stefankandic @uros-db @mihailom-db @stevomitric.
Please review latest iteration @dbatomic @stefankandic @uros-db @mihailom-db @stevomitric
After a lengthy discussion we decided to stop Levenshtein collations effort. The well defined definition of Levenshtein as the minimum number of single-character edits to make the strings compare equal would be interpreted under collations as the minimum number of single-character edits to make the strings equal under given collation is very hard to achieve under non-exponential time.
Reason for this is that Levenshtein distance algorithm assumes we can compare single character in both strings for equality whereas under collations we cannot safely assume it is enough to compare single character. For example, sequences İ
and i\u0307
compare equal under UTF8_BINARY_LCASE
collation but have different number of characters (1 vs 2 in order).
Thanks @nebojsa-db and reviewers for your effort.