spark [SPARK-47415][SQL] Collation support: Levenshtein

What changes were proposed in this pull request?

Levenshtein distance requires modifications to accommodate collated inputs properly:

Allow collated inputs in inputTypes
Require same collations for input types in CollationTypeCasts
Comparison of byte substrings of UTF8String for equality is not enough anymore; we split checks of substrings (character) equality into classes for binary equality and semantic equality, latter is used with collations which don't support binary equality checks
Each collation gets substring equality checks statically in collation factory to reduce object allocation on each Levenshtein distance evaluation

Why are the changes needed?

To support Levenshtein distance expression with collations.

Does this PR introduce any user-facing change?

Yes, it changes behavior of Levenshtein distance expression when evaluated on collated strings.

How was this patch tested?

Added checks to CollationStringExpressionsSuite in addition to handful of existing Levenshtein checks.

Was this patch authored or co-authored using generative AI tooling?

No.

Apr 09 '24 18:04 nikolamand-db

Please review collation team @dbatomic @stefankandic @uros-db @mihailom-db @stevomitric.

Apr 10 '24 10:04 nikolamand-db

Please review latest iteration @dbatomic @stefankandic @uros-db @mihailom-db @stevomitric

May 10 '24 14:05 nebojsa-db

After a lengthy discussion we decided to stop Levenshtein collations effort. The well defined definition of Levenshtein as the minimum number of single-character edits to make the strings compare equal would be interpreted under collations as the minimum number of single-character edits to make the strings equal under given collation is very hard to achieve under non-exponential time.

Reason for this is that Levenshtein distance algorithm assumes we can compare single character in both strings for equality whereas under collations we cannot safely assume it is enough to compare single character. For example, sequences İ and i\u0307 compare equal under UTF8_BINARY_LCASE collation but have different number of characters (1 vs 2 in order).

Thanks @nebojsa-db and reviewers for your effort.

May 29 '24 09:05 nikolamand-db

spark spark copied to clipboard

[SPARK-47415][SQL] Collation support: Levenshtein

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

Was this patch authored or co-authored using generative AI tooling?

spark
spark copied to clipboard