spark icon indicating copy to clipboard operation
spark copied to clipboard

[SPARK-48682][SQL][FOLLOW-UP] Changed initCap behaviour with UTF8_BINARY collation

Open viktorluc-db opened this issue 1 year ago • 2 comments

What changes were proposed in this pull request?

Changing the way that spark does initCap with respect to UTF8_BINARY collation. In this PR, initCap titlecases the first character of every word, and lowercases every other character. Words are separated only by ASCII space. Special care is taken when lowercasing Σ, to take into account if it is at the end of the word(with respect to case-ignorable characters) and should be lowercased into ς, or in other case into σ(this already works correctly with the current implementation because lowercasing a whole string handled this, but in this PR this was handled manually because lowercase function wasn't used).

The key difference between outputs that this PR introduces is:

input current_initCap(input) new_initCap(input)
İo İo (I\u0307o) İo
ß fi ffi ff st ß fi ffi ff st Ss Fi Ffi Ff St

These are just some examples, much more mappings are actually affected. More details about the key changes are in the next section. This behaviour is put under the ICU_CASE_MAPPINGS_ENABLED flag in SQLConf, which is true by default.

Why are the changes needed?

The previous implementation first lowercases the complete string, and then titlecases the first character of every word[1]. When titlecasing the first character of every word, it maps a single codepoint to a single codepoint[2].

This leads to the following behaviour with respect to [1]:

input initCap(input)
İo İo (I\u0307o)

In summary, when the lowercase of a first character(for example "İ") in a word maps onto more than 1 character(for example "I\u0307"), we only consider the first character("I" in "I\u0307") of that lowercased letter("İ") for titlecasing instead of that complete character because we titlecase only the first character in a word after we completely lowercase it.

The behaviour that [2] produces is:

input initCap(input)
ß fi ffi ff st ß fi ffi ff st

While the expected output would probably be:

input initCap(input)
ß fi ffi ff st Ss Fi Ffi Ff St

which clearly maps titlecase of each of those characters into more than one character, which is not handled because of [2].

Again, these are just examples and not an exhaustive list of all the mappings that have been changed.

Does this PR introduce any user-facing change?

Yes, InitCap expression will now return different results for:

  • One-to-many case mapping (e.g. Turkish dotted I, ß, fi)

How was this patch tested?

Tests in CollationSupportSuite.

Was this patch authored or co-authored using generative AI tooling?

No.

viktorluc-db avatar Aug 15 '24 13:08 viktorluc-db

@viktorluc-db I think the PR's description is a bit misleading. When I read it I feel like the changes are affecting only the ligatures listed, but in fact it is way more (basically we change how we titlecase 48 different characters). I think it would be nice to explain in the PR description that these are just some examples where the behavior is different).

mkaravel avatar Aug 23 '24 21:08 mkaravel

@MaxGekk ready for review

viktorluc-db avatar Aug 27 '24 13:08 viktorluc-db

@MaxGekk PTAL

uros-db avatar Aug 28 '24 14:08 uros-db