spark [SPARK-48682][SQL][FOLLOW-UP] Changed initCap behaviour with UTF8

What changes were proposed in this pull request?

Changing the way that spark does initCap with respect to UTF8_BINARY collation. In this PR, initCap titlecases the first character of every word, and lowercases every other character. Words are separated only by ASCII space. Special care is taken when lowercasing Σ, to take into account if it is at the end of the word(with respect to case-ignorable characters) and should be lowercased into ς, or in other case into σ(this already works correctly with the current implementation because lowercasing a whole string handled this, but in this PR this was handled manually because lowercase function wasn't used).

The key difference between outputs that this PR introduces is:

input	current_initCap(input)	new_initCap(input)
İo	İo (I\u0307o)	İo
ß ﬁ ﬃ ﬀ ﬆ	ß ﬁ ﬃ ﬀ ﬆ	Ss Fi Ffi Ff St

These are just some examples, much more mappings are actually affected. More details about the key changes are in the next section. This behaviour is put under the ICU_CASE_MAPPINGS_ENABLED flag in SQLConf, which is true by default.

Why are the changes needed?

The previous implementation first lowercases the complete string, and then titlecases the first character of every word[1]. When titlecasing the first character of every word, it maps a single codepoint to a single codepoint[2].

This leads to the following behaviour with respect to [1]:

input	initCap(input)
İo	İo (I\u0307o)

In summary, when the lowercase of a first character(for example "İ") in a word maps onto more than 1 character(for example "I\u0307"), we only consider the first character("I" in "I\u0307") of that lowercased letter("İ") for titlecasing instead of that complete character because we titlecase only the first character in a word after we completely lowercase it.

The behaviour that [2] produces is:

input	initCap(input)
ß ﬁ ﬃ ﬀ ﬆ	ß ﬁ ﬃ ﬀ ﬆ

While the expected output would probably be:

input	initCap(input)
ß ﬁ ﬃ ﬀ ﬆ	Ss Fi Ffi Ff St

which clearly maps titlecase of each of those characters into more than one character, which is not handled because of [2].

Again, these are just examples and not an exhaustive list of all the mappings that have been changed.

Does this PR introduce any user-facing change?

Yes, InitCap expression will now return different results for:

One-to-many case mapping (e.g. Turkish dotted I, ß, ﬁ)

How was this patch tested?

Tests in CollationSupportSuite.

Was this patch authored or co-authored using generative AI tooling?

No.

Aug 15 '24 13:08 viktorluc-db

@viktorluc-db I think the PR's description is a bit misleading. When I read it I feel like the changes are affecting only the ligatures listed, but in fact it is way more (basically we change how we titlecase 48 different characters). I think it would be nice to explain in the PR description that these are just some examples where the behavior is different).

Aug 23 '24 21:08 mkaravel

@MaxGekk ready for review

Aug 27 '24 13:08 viktorluc-db

@MaxGekk PTAL

Aug 28 '24 14:08 uros-db

[SPARK-48682][SQL][FOLLOW-UP] Changed initCap behaviour with UTF8_BINARY collation

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

Was this patch authored or co-authored using generative AI tooling?