Locale-agnostic case conversions by default

Open qurbonzoda opened this issue 3 years ago • 56 comments

This issue is to discuss the proposal to introduce a case conversion API in the standard library that does not depend on the default locale settings.

The new API includes:

  • String.uppercase(): String
  • String.lowercase(): String
  • Char.lowercase(): String, Char.lowercaseChar(): Char
  • Char.uppercase(): String, Char.uppercaseChar(): Char
  • Char.titlecase(): String, Char.titlecaseChar(): Char
  • String.replaceFirstChar(transform: (Char) -> Char): String
  • String.replaceFirstChar(transform: (Char) -> CharSequence): String

They are intended to replace the existing locale-sensitive API.

The proposal text is here.
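To make the motivation concrete, here is a minimal sketch (JVM only) contrasting the locale-agnostic and locale-taking overloads. The helper names normalizeKey and lowercaseIn are hypothetical and exist only for this illustration; the Turkish locale is used because its casing rules for the letter I differ from the root locale.

```kotlin
import java.util.Locale

// Hypothetical helper: locale-agnostic, so the result does not depend
// on the default locale of the machine the code runs on.
fun normalizeKey(key: String): String = key.lowercase()

// Hypothetical helper: locale-sensitive, the caller names the locale explicitly.
fun lowercaseIn(text: String, languageTag: String): String =
    text.lowercase(Locale.forLanguageTag(languageTag))

fun main() {
    println(normalizeKey("TITLE"))       // title
    println(lowercaseIn("TITLE", "tr"))  // tıtle (dotless ı under Turkish rules)
}
```

With the older locale-sensitive functions, the second result could appear silently on a machine whose default locale is Turkish, which is the kind of surprise the proposal tries to remove.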

qurbonzoda avatar Oct 05 '20 01:10 qurbonzoda

What about uppercaseFirst and lowercaseFirst instead of capitalizeFirst and decapitalizeFirst? Looks more intuitive since capitalization is related to the whole string, so (de)capitalizeFirst sounds strange.

edrd-f avatar Oct 09 '20 16:10 edrd-f

The proposal makes sense to me. I agree that (de)capitalizeFirst sounds strange as it expresses the same effect twice.

Additionally, such functions typically name their result (not the process), so I'd suggest these replacements:

  • String.capitalized() and String.capitalized(locale: Locale)
  • String.decapitalized() and String.decapitalized(locale: Locale)

OliverO2 avatar Oct 09 '20 17:10 OliverO2

Problem is that it's only one letter from the original (de)capitalize() functions, making it easy to use the wrong one, especially with IDE autocomplete.

edrd-f avatar Oct 09 '20 18:10 edrd-f

True, that's where I'd hope that the Kotlin IDE plugin will always point to the right version and steer users away from the deprecated one.

OliverO2 avatar Oct 09 '20 19:10 OliverO2

String.func(): String

Such an approach involves calling existingCharSequence.toString().func() and creating two copies. It looks more thrifty to implement a CharSequence that proxies to the original sequence, calling toUpperCase/toTitleCase/toLowerCase for the characters that require it; one could then call toString explicitly to evaluate it eagerly into a 'normal' String, if this is required.

Caveats:

  • if a (de)capitalized code point occupies a different number of UTF-16 words, well, this is a disaster;
  • Android has its own GetChars interface which would be nice to implement.
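A rough sketch of the proxying idea, under the assumption that a per-Char mapping is good enough. LowercasedView is a hypothetical name, not part of any library, and the sketch deliberately ignores the first caveat: mappings that change the number of UTF-16 units are not handled.

```kotlin
// Hypothetical lazy view: characters are lowercased on access, nothing is copied up front.
class LowercasedView(private val source: CharSequence) : CharSequence {
    override val length: Int
        get() = source.length

    // Per-Char mapping only; multi-char mappings and surrogate pairs are not handled.
    override fun get(index: Int): Char = source[index].lowercaseChar()

    override fun subSequence(startIndex: Int, endIndex: Int): CharSequence =
        LowercasedView(source.subSequence(startIndex, endIndex))

    // Eager evaluation into a regular String, only when the caller asks for it.
    override fun toString(): String {
        val builder = StringBuilder(length)
        for (i in 0 until length) builder.append(this[i])
        return builder.toString()
    }
}

fun main() {
    val view: CharSequence = LowercasedView(StringBuilder("KOTLIN"))
    println(view[0])          // k
    println(view.toString())  // kotlin
}
```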

What about uppercaseFirst

@edrd-f, first character should be transformed to title case, not to upper case.

Miha-x64 avatar Oct 10 '20 13:10 Miha-x64

I'd also prefer to avoid ed endings like in uppercased. Most Kotlin API doesn't use it. E.g. it's map and filter, not mapped and filtered.

In the past capitalize was quite confusing. I've always assumed that it affects the entire string. So the First suffix is welcome here.

Until @Miha-x64's comment I didn't even know that titlecase is a thing in Unicode and differs from uppercase. Since unicode.org calls it titlecase, why not have it consistent and use titlecaseFirst?

As for the opposite, I wonder whether this is actually useful. Titlecasing or uppercasing the first character isn't always reversible because there's information loss.

"ß".capitalize().decapitalize() // sS
"ß".uppercaseFirst().lowercaseFirst() // sS // hypothetical

Oops.

I'd deprecate and then remove decapitalize without replacement.
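The same information loss can be reproduced with the proposed replaceFirstChar. In this sketch, uppercaseFirst and lowercaseFirst are hypothetical helpers named after the suggestions in this thread, not functions that exist in the standard library.

```kotlin
// Hypothetical helpers built on replaceFirstChar with the String-returning Char functions.
fun String.uppercaseFirst(): String = replaceFirstChar { it.uppercase() }
fun String.lowercaseFirst(): String = replaceFirstChar { it.lowercase() }

fun main() {
    val upper = "ß".uppercaseFirst()
    println(upper)                   // SS
    println(upper.lowercaseFirst())  // sS, not the original ß
}
```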

fluidsonic avatar Oct 11 '20 05:10 fluidsonic

Why would capitalize transform the first to title case and not simply uppercase? Is this part of the capitalize meaning in Unicode? If so, or if its meaning is not clear, then we should shy away from introducing them. I mean, the whole reasoning behind this KEEP is around APIs that are hard to use because what they actually do is only explained in their comment.

I think that the following...

fun String.uppercase(): String
fun String.uppercaseFirst(): String
fun String.lowercase(): String
fun String.lowercaseFirst(): String

...creates a nice API that has symmetry and autocompletes perfectly. Sure, some Unicode characters won't go in both directions but that's a Unicode problem and not ours to solve. (There is a capital ẞ, hence, they are sometimes solving these issues.)

Also, a function that transforms only the first char to title case should be called titlecaseFirst and not capitalize for the same reasons and least surprise.

I really like the proposal and fully support it. It's cumbersome to explain this in every PR.

Fleshgrinder avatar Oct 11 '20 19:10 Fleshgrinder

What's the use case for uppercase/lowercaseFirst?

fluidsonic avatar Oct 11 '20 19:10 fluidsonic

snake_case.split("_").map { it.capitalize() }.joinToString("")

// outputs camelcase
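A runnable variant of the one-liner above, written against the proposed API rather than the deprecated capitalize(). The function name snakeToUpperCamel is made up for this sketch, and it assumes ASCII identifiers, where uppercase and titlecase coincide.

```kotlin
// Hypothetical helper: converts snake_case to UpperCamelCase.
// Empty segments (from doubled underscores) are simply dropped by the join.
fun snakeToUpperCamel(name: String): String =
    name.split("_").joinToString("") { part ->
        part.replaceFirstChar { it.uppercaseChar() }
    }

fun main() {
    println(snakeToUpperCamel("flux_capacitor"))  // FluxCapacitor
}
```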

On 2020/10/11 21:59, Marc Knaup wrote:

What's the use case for uppercase/lowercaseFirst?


janvladimirmostert avatar Oct 11 '20 20:10 janvladimirmostert

FYI capitalize as it is implemented today uses titlecase. Uppercase only as a fallback.

fluidsonic avatar Oct 11 '20 20:10 fluidsonic

Capitalize is quite useful when doing code translation and translating someone's variable names from snake case (Python to Kotlin translation, or in one case, I parsed somebody's ANTLR file and generated a cleaner ANTLR file stripping all snake_case). I've not used it for anything else though, but have not had issues with it either (maybe since everything was 100% English).

Not sure what one would use decapitalizeFirst for, unless it's more efficient than just lowercasing the whole String?

Looks interesting

janvladimirmostert avatar Oct 11 '20 20:10 janvladimirmostert

UpperCamelCase becomes lowerCamelCase with lowercaseFirst so it clearly provides the same value. Same for lowerCamelCase to UpperCamelCase, but only if it's uppercase and not title case.

Fleshgrinder avatar Oct 11 '20 21:10 Fleshgrinder

There is a lot of confusion around titlecase vs uppercase. As @fluidsonic has already mentioned, here is what Unicode says:

Titlecase takes its name from the case format used when forming a title, in which the initial letter in a 
word is capitalized and the rest are not. Titlecase is also used in forming a sentence by capitalizing 
the first word, and for forming proper names. The titlecase mapping in the Unicode Standard is the mapping 
applied to the initial character in a word.

The titlecase mapping in Unicode differs from the uppercase mapping in that a number of characters 
require special handling. These are chiefly ligatures and digraphs such as 'fl', 'dz', and 'lj', plus a number
of polytonic Greek characters. For example, U+01C7 (LJ) maps to U+01C8 (Lj) rather than to U+01C9 (lj).
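The digraph case from the quoted passage can be checked directly with the Char functions from the proposal. This is a small sketch; CaseForms and caseForms are names invented for the illustration, and the code points are written as escapes to avoid any ambiguity about which character is meant.

```kotlin
// Hypothetical holder for the three simple (one Char to one Char) case forms.
data class CaseForms(val lower: Char, val title: Char, val upper: Char)

fun caseForms(c: Char): CaseForms =
    CaseForms(c.lowercaseChar(), c.titlecaseChar(), c.uppercaseChar())

fun main() {
    // U+01C7 is the all-capital LJ digraph; its titlecase form is U+01C8, not itself.
    val forms = caseForms('\u01C7')
    println(forms.title == '\u01C8')  // true
    println(forms.upper == '\u01C7')  // true
    println(forms.lower == '\u01C9')  // true
}
```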

qurbonzoda avatar Oct 13 '20 23:10 qurbonzoda

We have two main use cases for capitalizeFirst.

  1. To transform identifiers in code. In this case titlecase and uppercase are likely equal, because identifiers are usually ASCII chars.
  2. To capitalize the first letter of a sentence, name, book title, and so on. Here titlecase may differ from uppercase.

titlecaseFirst naming could be used instead of capitalizeFirst, but the function named capitalizeFirst in this proposal titlecases only lowercase letters, e.g. "DŽ".capitalizeFirst() == "DŽ" while 'DŽ'.titlecase() == 'Dž'. This is done to avoid changing case of the first char in an ALL-CAPS word.
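The behaviour described here, titlecasing the first character only when it is lowercase, can be sketched on top of replaceFirstChar. The name capitalizeFirstSketch is hypothetical; this is one way to express the rule, not necessarily how the proposal implements it.

```kotlin
// Hypothetical sketch: titlecase the first char only if it is a lowercase letter,
// so that the first char of an ALL-CAPS word is left alone.
fun String.capitalizeFirstSketch(): String =
    replaceFirstChar { if (it.isLowerCase()) it.titlecase() else it.toString() }

fun main() {
    println("kotlin".capitalizeFirstSketch())                    // Kotlin
    // U+01C4 (capital digraph) is not lowercase, so it stays as it is.
    println("\u01C4EP".capitalizeFirstSketch() == "\u01C4EP")    // true
    // U+01C6 (small digraph) becomes the titlecase digraph U+01C5.
    println("\u01C6ep".capitalizeFirstSketch() == "\u01C5ep")    // true
}
```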

Another problem with titlecaseFirst is its reverse operation name, untitlecaseFirst, maybe. decapitalizeFirst is the same as lowercaseFirst would be.

qurbonzoda avatar Oct 13 '20 23:10 qurbonzoda

There are three transformations for characters: "titlecase", "uppercase" and "lowercase". So it makes sense to offer three transformation methods:

  • titlecaseFirst() transforms the first character to titlecase. There is no "inverse" to titlecasing.
  • uppercaseFirst() transforms the first character to uppercase. There is no "inverse" to uppercasing.
  • lowercaseFirst() transforms the first character to lowercase. There is no "inverse" to lowercasing.

With these three options the user can freely switch between all three variants. No need for an arbitrary de-something() function with a weird name.

That approach would also address the issue with .capitalize().decapitalize(). It looks like it will always return the initial string while in fact it won't always do so. .titlecaseFirst().lowercaseFirst() is a little less prone to that because we only guarantee that the first letter will be lowercased, not that we undo the entire capitalization. E.g. in the latter example it's clear why ß -> SS -> sS while in the former example the user would likely expect ß -> SS -> ß.

I know that ß at the beginning of a word is unlikely. I just use it as an example because I happen to have that character in my primary language. I'm sure that there are more cases in other languages :)

Note that there's also potential for confusion for developers who're using other programming languages. Swift's .capitalized for example capitalizes all words, not just the first character of the string. Also, as per their documentation, they use uppercase instead of titlecase.


You've mentioned another problem with titlecaseFirst() for strings/words that are, for example, all-caps.

We don't know the user's intention if they call titlecaseFirst(). If that method has exactly that name then it should also do exactly what the name states: titlecase the first character. Not under some condition that isn't obvious at first.

What we could offer is an option to ignore uppercase, e.g. titlecaseFirst(ignoreUppercase = true). Then it's explicit. Alternatively we could call the function capitalizeFirst() and make it clear in the documentation that uppercased characters aren't titlecased. We still don't need decapitalizeFirst() as the user can simply use .lowercaseFirst().

fluidsonic avatar Oct 14 '20 00:10 fluidsonic

I second what @fluidsonic wrote, this makes perfect sense. So all in all we would end up with the following API, correct?

// KOTLIN 🠒 kotlin
fun String.lowercase(): String
fun String.lowercase(locale: Locale): String

// KOTLIN 🠒 kOTLIN
fun String.lowercaseFirst(): String
fun String.lowercaseFirst(locale: Locale): String

// kotlin 🠒 KOTLIN
fun String.uppercase(): String
fun String.uppercase(locale: Locale): String

// kotlin 🠒 Kotlin
fun String.uppercaseFirst(): String
fun String.uppercaseFirst(locale: Locale): String

// fluX caPaciTor 🠒 Flux Capacitor
fun String.titlecase(): String
fun String.titlecase(locale: Locale): String

// fluX caPaciTor 🠒 FluX caPaciTor
fun String.titlecaseFirst(): String
fun String.titlecaseFirst(locale: Locale): String

I agree that titlecaseFirst does not look useful, however, I would include it even if it is just for API symmetry.

Fleshgrinder avatar Oct 14 '20 07:10 Fleshgrinder

  1. What's the use case for differentiating between titlecaseFirst and uppercaseFirst if it doesn't matter in technical contexts (ASCII) and title-casing is the correct form otherwise?

  2. If titlecase() would involve title-casing multiple words after word-splitting, we might enter more complicated locale-specific territory. E.g., in English, we'd have to deal with hyphenated words like Kotlin-Specific. This would probably outgrow the standard library scope. If so, we should drop the ...First() variants.

  3. Naming the methods as suggested above indicates they would mutate the object instead of returning a new one, which is inconsistent with Kotlin Coding Conventions – Choosing good names:

    The name should also suggest if the method is mutating the object or returning a new one. For instance sort is sorting a collection in place, while sorted is returning a sorted copy of the collection.

  4. I'd be in favor of a simple API, which is consistent with the Coding Conventions and sticks with title-casing in its locale-specific variants:

    • uppercased()
    • lowercased()
    • capitalized()
    • decapitalized() – only if there is a sound use case for having this one instead of lowercased()

OliverO2 avatar Oct 14 '20 08:10 OliverO2

Title and upper are not the same for the following characters:

char title upper
DŽ Dž DŽ
Dž Dž DŽ
dž Dž DŽ
LJ Lj LJ
Lj Lj LJ
lj Lj LJ
NJ Nj NJ
Nj Nj NJ
nj Nj NJ
DZ Dz DZ
Dz Dz DZ
dz Dz DZ
ა ა Ა
ბ ბ Ბ
გ გ Გ
დ დ Დ
ე ე Ე
ვ ვ Ვ
ზ ზ Ზ
თ თ Თ
ი ი Ი
კ კ Კ
ლ ლ Ლ
მ მ Მ
ნ ნ Ნ
ო ო Ო
პ პ Პ
ჟ ჟ Ჟ
რ რ Რ
ს ს Ს
ტ ტ Ტ
უ უ Უ
ფ ფ Ფ
ქ ქ Ქ
ღ ღ Ღ
ყ ყ Ყ
შ შ Შ
ჩ ჩ Ჩ
ც ც Ც
ძ ძ Ძ
წ წ Წ
ჭ ჭ Ჭ
ხ ხ Ხ
ჯ ჯ Ჯ
ჰ ჰ Ჰ
ჱ ჱ Ჱ
ჲ ჲ Ჲ
ჳ ჳ Ჳ
ჴ ჴ Ჴ
ჵ ჵ Ჵ
ჶ ჶ Ჶ
ჷ ჷ Ჷ
ჸ ჸ Ჸ
ჹ ჹ Ჹ
ჺ ჺ Ჺ
ჽ ჽ Ჽ
ჾ ჾ Ჾ
ჿ ჿ Ჿ
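A list like the one above can be regenerated with a short scan over all Char values. This is only a sketch: titleUpperMismatches is a hypothetical name, it looks at single UTF-16 units and simple one-to-one mappings only, and the exact result depends on the Unicode version of the runtime (older JVMs may not know the Georgian capital letters, for example).

```kotlin
// Hypothetical helper: every Char whose simple titlecase mapping differs
// from its simple uppercase mapping on the current runtime.
fun titleUpperMismatches(): List<Char> =
    (Char.MIN_VALUE..Char.MAX_VALUE).filter { it.titlecaseChar() != it.uppercaseChar() }

fun main() {
    for (c in titleUpperMismatches()) {
        println("$c ${c.titlecaseChar()} ${c.uppercaseChar()}")
    }
}
```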

The remark on the coding conventions is very correct, so we would have:

// KOTLIN 🠒 kotlin
fun String.lowercased(): String
fun String.lowercased(locale: Locale): String

// KOTLIN 🠒 kOTLIN
fun String.lowercasedFirst(): String
fun String.lowercasedFirst(locale: Locale): String

// kotlin 🠒 KOTLIN
fun String.uppercased(): String
fun String.uppercased(locale: Locale): String

// kotlin 🠒 Kotlin
fun String.uppercasedFirst(): String
fun String.uppercasedFirst(locale: Locale): String

// fluX caPaciTor 🠒 Flux Capacitor
fun String.titlecased(): String
fun String.titlecased(locale: Locale): String

// fluX caPaciTor 🠒 FluX caPaciTor
fun String.titlecasedFirst(): String
fun String.titlecasedFirst(locale: Locale): String

Your remark regarding title case is very true, it's not simple. So maybe it makes sense to not have any title case variation at all.

Fleshgrinder avatar Oct 14 '20 08:10 Fleshgrinder

Naming the methods as suggested above indicates they would mutate the object instead of returning a new one, which is inconsistent with Kotlin Coding Conventions – Choosing good names:

Immutable types typically do not follow that coding convention, they cannot be mutated in-place because of their nature. For instance, fun Int.and(other: Int): Int, fun Int.inc(): Int, fun String.trim(): String, fun String.drop(n: Int): String.
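A small sketch of the distinction being discussed, using only standard library functions: String is immutable, so trim() can only return a new value, whereas a mutable list has both an in-place sort() and a copying sorted().

```kotlin
fun main() {
    val text = "  kotlin  "
    check(text.trim() == "kotlin")      // the result is a new String
    check(text == "  kotlin  ")         // the receiver is unchanged

    val numbers = mutableListOf(3, 1, 2)
    check(numbers.sorted() == listOf(1, 2, 3))  // sorted() returns a sorted copy
    check(numbers == listOf(3, 1, 2))           // the original order is kept
    numbers.sort()                              // sort() reorders in place
    check(numbers == listOf(1, 2, 3))
    println("all checks passed")
}
```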

qurbonzoda avatar Oct 14 '20 15:10 qurbonzoda

What are use cases for uppercaseFirst and titlecaseFirst?

As I have mentioned above, we have two main use cases for capitalizeFirst. In the first case both uppercaseFirst and titlecaseFirst can be used. In the second case only conversion from lowercase to titlecase arguably makes sense. That's why capitalizeFirst has that "strange" behavior.

My question is what value does introducing both uppercaseFirst and titlecaseFirst add? It's good that they explicitly state what they actually do, but use cases are more important.

fun String.titlecaseFirst(ignoreUppercase: Boolean = true): String might be a good option, though we don't have any use cases for ignoreUppercase to accept any value other than true.

qurbonzoda avatar Oct 14 '20 16:10 qurbonzoda

// fluX caPaciTor -> Flux Capacitor
fun String.titlecased(): String
fun String.titlecased(locale: Locale): String

This behavior may be confusing as lowercase() and uppercase() convert every letter in the receiver string to corresponding case.

qurbonzoda avatar Oct 14 '20 16:10 qurbonzoda

Use cases for uppercaseFirst and lowercaseFirst for me are those that were already mentioned: converting something from lowerCamelCase to UpperCamelCase and UpperCamelCase to lowerCamelCase. These are extremely common conversions in the Kotlin world due to our naming conventions.

I do not have any use case for titlecase or titlecaseFirst and am actually against them, especially after @OliverO2’s remark regarding additional rules that would need to be honored. I think these kinds of conversions are better left to libraries like ICU4j.

As I have mentioned above, we have two main use cases for capitalizeFirst. […] In the second case only conversion from lowercase to titlecase arguably makes sense.

The proposed titlecaseFirst does not work for that use case because it does not properly title case the title of a book, especially not if it only title cases the first letter.

This behavior may be confusing as lowercase() and uppercase() convert every letter in the receiver string to corresponding case.

For me the definition of title case follows https://en.wikipedia.org/wiki/Title_case and thus the illustrated behavior is what I would expect if I call them.

Fleshgrinder avatar Oct 14 '20 17:10 Fleshgrinder

@qurbonzoda

Immutable types typically do not follow that coding convention, they cannot be mutated in-place because of their nature. For instance, fun Int.and(other: Int): Int, fun Int.inc(): Int, fun String.trim(): String, fun String.drop(n: Int): String.

I don't know what the idea was at the time when the above names were chosen. Maybe some familiarity with other languages? (I'd not be surprised if we'd find methods inconsistent with the coding convention even in mutable contexts for historic reasons.)

As learning Kotlin is meant to be a fun experience for new developers, consistency helps to reduce the cognitive load and makes everything more enjoyable. So wouldn't this API redesign be an excellent opportunity to choose consistent naming as suggested by the coding conventions?

If I did not overlook something, capitalized() and decapitalized() seem to cover all use cases mentioned so far if they were using title-case conversion on the initial letter. In ASCII contexts, it doesn't matter. With Unicode ligatures, title-casing/lowercasing the initial letter would always be the correct way. I'd prefer the First postfix to be omitted, as we are dealing with string objects, not the individual words they might contain.

More exotic use cases would probably better be covered by extension functions, either in a specialized library or in user code.

OliverO2 avatar Oct 14 '20 20:10 OliverO2

There are plenty of examples for non-"ed" names, esp. around Strings, Collections, Sequences and Flows. So it's kinda consistent.

  • .capitalize()
  • .drop…()
  • .encodeTo…()
  • .replace…()
  • .slice()
  • .trim()
  • .map…()
  • .associate…()

Exceptions:

  • .sorted…()

You get the idea.

I agree that it would be nice to have a clear distinction between function names that return a copy and function names that modify the instance directly. That's not easy to define. We'd have to consider what the majority of usage is (likely immutable), what's seen in other languages (Swift has a clear distinction) and also how the language will evolve (val class may need special consideration). But that's a separate and quite large issue.

I'd stick with Kotlin's default for now, which is using …ed only when there's a conflict (like with .sort()).

fluidsonic avatar Oct 14 '20 21:10 fluidsonic

Valid examples. Note that indicating non-mutation is not limited to ed endings. For example, using prefixes like in as... and with... also works:

  • Iterable<T>.asSequence()

But I agree, this is a wider issue.

OliverO2 avatar Oct 15 '20 09:10 OliverO2

as… doesn't refer to mutability but rather that it wraps an existing value and uses it internally even after the function has returned. That's relevant for example if Iterable can only be iterated once.
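The wrapping behaviour can be observed by mutating the source after the call. A minimal sketch using only standard library functions; viewVersusCopy is a name invented for the illustration.

```kotlin
// Hypothetical demo: asSequence() keeps using the source, toList() takes a snapshot.
fun viewVersusCopy(): Pair<List<Int>, List<Int>> {
    val source = mutableListOf(1, 2)
    val view = source.asSequence()  // wraps `source`, copies nothing
    val copy = source.toList()      // independent snapshot
    source.add(3)                   // mutate after both were created
    return view.toList() to copy
}

fun main() {
    val (fromView, fromCopy) = viewVersusCopy()
    println(fromView)  // [1, 2, 3]
    println(fromCopy)  // [1, 2]
}
```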

fluidsonic avatar Oct 15 '20 12:10 fluidsonic

@OliverO2

I'd prefer the First postfix to be omitted, as we are dealing with string objects, not the individual words they might contain.

The First postfix refers to the first letter of the receiver String (or the first word, seems irrelevant). It is added to distinguish the new function from the old locale-sensitive variant.

P.S. Some users were also confused by the old naming, because capitalize could mean <capitalize the first letter of each word> or <capitalize all letters in this string>.

qurbonzoda avatar Oct 19 '20 11:10 qurbonzoda

@qurbonzoda As this seems to be a wider topic, maybe it's best to discuss priorities with respect to different naming options within the entire Kotlin language team, e.g.

  • When is it appropriate to deviate from the coding conventions? (Are preexisting cases a justification?)
  • Should the coding conventions receive an update with respect to immutable types?
  • When names change:
    • Should new names be chosen primarily to be more distinguishable from previous names, making the transition easier for non-IDE users?
    • Or should better long-term naming (consistency) be a priority even if that means relying more on IDE deprecation help?

OliverO2 avatar Oct 19 '20 12:10 OliverO2

The First postfix refers to the first letter of the receiver String (or the first word, seems irrelevant).

It's not irrelevant and at least for me capitalize is confusing because capitalization is about writing the first letter in capital and everything else in lower: https://en.wikipedia.org/wiki/Capitalization

The uppercase first is less ambiguous here, although the question regarding letter, code point, word, sentence, … remains. Speaking of which, we are very concerned regarding Unicode, and I assume UTF-8 here because that is the Kotlin default. However, all APIs we have been talking about so far are UTF-16 based and char does not necessarily represent a single code point here. In other words: no matter what we call it, it is ambiguous unless we want to have capitalizeFirstUtf16CodePointToTitleOrUpperCaseDependingOnWhetherTheyAreTheSameOrNot. 😝
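The Char versus code point issue shows up with characters outside the Basic Multilingual Plane. The sketch below uses a Deseret letter (U+10428, stored as a surrogate pair) as an example; both function names are hypothetical, and the code-point-aware variant is just one possible workaround, not something the proposal defines.

```kotlin
// Char-based: operates on the first UTF-16 unit only. A lone surrogate has no case mapping.
fun firstCharUppercased(s: String): String =
    s.replaceFirstChar { it.uppercaseChar() }

// Code-point-aware: uppercases the first one or two UTF-16 units as a whole.
fun firstCodePointUppercased(s: String): String {
    if (s.isEmpty()) return s
    val end = if (s.length >= 2 && s[0].isHighSurrogate() && s[1].isLowSurrogate()) 2 else 1
    return s.substring(0, end).uppercase() + s.substring(end)
}

fun main() {
    val deseretSmall = "\uD801\uDC28"    // U+10428, one code point, two Chars
    val deseretCapital = "\uD801\uDC00"  // U+10400
    println(firstCharUppercased(deseretSmall) == deseretSmall)         // true: unchanged
    println(firstCodePointUppercased(deseretSmall) == deseretCapital)  // true
}
```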

Fleshgrinder avatar Oct 19 '20 12:10 Fleshgrinder

Before writing this KEEP, we in the Kotlin Libraries Team had a discussion of the API. Here are the rationales behind some of our choices that were debated:

Why did we choose verb "capitalize" over "titlecase"?

There are many different interpretations of the words:

  • Python capitalize - first character capitalized and the rest lowercased. Changed in version 3.8: The first character is now put into titlecase rather than uppercase.
  • Obj-C/Swift capitalized - first character in each word changed to its corresponding uppercase value, and all remaining characters set to their corresponding lowercase values.
  • C# ToTitleCase - converts the first character of a word to uppercase and the rest of the characters to lowercase (except for words that are entirely in uppercase, which are considered to be acronyms).
  • Golang ToTitle - all Unicode letters are mapped to their Unicode title case.

There is no correct name for the operation that String.capitalize() currently performs, and neither of the considered words provides a consistent meaning. People coming from different programming languages could interpret the same name differently. So we decided to stick with our current capitalize verb to describe all the features of the operation.
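To show how far these readings diverge on the same input, here is a sketch of two of them in Kotlin. Both functions are hypothetical approximations of the behaviours described in the list above, not the actual implementations in those languages; the second one splits on single spaces only.

```kotlin
// Approximates the Python-style reading: first character titlecased, the rest lowercased.
fun pythonStyleCapitalize(s: String): String =
    s.lowercase().replaceFirstChar { it.titlecase() }

// Approximates the Obj-C/Swift-style reading: the same rule applied to every word.
fun swiftStyleCapitalized(s: String): String =
    s.split(" ").joinToString(" ") { word -> pythonStyleCapitalize(word) }

fun main() {
    val input = "fluX caPaciTor"
    println(pythonStyleCapitalize(input))  // Flux capacitor
    println(swiftStyleCapitalized(input))  // Flux Capacitor
}
```

Kotlin's existing capitalize(), by contrast, would leave everything after the first character untouched.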

Why did we add First postfix?

Firstly, to distinguish from the old function. Secondly, to add a hint that only the first character is converted.

Why did we choose not to use -ed ending?

Looking at the String API we saw few -ed functions: chunked, reversed, windowed and many functions without -ed: trim, trimStart, trimEnd, format, zip, take, takeWhile, split, slice, replace, remove, padStart, padEnd, and many more. So we decided not to focus much on naming consistency. If we had selected the -ed variant, the First postfix would have been dropped, as capitalizedFirst sounds bad and would be confusing.

Why not deprecate and then remove decapitalize without replacement?

We didn't consider this option in our discussion. We mainly concentrated on providing replacements for the locale-sensitive API. Looking at the usages in the Kotlin project, there are many of them, though mainly for code generation and translation (which was predictable, considering the compiler). I believe there are also some use cases in business logic, e.g. decapitalizing a sentence.

What is the result of the discussion?

We understood that the chosen names are debatable and there might be something we overlooked. Therefore we decided to mark the proposed functions as experimental, so that we can rename them or change their behavior after receiving feedback from users. We decided to write this KEEP and invite our community for feedback.

qurbonzoda avatar Oct 20 '20 02:10 qurbonzoda