Char.toUpper('ß') returns two characters as Char
Char is defined to be a single Unicode character, and Char.toUpper is defined as returning a single Char. But Char.toUpper('ß') returns two Unicode characters as a single Char. While the returned value is correct according to the Unicode specification ('ß' is uppercased to two characters), it is not correct according to the Elm specification of Char and Char.toUpper.
$ elm repl
> Char.toUpper('ß')
'SS' : Char
UPDATE: There are also other characters for which case conversion results in a different number of characters; for example, Char.toUpper('\u{FB02}') returns 'FL' : Char. The full list seems to be available at ftp://ftp.unicode.org/Public/UCD/latest/ucd/SpecialCasing.txt
Compared with other implementations: in Haskell, Data.Char.toUpper :: Char -> Char returns ß unchanged, while Data.Text.toUpper :: Text -> Text converts ß to SS.
It seems that German speakers had and solved the same problem.
In 2017, the Council for German Orthography ultimately adopted capital ß (ẞ) into German orthography, ending a long orthographic debate. — Wikipedia referencing Quartz
I think Elm can ~do away with~ alleviate this problem by casing Unicode U+00DF ß into U+1E9E ẞ and vice versa.
"I think Elm can do away with this problem by casing Unicode U+00DF ß into U+1E9E ẞ and vice versa."
That would be incompatible with the Unicode standard, so if Elm wants to follow the Unicode standard it can't do this.
Also, ß isn't the only Unicode character for which case conversion results in more characters; there are well over 100 such characters. So "fixing" this one case in a way that is not compatible with Unicode would be wrong in my opinion, as that wouldn't fix the whole problem and would also make Elm incompatible with Unicode.
(I updated OP with a comment that ß isn't the only such character.)
Then it seems that we need to accept that casing cannot be Char -> Char. There are already String -> String casing functions that work as expected (probably following the JS engine implementation), and we can have others that can't technically go wrong. The first ones that come to my mind:
- Char -> String: The simplest… Forces you to unnecessarily consider empty strings in the output.
- Char -> List Char: Technically the same, but somewhat more explicit.
- Char -> Maybe Char: Just sweep it under the rug. Get Nothing if the Unicode casing is not a proper Char.
- Char -> ( Char, String ): Output all possibilities. You can map it as needed, but if you don't check for it, you still can get 'SS' : Char and friends, or get a single 'S' : Char for a 'ß' : Char. (Thanks @dullbananas)
- Char -> Cased: A new type to handle specific requirements. Needs to be designed to handle all possibilities properly, and can be built to handle other casing oddities like my native Turkish dotless ı to I and i to İ. An incomplete implementation could be:

    type Cased
        -- This is the most usual case. A Char for a Char
        = Single Char
        -- This means there are more Chars after the first.
        | Multi Char Cased
        -- Lang is the alias for however we define the language; String, union etc.
        -- The first Cased is for when the language matches, the second is for when it doesn't.
        | LanguageDependent Lang Cased Cased

The ß problem could be expressed as

    Char.toUpper 'ß' => Multi 'S' (Single 'S')

Similarly, the Turkish i can be output as

    Char.toUpper 'i' => LanguageDependent Tr (Single 'İ') (Single 'I')

This is just a preliminary idea that needs further exploration to cover all the casing oddities properly. There are so many languages out there.
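For comparison, the three simplest signatures from the list could be sketched in user code on top of the existing String.toUpper. This is only a rough sketch: it assumes String.toUpper applies the full Unicode mapping (as it appears to, via the JS engine), and the module and function names are hypothetical, not part of elm/core.

```elm
module CasingSketch exposing (toUpperList, toUpperMaybe, toUpperString)

-- Hypothetical helpers; none of these exist in elm/core.


-- Char -> String: defer to the String-level function, which may
-- return more than one character (assumed: "SS" for 'ß').
toUpperString : Char -> String
toUpperString c =
    String.toUpper (String.fromChar c)


-- Char -> List Char: the same result, split into individual Chars.
toUpperList : Char -> List Char
toUpperList c =
    String.toList (toUpperString c)


-- Char -> Maybe Char: Just only when the result is exactly one Char,
-- Nothing when the uppercase form needs several characters.
toUpperMaybe : Char -> Maybe Char
toUpperMaybe c =
    case toUpperList c of
        [ single ] ->
            Just single

        _ ->
            Nothing
```

Under that assumption, toUpperMaybe 'a' would give Just 'A', while toUpperMaybe 'ß' would give Nothing instead of a malformed Char.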
If it shouldn't be Char -> Char then it should be Char -> ( Char, String ).
It might be best to keep the existing Char -> Char function and make it output the same character if the result can't be a single character, and create a separate Char -> String function.
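A minimal sketch of that last suggestion, written as user code rather than as a change to elm/core. The names toUpperOrSame and toUpperFull are hypothetical, and the sketch assumes String.toUpper already returns the full multi-character result.

```elm
module CasingFallback exposing (toUpperFull, toUpperOrSame)

-- Hypothetical names; not part of elm/core.


-- The separate Char -> String function: always well-typed, because a
-- String may hold any number of characters.
toUpperFull : Char -> String
toUpperFull c =
    String.toUpper (String.fromChar c)


-- The Char -> Char function: returns the uppercase form only when it
-- is a single character, otherwise returns the input unchanged
-- (similar to what Haskell's Data.Char.toUpper does for 'ß').
toUpperOrSame : Char -> Char
toUpperOrSame c =
    case String.toList (toUpperFull c) of
        [ single ] ->
            single

        _ ->
            c
```

With this shape, callers who only need a Char never see a two-character value, and callers who need the full Unicode result opt in through the String-returning function.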