Char.toUpper('ß') returns two characters as Char

Open malaire opened this issue 6 years ago • 6 comments

Char is defined to be a single Unicode character and Char.toUpper is defined as returning a single Char.

But Char.toUpper('ß') returns two Unicode characters as a single Char. While the returned value is correct according to the Unicode specification ('ß' is uppercased to two characters), it is not correct according to the Elm specification of Char and Char.toUpper.

$ elm repl
> Char.toUpper('ß')
'SS' : Char

UPDATE: There are also other characters where case conversion results in a different number of characters; for example, Char.toUpper('\u{FB02}') returns 'FL' : Char. The full list seems to be available at ftp://ftp.unicode.org/Public/UCD/latest/ucd/SpecialCasing.txt

malaire avatar Nov 12 '18 12:11 malaire

Comparing to other implementations, in Haskell Data.Char.toUpper :: Char -> Char returns ß unchanged, while Data.Text.toUpper :: Text -> Text converts ß to SS.

malaire avatar Nov 12 '18 15:11 malaire

It seems that German speakers had and solved the same problem.

In 2017, the Council for German Orthography ultimately adopted capital ß (ẞ) into German orthography, ending a long orthographic debate. — Wikipedia referencing Quartz

I think Elm can ~do away with~ alleviate this problem by casing Unicode U+00DF ß into U+1E9E ẞ and vice versa.

edgerunner avatar Aug 01 '20 21:08 edgerunner

I think Elm can do away with this problem by casing Unicode U+00DF ß into U+1E9E ẞ and vice versa.

That would be incompatible with the Unicode standard, so if Elm wants to follow the Unicode standard it can't do this.

Also, ß isn't the only Unicode character for which case conversion results in more characters: there are well over 100 such characters. So "fixing" this one case in a way that is not compatible with Unicode would be wrong in my opinion, as it wouldn't fix the whole problem and would also make Elm incompatible with Unicode.

(I updated OP with a comment that ß isn't the only such character.)

malaire avatar Aug 02 '20 09:08 malaire

Then it seems that we need to accept that casing cannot be Char -> Char. There are already String -> String casing functions that work as expected (probably following the JS engine implementation), and we can have others that can't technically go wrong. The first ones that come to my mind:

  • Char -> String: The simplest… Forces you to unnecessarily consider empty strings in the output
  • Char -> List Char: Technically the same, but somewhat more explicit.
  • Char -> Maybe Char: Just sweep it under the rug. Get Nothing if the Unicode casing is not a proper Char
  • Char -> ( Char, String ): Output all possibilities. You can map it as needed, but if you don't check for it, you can still get 'SS' : Char and friends, or get a single 'S' : Char for a 'ß' : Char. (Thanks @dullbananas)
  • Char -> Cased: A new type to handle specific requirements. Needs to be designed to handle all possibilities properly, and can be built to handle other casing oddities like my native Turkish dotless ı to I and i to İ. An incomplete implementation could be:
    type Cased
      -- This is the most usual case. A Char for a Char
      = Single Char
      -- This means there are more Chars after the first.
      | Multi Char Cased
      -- Lang is the alias for however we define the language; String, union etc.
      -- The first Cased is for when the language matches, the second is for when it doesn't.
      | LanguageDependent Lang Cased Cased
    
    The ß problem could be expressed as
    Char.toUpper 'ß' => Multi 'S' (Single 'S')
    
    Similarly the Turkish i can be output as
    Char.toUpper 'i' => LanguageDependent Tr (Single 'İ') (Single 'I')
    
    This is just a preliminary idea that needs further exploration to cover all the casing oddities properly. There are so many languages out there.
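
For comparison, the simpler signatures from the list above could be sketched on top of the existing String.toUpper. This is only a rough sketch: the module name and the function names toUpperList, toUpperMaybe and toUpperPair are hypothetical and are not part of elm/core, and the results rely on String.toUpper following the full Unicode case mapping.

```elm
module CasingSketch exposing (toUpperList, toUpperMaybe, toUpperPair)

-- Hypothetical helpers for illustration; none of these exist in elm/core.


-- Char -> List Char: every Char of the uppercased result.
toUpperList : Char -> List Char
toUpperList c =
    String.toList (String.toUpper (String.fromChar c))


-- Char -> Maybe Char: Nothing unless the result is exactly one Char.
toUpperMaybe : Char -> Maybe Char
toUpperMaybe c =
    case toUpperList c of
        [ single ] ->
            Just single

        _ ->
            Nothing


-- Char -> ( Char, String ): the first Char plus whatever follows it.
toUpperPair : Char -> ( Char, String )
toUpperPair c =
    case String.uncons (String.toUpper (String.fromChar c)) of
        Just pair ->
            pair

        Nothing ->
            -- Assumed fallback: uppercasing is not expected to yield "".
            ( c, "" )
```

With these, toUpperMaybe 'ß' would be expected to give Nothing, and toUpperPair 'ß' to give ( 'S', "S" ), as long as String.toUpper "ß" yields "SS".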

edgerunner avatar Aug 03 '20 20:08 edgerunner

If it shouldn't be Char -> Char then it should be Char -> ( Char, String )

dullbananas avatar Aug 03 '20 20:08 dullbananas

It might be best to keep the existing Char -> Char function and make it output the same character if the result can't be a single character, and create a separate Char -> String function.
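
A minimal sketch of that split, assuming String.toUpper already performs the full Unicode mapping. The names toUpperOrSame and toUpperString are hypothetical and only illustrate the idea; they are not existing elm/core functions.

```elm
module CharUpper exposing (toUpperOrSame, toUpperString)


-- Hypothetical Char -> String variant: may return more than one character.
toUpperString : Char -> String
toUpperString c =
    String.toUpper (String.fromChar c)


-- Hypothetical Char -> Char variant: returns the input unchanged
-- when the uppercased form is not exactly one Char.
toUpperOrSame : Char -> Char
toUpperOrSame c =
    case String.toList (toUpperString c) of
        [ single ] ->
            single

        _ ->
            c
```

Under that assumption, toUpperOrSame would leave 'ß' unchanged, which is the behavior described above for Haskell's Data.Char.toUpper, while toUpperString would still produce the two-character result.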

dullbananas avatar Aug 03 '20 20:08 dullbananas