case-insensitive: Some non-ASCII Unicode chars are not case-folded correctly.

Open fisx opened this issue 3 years ago • 3 comments

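{-# LANGUAGE OverloadedStrings #-}
-- OverloadedStrings lets the literal "\43906" be read as a CI String below.
import qualified Data.CaseInsensitive as CI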
import qualified Data.Char as Char

main :: IO ()
main = do
  print ((Char.toLower <$> ("\5042" :: String)) == "\43906")
  print ((CI.foldCase (CI.mk ("\5042" :: String))) == "\43906")

{-
*Main> :main
True
False
-}

Thanks to QuickCheck! :)

Sep 24 '21 07:09 fisx

Oh, interesting: there is Data.Text.toLower, and Data.Text.toCaseFold. But neither is compatible with CI:

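{-# LANGUAGE OverloadedStrings #-}
-- OverloadedStrings is needed for the Text and CI String literals below.
import qualified Data.CaseInsensitive as CI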
import qualified Data.Text as Text
import Prelude

main :: IO ()
main = do
  print (Text.toCaseFold "\5042" == "\43906")
  print ((CI.foldCase (CI.mk ("\5042" :: String))) == "\43906")

{-
*Main> :main
True
False
-}

Sep 24 '21 08:09 fisx

I think this is actually a bug in Text.toCaseFold: under Unicode case folding, Cherokee lowercase letters (e.g. U+AB82) fold to their uppercase counterparts (e.g. U+13B2). This is implemented incorrectly in text, because the fallback case of foldMapping in https://github.com/haskell/text/blob/master/src/Data/Text/Internal/Fusion/CaseMapping.hs converts every character without an explicit mapping to lowercase. So we get the strange (and incorrect!) behaviour that U+13B2 and U+AB82 map to each other when folding. See https://github.com/haskell/text/issues/277.
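
The mechanism described here can be modelled with base alone. What follows is a toy sketch, not the code from text: `explicitFold`, `buggyFold`, and `fixedFold` are hypothetical names, the one-entry table stands in for the real folding table, and the sketch assumes a GHC whose `Data.Char.toLower` maps U+13B2 to U+AB82 (as the first example in this thread reports).

```haskell
import Data.Char (toLower)
import Data.Maybe (fromMaybe)

-- Toy stand-in for the explicit entries of a case-folding table
-- (hypothetical, one entry only): Cherokee small letter U+AB82
-- folds to the capital letter U+13B2.
explicitFold :: Char -> Maybe Char
explicitFold '\xAB82' = Just '\x13B2'
explicitFold _        = Nothing

-- The problematic shape: any character without an explicit entry
-- falls back to toLower, so U+13B2 is sent back to U+AB82.
buggyFold :: Char -> Char
buggyFold c = fromMaybe (toLower c) (explicitFold c)

-- One possible repair (an assumption, not the fix adopted by text):
-- fall back to the identity. This is only right when the table is
-- complete; with this one-entry toy table, 'A' stays 'A'.
fixedFold :: Char -> Char
fixedFold c = fromMaybe c (explicitFold c)
```

Under these assumptions `buggyFold` swaps U+13B2 and U+AB82 on every application, while `fixedFold` sends both to U+13B2 and then stays there.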

Sep 27 '21 11:09 pcapriotti

It gets weirder:

*Wire.API.User.RichInfo Scim Data.CaseInsensitive> foldCase ("Ꮊ" :: String)
"\43914"
*Wire.API.User.RichInfo Scim Data.CaseInsensitive> foldCase it
"\5050"
*Wire.API.User.RichInfo Scim Data.CaseInsensitive> foldCase it
"\43914"
[...]
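
The oscillation in this session means the fold is not idempotent. A minimal sketch of two helpers for checking that, written against an arbitrary fold function so it needs nothing beyond the Prelude; `isStableOn` and `foldOrbit` are hypothetical names, not part of case-insensitive or text.

```haskell
-- | A fold is stable on an input if folding twice gives the same
-- result as folding once.
isStableOn :: Eq a => (a -> a) -> a -> Bool
isStableOn fold x = fold (fold x) == fold x

-- | The first n results of applying the fold repeatedly, starting
-- with a single application (mirrors repeated `foldCase it` in GHCi).
foldOrbit :: Int -> (a -> a) -> a -> [a]
foldOrbit n fold = take n . iterate fold . fold
```

With Data.CaseInsensitive in scope, `foldOrbit 3 foldCase ("Ꮊ" :: String)` should reproduce the three results of the session on an affected version, and `isStableOn foldCase` would be a natural QuickCheck property.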

Sep 27 '21 18:09 fisx
