case-insensitive
Some non-ASCII Unicode characters are not case-folded correctly.
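{-# LANGUAGE OverloadedStrings #-}
-- OverloadedStrings lets the literal on the right of (==) be read as a CI String.
import qualified Data.CaseInsensitive as CI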
import qualified Data.Char as Char
main :: IO ()
main = do
print ((Char.toLower <$> ("\5042" :: String)) == "\43906")
print ((CI.foldCase (CI.mk ("\5042" :: String))) == "\43906")
{-
*Main> :main
True
False
-}
Thanks to QuickCheck! :)
Oh, interesting: there is Data.Text.toLower, and Data.Text.toCaseFold. But neither is compatible with CI:
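{-# LANGUAGE OverloadedStrings #-}
-- OverloadedStrings is needed for the Text and CI String literals below.
import qualified Data.CaseInsensitive as CI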
import qualified Data.Text as Text
import Prelude
main :: IO ()
main = do
print (Text.toCaseFold "\5042" == "\43906")
print ((CI.foldCase (CI.mk ("\5042" :: String))) == "\43906")
{-
*Main> :main
True
False
-}
I think this is actually a bug in Text.toLower: Cherokee lowercase letters (e.g. U+AB82) fold to their uppercase counterparts (e.g. U+13B2). This is implemented incorrectly in text, since the fallback case of foldMapping in https://github.com/haskell/text/blob/master/src/Data/Text/Internal/Fusion/CaseMapping.hs converts every character to lowercase. So we get the strange (and incorrect!) behaviour that U+13B2 and U+AB82 map to each other when folding. See https://github.com/haskell/text/issues/277.
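For comparison, here is a minimal sketch of what the fold should do for Cherokee, assuming the ranges given in Unicode's CaseFolding.txt (small letters U+AB70..U+ABBF fold to U+13A0..U+13EF, and U+13F8..U+13FD fold to U+13F0..U+13F5). foldCherokee is a hypothetical helper written for illustration, not something text or case-insensitive exports, and it leaves every other character untouched:

```haskell
module Main (main) where

import Data.Char (chr, ord)

-- | Hypothetical helper: simple case folding restricted to Cherokee.
-- Small letters fold to the corresponding capital letters; capital
-- letters and all non-Cherokee characters are returned unchanged.
foldCherokee :: Char -> Char
foldCherokee c
  | c >= '\xAB70' && c <= '\xABBF' = shift 0xAB70 0x13A0
  | c >= '\x13F8' && c <= '\x13FD' = shift 0x13F8 0x13F0
  | otherwise = c
  where
    -- Move c from the block starting at 'from' to the block starting at 'to'.
    shift :: Int -> Int -> Char
    shift from to = chr (ord c - from + to)

main :: IO ()
main = do
  -- U+AB82 (\43906) folds to U+13B2 (\5042).
  print (foldCherokee '\xAB82' == '\x13B2')
  -- U+13B2 is already folded, so it stays put.
  print (foldCherokee '\x13B2' == '\x13B2')
  -- Folding twice is the same as folding once, for every Char.
  print (all (\c -> foldCherokee (foldCherokee c) == foldCherokee c)
             [minBound .. maxBound])
```

All three lines print True. The point is that the folded form is a fixed point: once a character has been folded, folding it again changes nothing, which is exactly what the fallback to lowercase breaks.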
It gets weirder:
*Wire.API.User.RichInfo Scim Data.CaseInsensitive> foldCase ("Ꮊ" :: String)
"\43914"
*Wire.API.User.RichInfo Scim Data.CaseInsensitive> foldCase it
"\5050"
*Wire.API.User.RichInfo Scim Data.CaseInsensitive> foldCase it
"\43914"
[...]
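The flip-flopping above can be hunted down exhaustively rather than waiting for QuickCheck to stumble on it. This is a rough sketch using only base: nonIdempotent and swapPair are hypothetical names, and swapPair is just a stand-in that models the reported behaviour for one pair of characters. To test the real thing, the assumption is that you would pass Data.CaseInsensitive.foldCase (at type String -> String) instead.

```haskell
module Main (main) where

-- | Every character c for which folding the one-character string [c]
-- twice gives a different result than folding it once.
-- Works on strings because a fold may expand one character into several.
nonIdempotent :: (String -> String) -> [Char]
nonIdempotent f = [c | c <- [minBound .. maxBound], f (f [c]) /= f [c]]

-- | Hypothetical model of the behaviour reported above: U+13BA (\5050)
-- and U+AB8A (\43914) map to each other, everything else is unchanged.
swapPair :: Char -> Char
swapPair '\5050'  = '\43914'
swapPair '\43914' = '\5050'
swapPair c        = c

main :: IO ()
main =
  -- Lists both characters of the swapped pair.
  print (nonIdempotent (map swapPair))
```

With the model function this reports exactly the two characters that swap. What it reports for the real foldCase depends on the versions of text and case-insensitive in use, so no output is claimed for that case here.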