case-insensitive icon indicating copy to clipboard operation
case-insensitive copied to clipboard

Unicode Case Folding

Open isomarcte opened this issue 3 years ago • 12 comments

While working on cats-uri, I ran into an issue with how CIString was handling certain unicode values which led me to notice it wasn't respecting Caseless matching from the Unicode standard. As it turns out, neither does String.equalsIgnoreCase.

I'd just about completed a branch to implement full case folding as defined by the Unicode standard when I ran across this test.

  test("character based equality") {
    assert(CIString("ß") != CIString("SS"))
  }

Since under the Unicode standard's caseless matching these two strings would compare equal, I'm beginning to think we are intentionally not following the standard here. Is that the case? If so, why? Is it to maintain parity with what the Java standard library is doing with methods like equalsIgnoresCase?

isomarcte avatar Feb 05 '22 20:02 isomarcte

I was curious and it seems that JS may support this out-of-the-box. So at least we wouldn't have to copy those tables in 😅

"ß".localeCompare("SS", 'en', {sensitivity: 'base'}) // 0

https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/localeCompare

armanbilge avatar Feb 05 '22 20:02 armanbilge

Ah, here's the equivalent with JDK.

import java.text.Collator
val coll = Collator.getInstance
coll.setStrength(Collator.PRIMARY)
coll.compare("ß", "SS") // 0

armanbilge avatar Feb 05 '22 20:02 armanbilge

@armanbilge I actually already have the lookup tables: https://github.com/isomarcte/case-insensitive/blob/full-unicode-case-folding/core/src/main/scala/org/typelevel/ci/CaseFolds.scala#L23

It only took a few minutes to get emacs to transform the text file into that.

That said, I'm happy to use existing JS/JRE based implementations, but I think we need to actually have a case folded string: https://github.com/isomarcte/case-insensitive/blob/full-unicode-case-folding/core/src/main/scala/org/typelevel/ci/CIString.scala#L43

I didn't see anyway to get that from the Collator. We need the folded string to get a safe/correct implementation of .hashCode for CIString.

isomarcte avatar Feb 05 '22 22:02 isomarcte

Thanks, yes, I actually saw your lookup tables which is what prompted me to look for any stdlib solutions for this ... at least for JS, this is just one more thing to bloat the generated code 🙃

You're right though, if there's no way to get a folded string then your lookup tables are the way to go 👍

armanbilge avatar Feb 05 '22 23:02 armanbilge

I've only poked around in the JRE stdlib. I can check around in JS land to see if we can get a case folded string there and then maybe lighten the generated code and use the lookup tables on the JRE.

isomarcte avatar Feb 05 '22 23:02 isomarcte

No worries, you can stay focused on doing awesome work in cats-uri and I'll worry about the JS stuff. Correctness/functionality first, optimizations second :)

armanbilge avatar Feb 05 '22 23:02 armanbilge

~~@isomarcte what about:~~

scala> "ß".toLowerCase.toUpperCase == "SS".toLowerCase.toUpperCase
val res9: Boolean = true

~~Does this do what we want?~~

Edit: ignore me :)

armanbilge avatar Feb 06 '22 00:02 armanbilge

It looks like you've implemented "full" case folding in #229. Is what we have currently consistent with "simple" case folding? (I could probably answer this myself...)

This would be a semantic breaking change, even if it's binary compatible. If we proceed, should this be 2.0?

I wonder whether our current implementation varies by Java version. JDK8 supported Unicode 6.2, and Java 17 supports Unicode 13.0. It looks like your proposed algorithm guarantees stability, but only if we also normalize to NKFC?

For each string S containing characters only from a given Unicode version, toCasefold(toNFKC(S)) under that version is identical to toCasefold(toNFKC(S)) under any later version of Unicode.

rossabaker avatar Feb 06 '22 03:02 rossabaker

Ugh, there are four definitions of this.

  • Default caseless matching
  • Canonical caseless matching
  • Compatibility caseless matching
  • Identifier caseless matching

rossabaker avatar Feb 06 '22 04:02 rossabaker

It seems you're targeting the Default definition, which seems as good a starting point as any.

rossabaker avatar Feb 06 '22 04:02 rossabaker

This would be a semantic breaking change, even if it's binary compatible. If we proceed, should this be 2.0?

I'm not sure. It is a semantic change, but most users will be unaffected. What we currently implement appears to be semantically equivalent to the simple case folding rules + rules for some Turkic languages with no normalization.

The Turkic rules are supposed to off by default. That is, this should yield false.

scala> val a: Char = 0x0049.toChar
val a: Char = I

scala> val b: Char = 0x0131.toChar                                                                                      
val b: Char = ı

scala> CIString(a.toString) == CIString(b.toString)
val res0: Boolean = true

I can probably adapt my PR to provide both explicit simple caseless matching and full caseless matching, and then maybe we can give users the option. Let me take a look. I also want to look into the normalization as well.

isomarcte avatar Feb 06 '22 14:02 isomarcte

@rossabaker @armanbilge take a look here for an ~insane~ alternative approach which handles the different normalization and folding rules more explicitly and doesn't break the semantics of CIString.

isomarcte avatar Feb 06 '22 18:02 isomarcte