case-insensitive Implement (Almost) All The Unicode Caseless Matching Systems

The exception is identifier caseless matching because the normalization operations are not natively supported by java.text.Normalizer and would require even more code.

This change introduces a staggering 12 types of case folded strings. These break down to simple/full case folded strings, each with a default and Turkic variant, and then for each of those we have default, canonical, and compatibility normalization/folding.

CIString is now deprecated and defers to CIStringS, which is an input remembering caseless string using default Unicode caseless matching on simple case folded strings. This is the most similar form to what CIString was doing before, though I think most users will actually want to use CIStringCF which is canonical Unicode caseless matching on full case folded strings.

Note: There is still quite a bit of work to do on this branch. Tests/Scaladoc/typeclass instances come to mind, but I wanted to get it up for initial thoughts before I did all that.

This is an alternative to #229. There are some obvious cons here.

A major Con is that there are so many types. I don't know that we'd need/want to add CIString variants for each of the types of case folded strings, but even if we just have the types introduced thus far we've gone from 1 CIString to 15 Folded/Caseless strings. That said, this is just what the reality of the Unicode standard dictates. There are a few more succinct encodings I toyed with, but they have some negative code complexity tradeoffs.

An alternative is to just have a single CaseFoldedString which is the result of applying one of the folding algorithms and one of the normalization algorithms. This is what I did in #229, but I'm very wary of it as we lose information about what happened. Two case folded strings could be equal but not really be equivalent because they we derived from different folding/normalization. That's why, despite the verbosity, the independent types seemed like the right choice.

But then again :shrug: . I may have just lost my mind after staring at the Unicode standard this long. :rofl:

Feb 06 '22 18:02 isomarcte

Tests are expected to fail right now. This branch currently still has the tests I added for CIString to debug/validate the full case folding behavior on CIString. Since CIString is now explicitly using default case less matching with simple folding, it shouldn't pass default case less matching with full folding.

If we move forward here, I'll fix all that up.

Feb 06 '22 18:02 isomarcte

Phew! 😅 Some assorted thoughts:

CIString is now deprecated and defers to CIStringS, which is an input remembering caseless string using default Unicode caseless matching on simple case folded strings.

If CIString is deprecated anyway, why not just let it retain its old implementation, for sake of compatibility? Unless I misunderstood and that is what it was doing.

That's why, despite the verbosity, the independent types seemed like the right choice.

Static types are also better for Scala.js, since the optimizer can detect unused types and elide the relevant code. If this is implemented dynamically, then it no longer can do that. Although not sure if it this will make much difference in practice 😅

I think the hardest thing to stomach is that as a user or library author it's no longer obvious what the "right" choice is. I'm worried that if libraries don't sync up on this, then there will be fragmentation between APIs.

Feb 06 '22 21:02 armanbilge

If CIString is deprecated anyway, why not just let it retain its old implementation, for sake of compatibility? Unless I misunderstood and that is what it was doing.

My reasoning is that while I think the current implementation of CIString is actually Unicode default caseless matching with simple folding + the turkic rules, I'm not 100% sure on that. It might not actually be exactly any of the 16 different rules. Also, the standard states that the turkic rules should be omitted by default, but the current implementation of CIString includes them.

I'm okay just deprecating though if that is preferred.

I think the hardest thing to stomach is that as a user or library author it's no longer obvious what the "right" choice is. I'm worried that if libraries don't sync up on this, then there will be fragmentation between APIs.

Yeah, I agree. If we keep the other types around I think we should update the github site and basically all documentation to point people towards CIStringCF/CanonicalFullCaseFoldedString unless they know they specifically want something else since the Unicode standard states this will yield the "most correct results". (I'm trying not to freak out too much that we are dealing in ranges of correctness here...)

I on the fence about having the full set of types proposed here, but I'm trying not to be overly biased towards my specific use case. Since they are all in the standard, it's reasonable to think someone might want them eventually. They do have some nice properties, like SimpleCaseFoldedString won't change the length of the String and TurkicSimpleCaseFoldedString is actually the most similar to String.equalsIgnoreCase.

I'm willing to be talked out of this though.

I also hate the names CIStringCF, CIStringCS, and CIStringS, but I was at a lose for a better scheme.

Feb 06 '22 23:02 isomarcte

@rossabaker

This seems like mostly a superset of the other one, so it's probably better to discuss the design there.

This one drops the generic CaseFoldedString for separate types. I think CaseFoldedString may be dangerous.

Feb 07 '22 13:02 isomarcte

@rossabaker @armanbilge I've pushed a new commit which undeprecates and changes CIString to be Unicode canonical caseless matching with full case folding as discussed here.

~See: https://github.com/typelevel/case-insensitive/pull/232/commits/b376e37a65ebd43fac9b8007c9f30dbd6c790fa3~

I ran into a different issue however. The unapplySeq for the ci StringContext is unsound on two counts. The first is that it relies on the length of the CIString, but that is ambiguous under full case folding. The second is that it is using case insensitivity on Char values directly. I don't think you can actually apply any of the Unicode caseless matching systems directly to a Char.

I think we should deprecate it. :slightly_frowning_face:

Edit: I linked the wrong commit, https://github.com/typelevel/case-insensitive/pull/232/commits/619b36ff58abd5c318c59475e13c7ce46c7060f5

Feb 07 '22 14:02 isomarcte

Hmm. Putting aside the semantic changes, all these deprecations to length, unapply, etc. make this seem much more like a breaking change. Like full case folding is fundamentally incompatible/unsound with the API currently exposed.

Feb 07 '22 15:02 armanbilge

Yeah, it's not great. I'm still open to suggestions on what to do here. We could provide an alternative to to the ci syntax, but still not great.

Feb 07 '22 16:02 isomarcte

Right, and that wouldn't help http4s anyway. Seems to me like we can either swallow this pretty breaking change on both the semantic and API level or put this into 2.0.0. IIUC this came up while working on cats-uri which itself is targeting http4s 1.0 I assume.

At the very least, starting a 2.x branch could unblock you for now, and we could think about strategies to get this into 1.x? /shrug

Feb 07 '22 16:02 armanbilge

My concern with not changing this until http4s 1.x.x is that what CIString does right now is (probably) equivalent to simple case folding with Turkic rules and default caseless matching. When that disagrees with canonical caseless matching with full folding, the result is a bug in the http4s context. Are we okay with that on 0.22.x or 0.23.x?

I'm not sure what the full impact would be when they disagree. In a Uri for example, those would be cause two equivalent regNames to be treated as non-equivalent. Or, in the case of dotless I characters, our current CIString would cause two non-equivalent regNames to compare equal. This might get blocked by DNS rules, DNS adds a bunch of rules on top of RFC-3986, but I'm not sure. That's just one example, I'm sure there are others.

I mean...I get it...we've no great options and big source breaking changes suck...I'm just thinking out loud here.

At the very least, starting a 2.x branch could unblock you for now, and we could think about strategies to get this into 1.x?

Yeah, I'll work towards that in the meantime.

Feb 07 '22 16:02 isomarcte

I mean...I get it...we've no great options and big source breaking changes suck...I'm just thinking out loud here.

Yep me too 😅

Definitely agree that it's a bug in http4s, but I can't begin to understand how serious it is. Defer to Ross on next steps :)

Feb 07 '22 16:02 armanbilge

@rossabaker I'm (finally) circling back here. Do you have any thoughts?

I suspect there are some very deep serious side effects of the incorrect case-folding, but that they are quite rare. Given our user base doesn't seem to be complaining about them right now in http4s, I think we could pivot for version 2 of this library and hope any such invalid behavior is rare enough that we don't see it on 0.23.x of http4s.

Thoughts?

Apr 18 '22 16:04 isomarcte

I have severe apprehensions about introducing a 2.x version. The scala-xml upgrade has been a shit show, not so much due to actual incompatibility, and more because it's setting off alarms in SBT and getting casually upgraded in patch releases downstream because it was advertised as "compatible enough". We risk a similar shitshow as soon as we introduce a 2.x here.

Eyeballing this after several months away, it seems like it's binary compatible, and I didn't set off any MiMa alarm bells. Am I correct that this a purely semantic change as written?

I would also like to consider benchmarks before changing the core type used in http4s. It seems like this is heavier, but it probably doesn't matter in the context of an IO application? It's worth considering, given that this library is well established with one consumer that actually matters.

Jan 04 '23 04:01 rossabaker

Eyeballing this after several months away, it seems like it's binary compatible, and I didn't set off any MiMa alarm bells. Am I correct that this a purely semantic change as written?

I've been distracted by work in https://github.com/typelevel/idna4s and at $WORK, but IIRC this can be made bincompat with the possible exception of the globbing matcher. See https://github.com/typelevel/case-insensitive/issues/297 which I just hastily wrote up.

I would also like to consider benchmarks before changing the core type used in http4s. It seems like this is heavier, but it probably doesn't matter in the context of an IO application? It's worth considering, given that this library is well established with one consumer that actually matters.

Definitely.

Jan 04 '23 14:01 isomarcte

See also my comment in https://github.com/typelevel/case-insensitive/pull/232#issuecomment-1031586847. While we can do this bin-compatibly, it feels like a technicality since so much would be changing. It may be worthwhile however.

It's worth considering, given that this library is well established with one consumer that actually matters.

I recently adopted it in Natchez as well.

Jan 04 '23 14:01 armanbilge