
`std.uni.icmp` does not compare `ß` or `ẞ` correctly against `SS`

Open jmdavis opened this issue 8 months ago • 7 comments

Okay, as I understand it, ß and ẞ are German letters, with ß being equivalent to ss and ẞ being equivalent to SS, though ẞ was apparently only officially recognized in 2017.

The current Unicode spec seems to consider the standard case mappings to be that "ß".toUpper() == "SS" and "ẞ".toLower() == "ß", making it so that they're not covariant (or whatever the correct term is) - i.e. you can't do foo.toLower().toUpper() or foo.toUpper().toLower() and always expect to get foo back.
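The lack of a round trip can be checked directly. This is a minimal sketch, assuming the string overloads of std.uni.toUpper and std.uni.toLower apply the standard mappings described above (which is what they appear to do):

```d
import std.uni : toLower, toUpper;

void main()
{
    // Standard case mappings as described above.
    assert("ß".toUpper() == "SS");
    assert("ẞ".toLower() == "ß");

    // Neither round trip gets the original string back.
    assert("ß".toUpper().toLower() == "ss");
    assert("ẞ".toLower().toUpper() == "SS");
}
```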

std.uni.toLower and std.uni.toUpper seem to follow that rule correctly. The problem is std.uni.icmp.

std.uni.icmp's documentation and tests make sure that the lowercase single-letter version compares against ss correctly, but they don't appear to check the uppercase version (though it's a little hard to tell given how similar the two letters look), and they don't test against SS. And when doing a simple test against SS, icmp fails with both the lowercase and uppercase versions.

    import std.uni : icmp;

    void main()
    {
        assert(icmp("ß", "ss") == 0);
        assert(icmp("ß", "SS") == 0); // fails
        assert(icmp("ẞ", "ss") == 0);
        assert(icmp("ẞ", "SS") == 0); // fails
    }

So, for whatever reason, ss is handled correctly, but SS is not.

I haven't read through the Unicode spec in painstaking detail on this, but it is quite clear that SS is supposed to be considered the uppercase version of ß and equivalent to ẞ, so I don't think that there's much question that what icmp is doing with regard to SS is wrong, given that it's explicitly supposed to be handling these cases where a single letter can be the same as two letters (whereas sicmp doesn't handle that case).

On casing: https://www.unicode.org/versions/Unicode16.0.0/core-spec/chapter-4/#G124722

From https://www.unicode.org/charts/PDF/U0080.pdf

00DF ß LATIN SMALL LETTER SHARP S
= Eszett
• German
• not used in Swiss High German
• uppercase is “SS” (standard case mapping),
alternatively 1E9E ẞ
• typographically the glyph for this character can
be based on a ligature of 017F ſ with either
0073 s or with an old-style glyph for 007A z
(the latter similar in appearance to 0292 ʒ ).
Both forms exist interchangeably today.
→ 017F ſ latin small letter long s
→ 0292 ʒ latin small letter ezh
→ 03B2 β greek small letter beta
→ 1E9E ẞ latin capital letter sharp s
→ A7B5 ꞵ latin small letter beta
→ A7D7 ꟗ latin small letter middle scots s

From https://www.unicode.org/charts/PDF/U1E00.pdf

Addition for German typography
The capital letter sharp s is part of the official German
orthography since 2017. Along with "SS" it is an allowed
variant spelling of 00DF in "all caps" style.
1E9E ẞ LATIN CAPITAL LETTER SHARP S
• not used in Swiss High German
• lowercase is 00DF ß
→ 00DF ß latin small letter sharp s
→ A7D6 Ꟗ latin capital letter middle scots s

jmdavis avatar Mar 27 '25 07:03 jmdavis

The uppercase ß is a strange thing. No word in German starts with it. There are also no words that start with two uppercase characters. So even if a word like ßoll existed (and yes, I can't even regularly type an uppercase ß on my German keyboard), the word SSoll would be a spelling mistake.

As far as I understand, it is only a typographic matter that comes up when writing all-uppercase words that include ß characters, e.g. GROß just looks strange.

IMO follow the Unicode advice.

P.S. In 1996 Germany had a Rechtschreibreform (spelling reform) which changed the usage of ß a lot, but as it's quite recent, many people use ß more than is technically correct.

burner avatar Mar 27 '25 08:03 burner

Okay I thought I had explained this in a bug ticket but it doesn't look like it.

What icmp is doing is called case folding, where the cases are converted into a single consistent set of characters. You are either in simple mode, where one dchar equals one dchar, which is what you are noting as working.

On the other hand, there is language-specific case folding, which uses variable-length mappings from one dchar to one or more dchars. This requires a new table to be generated, and one that is not simple at all. ~~On top of this you need some global state of what the default language should be.~~

Here is the table: https://www.unicode.org/Public/16.0.0/ucd/CaseFolding.txt

Up till now I have been of the belief that icmp was just missing the language-specific stuff, but if it's not case folding the argument being compared against, then no, it's going to need a full rewrite. Now that I am looking at it, yes, it does appear to only fold the input side.
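To illustrate what folding both sides means, here is a rough sketch. The helpers `foldSharpS` and `equalFolded` are hypothetical and not part of Phobos, and only the two full-folding entries for ß and ẞ from CaseFolding.txt are hard-coded. Everything else is approximated with toLower, which is not the same as real case folding in general:

```d
import std.array : replace;
import std.uni : toLower;

// Hypothetical helper, not part of Phobos: approximates full case folding
// for the two characters discussed in this issue only.
string foldSharpS(string s)
{
    // CaseFolding.txt full ("F") entries:
    //   00DF; F; 0073 0073  and  1E9E; F; 0073 0073
    // All other characters fall back to toLower (an approximation).
    return s.replace("ẞ", "ss").replace("ß", "ss").toLower();
}

// Hypothetical comparison: fold BOTH arguments, then compare.
bool equalFolded(string a, string b)
{
    return foldSharpS(a) == foldSharpS(b);
}

void main()
{
    assert(equalFolded("ß", "ss"));
    assert(equalFolded("ß", "SS"));
    assert(equalFolded("ẞ", "ss"));
    assert(equalFolded("ẞ", "SS"));
    assert(equalFolded("ß", "ẞ"));
    assert(!equalFolded("ß", "s"));
}
```

Because both arguments go through the same fold, it no longer matters which side holds the single-letter form and which holds the two-letter form.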

rikkimax avatar Mar 27 '25 10:03 rikkimax

> e.g. GROß just looks strange.

Nevertheless, it’s correct. AFAIK the traditional spelling (i.e. GROSS) is only considered an alternative form nowadays.

0xEAB avatar Mar 27 '25 23:03 0xEAB

> though ẞ was apparently only officially recognized in 2017.

Correct.

Translation of https://www.rechtschreibrat.com/DOX/rfdr_PM_2017-06-29_Aktualisierung_Regelwerk.pdf:
By permitting the capital letter “ẞ”, [the changes] create an option that allows retaining the “ß” when writing in capitals besides the spelling with “SS”.

0xEAB avatar Mar 27 '25 23:03 0xEAB

> > e.g. GROß just looks strange.
>
> Nevertheless, it’s correct. AFAIK the traditional spelling (i.e. GROSS) is only considered an alternative form nowadays.

Wikipedia claims that it was decided in 2024 that the single-letter version would be preferred, so assuming that that's correct, it's a recent decision. I have no clue whether the next Unicode spec will be changed to reflect that or not, but either way, I think that the key thing that we need to worry about is whatever the Unicode spec says to do. If there are places where it doesn't specify, then we'll have to make a decision, but from what I can tell, the single-letter uppercase version is considered one of the alternates by the current Unicode spec (which also came out in 2024, so it likely wasn't affected by whatever the German typography guys decided), whereas SS is considered the main version.

But regardless of which is the main version and which is an alternative, if the Unicode spec says that they're equivalent, then icmp should be treating them as equivalent, given that it's supposed to be taking the multi-character versions into account, and for whatever reason, it only partially does that right now. Rikki seems to understand what's going on far better than I do though. I've dealt with std.utf quite a bit in the past, but I haven't ever really dug into std.uni. As it is, I only ran into this issue because it came up at work, and so I did some digging into the Unicode spec to see what the correct behavior was supposed to be vs what Phobos was doing.

jmdavis avatar Mar 28 '25 00:03 jmdavis

We don't need to figure out what to do here. What Unicode says goes.

The moment we go against the experts on the subject and create our own tables and algorithms against what is described it'll be hell long term.

rikkimax avatar Mar 28 '25 01:03 rikkimax

> We don't need to figure out what to do here. What Unicode says goes.
>
> The moment we go against the experts on the subject and create our own tables and algorithms against what is described it'll be hell long term.

Yeah, as long as Unicode gives precise enough rules that we don't need to decide at any point how we want to handle it, then I think that it's clear that we should just follow the standard. My point was that if the Unicode standard isn't precise enough on something that we have to make a decision, then we'll have to make a decision. But ideally, we don't need that.

Regardless, if the Unicode standard says something, I see no reason to do something else. That's just begging for pain.

jmdavis avatar Mar 28 '25 03:03 jmdavis