
add hascodepoint(c::AbstractChar) and use it

Open stevengj opened this issue 1 year ago • 16 comments

This PR adds and exports a new function hascodepoint(c::AbstractChar) which returns whether or not codepoint(c) will succeed.

It then uses this function before several calls to codepoint(c) or UInt32(c) instead of !ismalformed(c), which was incorrect because it didn't handle the isoverlong(c) case. Fixes #54343.

(Potentially we could also have an unsafe_codepoint(c) function that assumes hascodepoint(c) == true in order to optimize conversions in cases where hascodepoint(c) has already been checked.)
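For concreteness, the proposed semantics can be sketched in terms of existing Base internals (this one-liner is an assumed equivalent, not the PR's actual implementation, and `hascodepoint_sketch` is a placeholder name):

```julia
# Sketch of the proposed predicate: codepoint(c) currently succeeds
# exactly when c is neither malformed nor overlong.
hascodepoint_sketch(c::AbstractChar) =
    !Base.ismalformed(c) && !Base.isoverlong(c)
```

So `hascodepoint_sketch('a')` and `hascodepoint_sketch('\ud800')` are `true`, while `hascodepoint_sketch('\xc0\x80')` (overlong) and `hascodepoint_sketch('\xff')` (malformed) are `false`.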

stevengj avatar May 07 '24 17:05 stevengj

I would like more documentation of the distinction between isvalid and hascodepoint. What is a "valid" char if not "hascodepoint"?

jariji avatar May 07 '24 18:05 jariji

Here are two examples where hascodepoint(c) returns true (and codepoint(c) succeeds), but for which isvalid(c) returns false:

c = '\U110000' has a codepoint 0x00110000 but is an invalid Unicode character because codepoints are required by the Unicode standard to be < 0x00110000, now and forever, in order to ensure that codepoints can be represented in UTF-16.

c = '\ud800' has a codepoint U+D800 but is an invalid Unicode character because it is reserved for use in the UTF-16 encoding as part of a surrogate pair.
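Both cases can be checked at the REPL (writing the too-high code point with byte escapes, which is the same value as '\U110000'):

```julia
# Surrogate: decodes to a code point but is not valid Unicode
@show isvalid('\ud800')              # false
@show codepoint('\ud800')            # 0x0000d800

# Too-high code point 0x00110000, written as its raw UTF-8-style bytes
@show isvalid('\xf4\x90\x80\x80')    # false
@show codepoint('\xf4\x90\x80\x80')  # 0x00110000
```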

None of this behavior is new in this PR, and better documentation of isvalid should be a separate issue, but I added some information on this to the hascodepoint docs.

stevengj avatar May 07 '24 18:05 stevengj

It feels a little messy that we don't allow getting the code point of an overlong encoding, but I'm sure we had a discussion about that and decided it was too dangerous. It's tempting to make this a keyword modification to isvalid but given that we already have a variety of these predicates, perhaps this one is also good to have.

StefanKarpinski avatar May 07 '24 20:05 StefanKarpinski

I think the thing that bothers me is that I think of potential string data falling into three broad classes:

  1. Valid. String data that is fully kosher: encoding is sensible and code points that are encoded are good.

  2. Well-formed but invalid. Respects the encoding structure, but what is encoded is no good. For some encodings, like Latin-1, this doesn't exist: every possible sequence of bytes is valid Latin-1. For UTF-16 this is only unpaired surrogates. For UTF-8 this includes: overlong encodings, surrogate code points (since they're reserved for use in UTF-16 and should never appear in UTF-8), and too-high code points.

  3. Malformed. Doesn't even respect the general structure of the encoding; it's just junk. For some encodings, like Latin-1 or UTF-16, this doesn't exist, because any possible sequence of code units encodes a code point sequence. For UTF-8 there are a few ways this can happen: a) a byte with five or more leading 1 bits, b) an unexpected continuation byte, c) a lead byte followed by too few continuation bytes.

Because of the decision to have codepoint(c) return an error for overlong encodings, hascodepoint(c) cuts across these classes in a messy way: it returns true for all of class 1 (valid) and false for all of class 3 (malformed), but it's a mix for class 2 (well-formed but invalid)—false for overlong encodings, true for the rest. Maybe this is fine, but that's where my misgivings come from.
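The three classes can be made concrete with a small helper (`charclass` is a hypothetical name, not a Base function; it leans on the fact that `isvalid` is false for both class 2 and class 3, while `Base.ismalformed` is true only for class 3):

```julia
# Hypothetical classifier for the three classes described above.
charclass(c::AbstractChar) =
    Base.ismalformed(c) ? :malformed :
    isvalid(c)          ? :valid :
    :wellformed_invalid

# The proposed hascodepoint is true for all of :valid, false for all of
# :malformed, but mixed within :wellformed_invalid:
# overlong '\xc0\x80' -> false, surrogate '\ud800' -> true.
```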

StefanKarpinski avatar May 07 '24 21:05 StefanKarpinski

@StefanKarpinski, so one alternative would be to have codepoint(c) return a value for overlong encodings. In that case we can remove hascodepoint and should probably export ismalformed and document that codepoint should succeed for any non-malformed c.

I'm fine with this, though I'm not sure what the other implications would be. It doesn't seem like it would be too breaking since it would take a case that currently throws and make it non-throwing.

stevengj avatar May 08 '24 00:05 stevengj

one alternative would be to have codepoint(c) return a value for overlong encodings. In that case we can remove hascodepoint and should probably export ismalformed and document that codepoint should succeed for any non-malformed c.

If the encoding is still unique, I like normalizing in codepoint. It's IMO better to silently fix potentially broken code, when it could unambiguously do the right thing, than to force modifications by adding hascodepoint.

Of course, if the overlong encoding is not unique (this should never be the case, right? Any overlong Char must map to one and only one codepoint, right?), we'll still have to throw either way.

Seelengrab avatar May 08 '24 05:05 Seelengrab

To play devil's advocate against myself: I think the worry about overlong encodings is that someone might sneak a code point that isn't supposed to be there past some filter and then cause trouble later. The classic example is using '\xc0\x80' to sneak code point U+0000 into null-terminated strings. This is done on purpose in Modified UTF-8, for example.

To try to steelman this, I'm imagining someone writing code that doesn't call isvalid, examines each character's code point by calling codepoint(c), and assumes that if the string is invalid, one of those calls will error. However, that is already not the case, since codepoint('\ud800') and codepoint('\xf4\x90\x80\x80') both return values rather than erroring. Currently, the only kind of invalidity that approach would catch is overlong encodings, and I'm not at all clear on why overlong encodings are uniquely dangerous. It seems fine to me to have isvalid('\xc0\x80') == false and codepoint('\xc0\x80') == 0. In that case we would simply have ismalformed as the predicate that distinguishes whether something encodes a code point or not. The only case I can see where there would be a problem is code that checks for a specific character and then later assumes that character's code point cannot occur, with that assumption being false due to overlong encodings. Is that dangerous enough to really warrant making everything messier?
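For reference, the current behavior around the '\xc0\x80' example (as of this discussion, the overlong case is the only one where codepoint throws):

```julia
@show isvalid("\xc0\x80")            # false: overlong NUL is invalid UTF-8
@show "\xc0\x80" == "\0"             # false: strings compare by code units

@show codepoint('\ud800')            # 0x0000d800 (surrogate decodes)
@show codepoint('\xf4\x90\x80\x80')  # 0x00110000 (too-high decodes)
# codepoint('\xc0\x80') throws: the overlong case is the odd one out
```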

StefanKarpinski avatar May 08 '24 13:05 StefanKarpinski

I'm leaning towards it's fine for codepoint(overlong) to return the encoded code point instead of erroring, and then we can just export and document ismalformed and use !ismalformed instead of hascodepoint.
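The decoding itself is straightforward if overlong forms are allowed, since a Char stores its UTF-8-like bytes left-aligned in a UInt32. A hedged sketch (`decode_nonmalformed` is a hypothetical name, not the PR's code; it assumes `!Base.ismalformed(c)` and simply strips the UTF-8 framing bits, so overlong forms decode to their intended code point):

```julia
# Hypothetical decoder for any non-malformed Char, including overlong
# encodings: mask off the UTF-8 length/continuation bits by hand.
function decode_nonmalformed(c::Char)
    @assert !Base.ismalformed(c)
    u  = reinterpret(UInt32, c)          # leading byte in the high octet
    b1 = (u >> 24) % UInt8
    b1 < 0x80 && return UInt32(b1)                        # 1-byte form
    b2 = (u >> 16) % UInt8; b3 = (u >> 8) % UInt8; b4 = u % UInt8
    if b1 < 0xe0                                          # 2-byte form
        return (UInt32(b1 & 0x1f) << 6) | (b2 & 0x3f)
    elseif b1 < 0xf0                                      # 3-byte form
        return (UInt32(b1 & 0x0f) << 12) | (UInt32(b2 & 0x3f) << 6) |
               (b3 & 0x3f)
    else                                                  # 4-byte form
        return (UInt32(b1 & 0x07) << 18) | (UInt32(b2 & 0x3f) << 12) |
               (UInt32(b3 & 0x3f) << 6) | (b4 & 0x3f)
    end
end
```

With this, `decode_nonmalformed('\xc0\x80')` gives 0x00000000 and `decode_nonmalformed('\xc0\xa0')` gives 0x00000020, while valid characters decode to their ordinary code points.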

StefanKarpinski avatar May 08 '24 14:05 StefanKarpinski

The only case I can see where there would be a problem is code that checks for a specific character and then later assumes that character's code point cannot occur and that assumption being false due to overlong encodings. Is that dangerous enough to really warrant making everything messier?

I don't know about dangerous, but that should be fixable too, no? We'd just have to make ==(::Char, ::Char) compare the code points instead of the bitwise representations (which is what happens now). That would make '\0' in my_str return true even if the null byte has an overlong encoding. It might be considered slightly breaking though, since people relying on the current behavior would have to switch to ===.

One caveat to that is that even with comparing codepoints in ==, if someone then looks at the raw bytes instead of the Char-iterator abstraction, they may be confused if they don't know about overlong encodings.

Seelengrab avatar May 08 '24 14:05 Seelengrab

We'll just have to make ==(::Char, ::Char) compare the codepoints instead of the bitwise representations, as is done now

I don't like that approach, since it will mean string1 == string2 is no longer equivalent to all(collect(string1) .== collect(string2)). (It will also slow down character comparisons.)
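The invariant in question, checked under the current bitwise comparison rules: both sides agree today precisely because Char equality does not normalize overlong forms.

```julia
s1, s2 = "\xc0\x80", "\0"
@show s1 == s2                          # false (code-unit comparison)
@show all(collect(s1) .== collect(s2))  # also false: same answer
```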

stevengj avatar May 08 '24 15:05 stevengj

We have a lot of character predicates (islower, category codes, etc) that call utf8proc with the code point if the character is not malformed. What should these do with overlong encodings of valid codepoints?

stevengj avatar May 08 '24 16:05 stevengj

I agree with @stevengj. Having regular::Char == overlong::Char seems very iffy to me. I do see the point that right now we have the property that two strings are equal if they encode the same sequence of code points.

StefanKarpinski avatar May 08 '24 19:05 StefanKarpinski

My reading of that is that there would still be a distinction between whether two strings encode the same sequence of (our) Char objects, which handle those cases distinctly, or the same sequence of (Unicode) characters (converted to UInt32), which may consider them to be the same or error on some of them (and that's without even looking at normalizations at the Unicode level).

vtjnash avatar May 08 '24 19:05 vtjnash

it will mean string1 == string2 is no longer equivalent to all(collect(string1) .== collect(string2)). (It will also slow down character comparisons.)

Yeah, that's certainly a bad tradeoff. Let's just stick to codepoint normalizing to non-overlong encoding then.

I do see the point that right now we have the property that two strings are equal if they encode the same sequence of code points.

Do we? IIRC we currently compare by memory contents, not encoded codepoints:

julia> "\xc0\x80" == "\0"
false

Seelengrab avatar May 08 '24 19:05 Seelengrab

@StefanKarpinski, do you have an opinion regarding my question about character predicates above?

e.g. should isspace('\xc0\xa0') == true, since this is an overlong encoding of ' ' == '\u0020'? Currently it returns false, but some other predicates throw an exception, e.g. islowercase('\xc0\xa0').

My inclination is that they should all return false, i.e. they should check ismalformed(c) | isoverlong(c), or simply isvalid(c) (though this does some extra checks that are redundant with utf8proc), before passing the codepoint to utf8proc.
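A guard along those lines might look like this (`safe_isspace` is a hypothetical wrapper for illustration, not a proposed API):

```julia
# Hypothetical guard: return false for anything that can't be safely
# handed to utf8proc, i.e. malformed or overlong chars.
safe_isspace(c::AbstractChar) =
    (Base.ismalformed(c) || Base.isoverlong(c)) ? false : isspace(c)
```

Unlike the current mix of behaviors, this returns false for the overlong space '\xc0\xa0' rather than throwing.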

How should overlong characters be displayed? Should they show the codepoint? Currently they display without showing the codepoint:

julia> '\xc0\xa0'
'\xc0\xa0': [overlong] ASCII/Unicode U+0020 (category Zs: Separator, space)

(I'm basically a bit confused about what people are supposed to do with overlong characters that justifies returning a codepoint for them.)

stevengj avatar May 09 '24 19:05 stevengj

I do feel that having code point predicates like isspace or islowercase return true for overlong encodings is iffy. My gut inclination leans towards throwing an error if the argument is not valid Unicode. There are two reasons in my mind for allowing codepoint(c) for well-formed but invalid characters:

  1. codepoint(c) already works for other invalid but well-formed characters, like surrogates and too-high code points. What makes overlong encodings so different from these?
  2. If we don't allow codepoint(c) for well-formed but invalid characters, we still should provide some way to get the code point. Doesn't it make the most sense for that way to be codepoint(c)?

I can see a case for making codepoint(c) strict by default but having codepoint(c, strict=false) which will decode anything for which ismalformed(c) is false, but in that case the strict version should IMO also error on surrogates and too-high code points.
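The strict variant described here could be as simple as the following sketch (`strict_codepoint` is a hypothetical name, and ArgumentError stands in for whatever error type Base would actually choose):

```julia
# Hypothetical strict variant: errors on surrogates and too-high code
# points as well, not just on overlong/malformed chars.
strict_codepoint(c::AbstractChar) =
    isvalid(c) ? codepoint(c) :
    throw(ArgumentError("not a valid Unicode character: $(repr(c))"))
```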

How should overlong characters be displayed? Should they show the codepoint? Currently they display without showing the codepoint:

julia> '\xc0\xa0'
'\xc0\xa0': [overlong] ASCII/Unicode U+0020 (category Zs: Separator, space)

Isn't that the code point shown there in the description?

StefanKarpinski avatar May 23 '24 14:05 StefanKarpinski