“Unicode character” should likely say “Unicode scalar value” in intro to lexical grammar
The lexical grammar is introduced with:
The tokenizer operates on a sequence of Unicode characters [UNICODE].
The Unicode standard does not define “Unicode characters” as far as I can tell, so that leaves “characters,” which it does define*, but the definitions (plural) don’t seem compatible with how the term is used in this context.
It seems this should probably say USVs. This can be inferred sorta because the string literal interpretation algorithm appears to assume that source text consumed by `string` is already known to be exclusively USVs. I don’t know the ins-and-outs of Perl 5.5.8 regular expressions**, but it seems like the grammar given for `string`, `/"[^"]*"/`, likely doesn’t preclude lone surrogates in itself, which implies that the “operates on a sequence...” statement was meant to establish USVs-only as a prior fact about the input.
* It goes into considerable detail in the “Characters, not glyphs” section of § 2.2 Unicode Design Principles and provides a glossary entry for “character”. Aside: is it really not possible to link to Unicode? PDFs all the way down :(
** Gave up on figuring this out because IIUC perl’s strings have observable encodings and character sets that impact how its regular expressions get interpreted — so it may be the case that there isn’t a single answer to the question “does `[^"]` match non-USV code points”.
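To make the “doesn’t preclude lone surrogates in itself” point concrete, here’s a quick illustration in JavaScript rather than Perl (so it’s only an approximation of the grammar’s regex, not the actual tokenizer): a string-literal pattern of that shape matches quoted text containing an unpaired surrogate just fine, so any USVs-only guarantee has to come from the statement about the input, not from the production.

```ts
// Approximation of the grammar's string production, /"[^"]*"/, as a JS regex.
// Illustration only; the spec's stated regex dialect is Perl 5.5.8.
const stringToken = /"[^"]*"/u;

// A "string literal" whose contents are a lone high surrogate (U+D800),
// i.e. a code point that is not a Unicode scalar value.
const input = '"\uD800"';

console.log(stringToken.test(input)); // true: the production alone doesn't reject it
```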
FYI - here's some related information from other places:
In Infra:
Code points are sometimes referred to as characters and in certain contexts are prefixed with "0x" rather than "U+".
In CSS:
To tokenize a stream of code points into a stream of CSS tokens input, repeatedly consume a token from input until an <EOF-token> is reached, pushing each of the returned tokens into a stream.
It seems that they have not ruled out surrogate code points, but I am not sure what the situation here is. See also Internationalization Best Practices for Spec Developers.
In CSS it's unambiguously defined in the section prior, 3.3, "preprocessing the input stream":
Replace any U+0000 NULL or surrogate code points in input with U+FFFD REPLACEMENT CHARACTER (�).
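If Web IDL wanted the same behavior, a preprocessing pass in that style is pretty small. A minimal sketch, assuming a hypothetical `preprocess` helper that isn’t in any spec:

```ts
// Sketch of a CSS-§3.3-style preprocessing step (not part of Web IDL):
// replace U+0000 NULL and surrogate code points with U+FFFD before tokenizing.
function preprocess(input: string): string {
  let out = "";
  for (const ch of input) {
    // for...of iterates by code point, so a lone surrogate arrives as a single "character"
    const cp = ch.codePointAt(0)!;
    const isSurrogate = cp >= 0xd800 && cp <= 0xdfff;
    out += cp === 0x0000 || isSurrogate ? "\uFFFD" : ch;
  }
  return out;
}
```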
In Infra, it's defining the abstract code point spec type, and as is also true in Unicode's terminology, surrogates are code points, but not scalar values. In places where Infra is referenced, there may be constraints or preprocessing steps that disallow or convert non-USV code points (or which permit them to pass through as-is) depending on what's appropriate in context.
Because Web IDL isn't a media type consumed at runtime on the web platform, it tends to be a little fuzzier sometimes about this level of stuff. The usual runtime interop issues that tend to reveal and create pressure to resolve ambiguities aren't in play for non-runtime aspects of the spec. (Gradually increasing explicitness is still helpful for folks like me and others who maintain Web IDL implementations outside of browser internals tho.)
I suggest that we make Web IDL reference Infra for these things and not Unicode directly.
It seems this should probably say USVs. This can be inferred sorta because the string literal interpretation algorithm appears to assume that source text consumed by string is already known to be exclusively USVs.
Great find. My first instinct was that it doesn't really matter since all Web IDL constructs are ASCII. But I guess this part of the spec does indeed assume USVs.
Note that unlike, say, CSS or HTML, Web IDL doesn't really have an "entry point" for parsing, and certainly not one that's web-observable. So the choice here is really a statement about valid Web IDL files, I guess? I.e. we're saying that if some Web IDL-consuming software gets a sequence of bytes which, when decoded*, contains unpaired surrogates, then the result of that software should be a parse error.
* "Decoded": not necessarily UTF-8, as we don't (and IMO shouldn't) state anywhere that .webidl
files must be UTF-8 encoded!
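For concreteness, here’s roughly what that could mean for consuming software; the check below is my own sketch of the “result should be a parse error” outcome, not anything the spec says today.

```ts
// Hypothetical consumer-side check, applied after decoding by whatever means:
// if the decoded text contains unpaired surrogates, treat it as a parse error.
function assertScalarValuesOnly(source: string): void {
  // String.prototype.isWellFormed() is false exactly when a lone surrogate is present.
  if (!source.isWellFormed()) {
    throw new SyntaxError("Web IDL parse error: source contains unpaired surrogates");
  }
}
```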
So the choice here is really a statement about valid Web IDL files, I guess?
Yep. It’s totally removed from any kind of observable web platform behavior. If curious, the background for why I noticed it is that I added new API surface to my parser accepting JS string values (to be interpreted as Web IDL source text). Previously it only read well-formed UTF-8 buffers, so USVs were a given, but now surrogates were something I had to consider*. At first I thought lone surrogates should probably pass through the lexer fine if appearing in `string`, `comment`, or `other` tokens given what’s written, but then I caught the implication for USV string literal interpretation.
* by “had to consider,” I mean I definitely did not have to consider it. i’m the only person using the parser in question and obviously I’m not gonna pass in any unpaired surrogates. but if i don’t invent problems, who will??
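In case it’s useful to anyone, one way to handle it is an up-front code-unit scan (a sketch, with a made-up helper name): find any unpaired surrogate before lexing and report where it is, rather than letting it reach string literal interpretation.

```ts
// Sketch: scan a JS string (UTF-16 code units) for unpaired surrogates and
// return the index of the first one, or -1 if the input is all scalar values.
function findLoneSurrogate(source: string): number {
  for (let i = 0; i < source.length; i++) {
    const unit = source.charCodeAt(i);
    if (unit >= 0xd800 && unit <= 0xdbff) {
      // High surrogate: only valid if immediately followed by a low surrogate.
      const next = source.charCodeAt(i + 1); // NaN at end of string, which fails the test below
      if (next >= 0xdc00 && next <= 0xdfff) {
        i++; // well-formed pair, skip the low half
      } else {
        return i;
      }
    } else if (unit >= 0xdc00 && unit <= 0xdfff) {
      return i; // low surrogate with no preceding high surrogate
    }
  }
  return -1;
}
```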
Well, should specs be encouraged to write IDL using non-ASCII characters? I guess maybe it could be useful in comments (though keep in mind the existence of https://trojansource.codes/CVE-2021-42574, even if malicious spec authors would normally be sabotaging our semantics, not our IDL), but surely we want string literals to use escapes rather than non-ASCII text? (What do you mean, there are no escape sequences?)