Encourage always-escaping ampersand character.
In the example highlighting ambiguities from missing semicolons on named character references, a "correct" encoding is provided, but that example makes no mention of the fact that the fragment was ambiguous precisely because the ampersand wasn't escaped.
This patch adds a clarifying note explaining how this situation is avoided by always escaping the ampersand.
- [ ] At least two implementers are interested (and none opposed):
- …
- …
- [ ] Tests are written and can be reviewed and commented upon at:
- …
- [ ] Implementation bugs are filed:
- Chromium: …
- Gecko: …
- WebKit: …
- Deno (only for timers, structured clone, base64 utils, channel messaging, module resolution, web workers, and web storage): …
- Node.js (only for timers, structured clone, base64 utils, channel messaging, and module resolution): …
- [ ] Corresponding HTML AAM & ARIA in HTML issues & PRs:
- [ ] MDN issue is filed: …
- [ ] The top of this comment includes a clear commit message to use.
(See WHATWG Working Mode: Changes for more details.)
As a side note, I overlooked adding my name to the list of contributors in my first submission.
I was surprised to find no recommendation about escaping & with character references anywhere in the HTML standard. The section this PR touches seems to encourage not escaping & if it is not ambiguous (bold mine):
Thus, the correct way to express the above cases is as follows:
<a href="?bill&ted">Bill and Ted</a> <!-- &ted is ok, since it's not a named character reference -->
<a href="?art&amp;copy">Art and Copy</a> <!-- the & has to be escaped, since &copy is a named character reference -->
I read this as if &amp;ted would be wrong in some way, since it isn't the correct way. However, it seems much simpler to me to escape the ampersand here as &amp;.
I would change this section to something like the following:
-<!-- &ted is ok, since it's not a named character reference -->
+<!-- "&ted" is ok because "ted" is not a named character reference.
+<!-- "&amp;ted" is equivalent and less error-prone because "&amp;" explicitly decodes to "&". -->
There is precedent for such a recommendation. Section 4.12.1.3 Restrictions for contents of script elements has a prominent note with an encoding recommendation:
The easiest and safest way to avoid the rather strange restrictions described in this section is to always escape an ASCII case-insensitive match for "<!--" as "\x3C!--", "<script" as "\x3Cscript", and "</script" as "\x3C/script" when these sequences appear in literals in scripts (e.g. in strings, regular expressions, or comments), and to avoid writing code that uses such constructs in expressions. Doing so avoids the pitfalls that the restrictions in this section are prone to triggering: namely, that, for historical reasons, parsing of script blocks in HTML is a strange and exotic practice that acts unintuitively in the face of these sequences.
Section 13.1.4 Character references seems like a good place to add a similar note. For example
[!NOTE] Where character references are allowed, it's a good idea to always encode & with its character reference &amp;. This prevents any ambiguity as to whether the & is part of a character reference or a literal &.
I would consider mentioning the most common characters that are useful to escape in different contexts, but the note about & seems particularly helpful.
https://html.spec.whatwg.org/multipage/syntax.html#character-references already requires this so I'm not sure we need to state it again in the parser section. Is the problem that the parser doesn't flag it?
Is the problem that the parser doesn't flag it?
I believe the problem here is that the illustrative example in the syntax-error section explicitly states that the correct way to produce HTML text containing & is to not escape it if what follows is not a legitimately-parsed character reference.
The example illustrates that a parser will correctly identify &ted as that raw string, but suggests that &ted is more appropriate than &amp;ted.
So basically this is just a confusing aspect for implementers and it seems like we could tweak the wording to maintain the demonstration of how these errors are handled without encouraging people to lean on syntax errors in cases where they produce the right output.
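The decoding behavior described above can be sketched with Python's standard library. This is only an approximation: `html.unescape` is documented as following the HTML5 rules for character references in text, and it does not model the separate rules for attribute values. The sample strings are taken from the spec example; the `samples` and `decoded` names are illustrative.

```python
from html import unescape

# The spellings discussed above, decoded as text content.
samples = ["?bill&ted", "?bill&amp;ted", "?art&copy", "?art&amp;copy"]

decoded = {source: unescape(source) for source in samples}

for source, text in decoded.items():
    print(f"{source!r} -> {text!r}")
```

Both `?bill&ted` and `?bill&amp;ted` decode to the same text, while `?art&copy` only survives intact when the ampersand is escaped.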
I see, this is part of https://html.spec.whatwg.org/multipage/introduction.html#syntax-errors.
We don't disallow &ted currently so unless we also change the HTML Writing requirements in some way I'd be a bit hesitant to change it in this one place.
@annevk thanks. I’m very open to trying out different ideas, but I think the spec is actually a bit vague on this.
already requires this
Unless I’m wrong, the spec does not require that & be escaped as &amp;, only that character references mixed with text must begin with & and be followed by the correct syntax.
However, if someone is authoring HTML and not intending to produce a character reference, a stray & is both properly decoded by the parser and not forbidden.
I think we all agree that the intention is to always escape & as &amp;, but in the nitty gritty, unless it’s hidden in some other section none of us have scoured up yet, it’s not explicitly normalized as such. The only reference we’ve been able to find that isn’t implied is the one in this PR, where the spec assertively states that it’s correct to omit the escaping.
I apologize for omitting the before/after screenshots, but I took a before shot and was waiting to add it to the description until I had the parser previews generated but then they never appeared and I forgot to upload the before-shot anyway. Here is the relevant context from the modified section.
That's what I'm saying as well though in my latest comment. The Writing section explicitly allows you to do this. So I don't want to accept this PR as-is, as it'll contradict the Writing section.
@zcorpan was involved in some of the details here and should probably weigh in.
sounds great, and I have no wish that this be as-is. in fact, I was hoping for further input because I myself struggled to figure out how best to represent it. @sirreal is the author of the original suggestion.
interestingly enough, the HTML 3 spec was clearer on this point, but that entire document comprises only a handful of ill-defined paragraphs 🙃
Because certain characters will be interpreted as markup, they should be represented by markup…for instance the character "&" must be represented by the entity &amp;.
I think it's worth considering switching to require escaped ampersands. The rules for when it's allowed are non-trivial and it's surprising that &ted is OK but &copy is not OK, or that the behavior is different between in data and in attribute values.
Always escape & is clear and easy to understand.
This was my position in 2007 also: https://lists.whatwg.org/pipermail/whatwg-whatwg.org/2007-September/012457.html
cc @hsivonen @sideshowbarker
Always escape & is clear and easy to understand.
This is what I'd really like to address with at least a recommendation in the HTML standard that & is best escaped where applicable.
@dmsnell linked to the HTML3 spec. HTML4 also makes a recommendation:
Authors should use "&amp;" (ASCII decimal 38) instead of "&" to avoid confusion with the beginning of a character reference (entity reference open delimiter).
Escaping & is something we understand implicitly and it's apparent in functions like PHP's htmlspecialchars or Python's html.escape.
An explicit recommendation in the standard about & escaping would be a service to web developers.
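As a small illustration of the "always escape" habit that those library functions encode, here is a sketch using Python's `html.escape`, which replaces `&` (along with `<`, `>`, and by default quotes) unconditionally rather than checking whether the following text happens to be a named character reference. The variable names are illustrative.

```python
from html import escape, unescape

raw = "?art&copy"

# escape() always rewrites "&", regardless of what follows it.
encoded = escape(raw)
print(encoded)

# Decoding the escaped form returns the original text exactly.
round_tripped = unescape(encoded)
print(round_tripped == raw)
```

Because the escaping is unconditional, the author never needs to know which names are in the character reference table.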
I think we should make it a parse error if we change this.
I think we should make it a parse error if we change this.
I would rather we don't. I say that because I don't actually want to implement an error or warning for this in the checker, despite whatever the spec may end up being changed to say here. I don't think it will actually be good for users to be getting new errors or warnings from the checker about this.
But if it's made an actual parse error in the spec, I would somewhat be forced into it, regardless — because for errors from the HTML parser, the checker basically just bubbles all those up as-is.
That said, I would also not personally implement a parse error for it in the HTML parser sources. But there's nothing that would prevent any other contributor (or code owner) for the parser code from implementing it.
Thanks @sideshowbarker .
I think unescaped ampersand falls into at least: https://html.spec.whatwg.org/multipage/introduction.html#syntax-errors
- Unintuitive error-handling behavior (different parsing in data vs attribute values is unintuitive)
- Errors involving fragile syntax constructs (there are 2000+ named charrefs, knowing when & followed by text is ok is hard)
It's true that a new check means people will be presented with errors that were previously ok, which is a cost. But we improve the learnability of HTML and could avoid errors where entities are replaced but they were intended to be text.
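To give a feel for why "knowing when & followed by text is ok is hard", the sketch below consults the named character reference table that Python ships as `html.entities.html5`. The helper `legacy_prefix` is hypothetical and written for illustration only; it assumes its input contains no semicolon, and it reports the longest prefix that is a semicolon-less named reference, which is what makes a bare `&name` risky in text.

```python
from html.entities import html5
from typing import Optional


def legacy_prefix(name: str) -> Optional[str]:
    """Return the longest prefix of `name` that is a named character
    reference usable without a trailing semicolon, or None.

    Hypothetical helper; assumes `name` itself contains no ';'.
    """
    for end in range(len(name), 0, -1):
        if name[:end] in html5:
            return name[:end]
    return None


print(len(html5))  # size of the table, including semicolon-less entries
for candidate in ["ted", "bill", "copy", "copyright"]:
    print(candidate, "->", legacy_prefix(candidate))
```

Under these assumptions, `ted` and `bill` are safe after a bare ampersand, but `copy` and even `copyright` are not, which an author is unlikely to know without consulting the table.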
The problem is that virtually every <a> will trigger this error, if it contains any query parameters. It's uncommon for people to escape the & separating params.
Yes, it's confusing that in <a href="foo?bar&copy=bar">foo?bar&copy=bar</a> the attribute works correctly but the text shows a copyright symbol, but forcing checkers to flag all such links as invalid would be a huge issue, I think.