html Encourage always-escaping ampersand character.

In the example highlighting ambiguities from missing semicolons on named character references, a "correct" encoding is provided, but that example makes no mention of the fact that the fragment was ambiguous precisely because the ampersand wasn't escaped.

This patch adds a clarifying note explaining how this situation is avoided by always escaping the ampersand.

[ ] At least two implementers are interested (and none opposed):
- …
- …
[ ] Tests are written and can be reviewed and commented upon at:
- …
[ ] Implementation bugs are filed:
- Chromium: …
- Gecko: …
- WebKit: …
- Deno (only for timers, structured clone, base64 utils, channel messaging, module resolution, web workers, and web storage): …
- Node.js (only for timers, structured clone, base64 utils, channel messaging, and module resolution): …
[ ] Corresponding HTML AAM & ARIA in HTML issues & PRs:
[ ] MDN issue is filed: …
[ ] The top of this comment includes a clear commit message to use.

(See WHATWG Working Mode: Changes for more details.)

Dec 04 '25 19:12 dmsnell

As a side note, I overlooked adding my name to the list of contributors in my first submission.

Dec 04 '25 19:12 dmsnell

I was surprised to find no recommendation about escaping & with character references anywhere in the HTML standard. The section this PR touches seems to encourage not escaping & if it is not ambiguous (bold mine):

Thus, the correct way to express the above cases is as follows:

<a href="?bill&ted">Bill and Ted</a> <!-- &ted is ok, since it's not a named character reference -->

<a href="?art&amp;copy">Art and Copy</a> <!-- the & has to be escaped, since &copy is a named character reference -->

I read this as if &amp;ted would be wrong in some way, since it isn't the correct way. However, it seems much simpler to me to escape the ampersand here as &amp;.

I would change this section to something like the following:

-<!-- &ted is ok, since it's not a named character reference -->
+<!-- "&ted" is ok because "ted" is not a named character reference. -->
+<!-- "&amp;ted" is equivalent and less error-prone because "&amp;" explicitly decodes to "&". -->
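As a rough illustration of the equivalence claimed in the suggested comment, the sketch below runs the strings from the examples through Python's `html.unescape`. It is only an approximation: `html.unescape` applies the text (data) decoding rules and does not model the attribute-value exception, although for these particular strings the result is the same either way. The `cases` and `decoded` names are made up for this example.

```python
import html

# The four spellings discussed above: unescaped and escaped ampersands,
# followed by a non-reference name ("ted") and a legacy named reference ("copy").
cases = ["?bill&ted", "?bill&amp;ted", "?art&copy", "?art&amp;copy"]

# html.unescape decodes character references using the HTML5 named
# reference table, including legacy names that lack a trailing semicolon.
decoded = {source: html.unescape(source) for source in cases}

for source, result in decoded.items():
    print(f"{source!r:20} -> {result!r}")
```

Both spellings of the "ted" case decode to the same text, while only the escaped spelling of the "copy" case keeps the literal `&copy`.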

There is precedent for such a recommendation. Section 4.12.1.3 Restrictions for contents of script elements has a prominent note with an encoding recommendation:

The easiest and safest way to avoid the rather strange restrictions described in this section is to always escape an ASCII case-insensitive match for "<!--" as "\x3C!--", "<script" as "\x3Cscript", and "</script" as "\x3C/script" when these sequences appear in literals in scripts (e.g. in strings, regular expressions, or comments), and to avoid writing code that uses such constructs in expressions. Doing so avoids the pitfalls that the restrictions in this section are prone to triggering: namely, that, for historical reasons, parsing of script blocks in HTML is a strange and exotic practice that acts unintuitively in the face of these sequences.

Section 13.1.4 Character references seems like a good place to add a similar note. For example

[!NOTE] Where character references are allowed, it's a good idea to always encode & with its character reference &amp;. This prevents any ambiguity as to whether the & is part of a character reference or a literal &.

I would consider mentioning the most common characters that are useful to escape in different contexts, but the note about & seems particularly helpful.

Dec 05 '25 10:12 sirreal

https://html.spec.whatwg.org/multipage/syntax.html#character-references already requires this so I'm not sure we need to state it again in the parser section. Is the problem that the parser doesn't flag it?

Dec 05 '25 14:12 annevk

Is the problem that the parser doesn't flag it?

I believe the problem here is that the illustrative example in the syntax-error section explicitly states that the correct way to produce HTML text containing & is to not escape it if what follows is not a legitimately-parsed character reference.

The example illustrates that a parser will correctly identify &ted as that raw string, but suggests that &ted is more appropriate than &amp;ted.

So basically this is just a confusing aspect for implementers and it seems like we could tweak the wording to maintain the demonstration of how these errors are handled without encouraging people to lean on syntax errors in cases where they produce the right output.

Dec 05 '25 17:12 dmsnell

I see, this is part of https://html.spec.whatwg.org/multipage/introduction.html#syntax-errors.

We don't disallow &ted currently so unless we also change the HTML Writing requirements in some way I'd be a bit hesitant to change it in this one place.

Dec 05 '25 17:12 annevk

@annevk thanks. I’m very open to trying out different ideas, but I think the spec is actually a bit vague on this.

already requires this

Unless I’m wrong, the spec does not require that & be escaped as &amp;, only that when mixing character references with text that they must begin with & and be followed by the correct syntax.

However, if someone is authoring HTML and not intending to produce a character reference, a stray & is both properly decoded by the parser and not forbidden.

I think we all agree that the intention is to always escape & as &amp;, but in the nitty gritty, unless it’s hidden in some other section none of us have scoured up yet, it’s not explicitly normalized as such. The only reference we’ve been able to find that isn’t implied is the one in this PR, where the spec assertively states that it’s correct to omit the escaping.

Dec 05 '25 17:12 dmsnell

I apologize for omitting the before/after screenshots. I took a before shot and was waiting to add it to the description until the parser previews had been generated, but they never appeared, and I forgot to upload the before shot anyway. Here is the relevant context from the modified section.

Dec 05 '25 17:12 dmsnell

That's what I'm saying as well though in my latest comment. The Writing section explicitly allows you to do this. So I don't want to accept this PR as-is, as it'll contradict the Writing section.

@zcorpan was involved in some of the details here and should probably weigh in.

Dec 05 '25 17:12 annevk

sounds great, and I have no wish that this be as-is. in fact, I was hoping for further input because I myself struggled to figure out how best to represent it. @sirreal is the author of the original suggestion.

interestingly enough, the HTML 3 spec was clearer on this point, but that entire document comprises only a handful of ill-defined paragraphs 🙃

Because certain characters will be interpreted as markup, they should be represented by markup…for instance the character "&" must be represented by the entity &amp;.

Dec 05 '25 18:12 dmsnell

I think it's worth considering switching to require escaped ampersands. The rules for when it's allowed are non-trivial and it's surprising that &ted is OK but &copy is not OK, or that the behavior is different between in data and in attribute values.

Always escape & is clear and easy to understand.

This was my position in 2007 also: https://lists.whatwg.org/pipermail/whatwg-whatwg.org/2007-September/012457.html

cc @hsivonen @sideshowbarker

Dec 09 '25 13:12 zcorpan

Always escape & is clear and easy to understand.

This is what I'd really like to address with at least a recommendation in the HTML standard that & is best escaped where applicable.

@dmsnell linked to the HTML3 spec. HTML4 also makes a recommendation:

Authors should use "&amp;" (ASCII decimal 38) instead of "&" to avoid confusion with the beginning of a character reference (entity reference open delimiter).

Escaping & is something we understand implicitly and it's apparent in functions like PHP's htmlspecialchars or Python's html.escape.
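A minimal sketch of that implicit practice, using Python's `html.escape` and `html.unescape` from the standard library. The sample strings are taken from this thread; the variable names are only illustrative.

```python
import html

# Raw values that contain a literal ampersand.
raw_values = ["?bill&ted", "?art&copy", "foo?bar&copy=bar"]

# html.escape always rewrites "&" as "&amp;" (and also escapes "<", ">",
# and, by default, quotes), regardless of what follows the ampersand.
escaped = [html.escape(value) for value in raw_values]

# Decoding the escaped form gives back the original raw value,
# even when the text after "&" happens to be a named reference.
round_tripped = [html.unescape(value) for value in escaped]

for raw, esc in zip(raw_values, escaped):
    print(f"{raw!r:22} -> {esc!r}")
```

Because every ampersand is escaped unconditionally, the author never has to check whether the following characters form a named character reference.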

An explicit recommendation in the standard about & escaping would be a service to web developers.

Dec 09 '25 15:12 sirreal

I think we should make it a parse error if we change this.

Dec 11 '25 13:12 zcorpan

I think we should make it a parse error if we change this.

I would rather we don't. I say that because I don't actually want to implement an error or warning for this in the checker, despite whatever the spec may end up being changed to say here. I don't think it will actually be good for users to be getting new errors or warnings from the checker about this.

But if it's made an actual parse error in the spec, I would somewhat be forced into it, regardless — because for errors from the HTML parser, the checker basically just bubbles all those up as-is.

That said, I would also not personally implement a parse error for it in the HTML parser sources. But there's nothing that would prevent any other contributor (or code owner) for the parser code from implementing it.

Dec 11 '25 23:12 sideshowbarker

Thanks @sideshowbarker .

I think unescaped ampersand falls into at least: https://html.spec.whatwg.org/multipage/introduction.html#syntax-errors

- Unintuitive error-handling behavior (different parsing in data vs attribute values is unintuitive)
- Errors involving fragile syntax constructs (there are 2000+ named charrefs, knowing when & followed by text is ok is hard)

It's true that a new check means people will be presented with errors that were previously ok, which is a cost. But we improve the learnability of HTML and could avoid errors where entities are replaced but they were intended to be text.

Dec 12 '25 20:12 zcorpan

The problem is that virtually every <a> will trigger this error, if it contains any query parameters. It's uncommon for people to escape the & separating params.

Yes, it's confusing that in <a href="foo?bar&copy=bar">foo?bar&copy=bar</a> the attribute works correctly but the text shows a copyright symbol, but forcing checkers to flag all such links as invalid would be a huge issue, I think.
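The difference between the attribute and the text can be sketched in a few lines of Python. This is a simplified illustration, not the parser algorithm itself: the `decode_named` function is hypothetical, it handles only named references (numeric ones are left untouched), and it relies on the `html5` table from Python's `html.entities` module. The attribute branch follows the rule that a reference without a trailing semicolon is left alone when the next character is "=" or an ASCII alphanumeric.

```python
import re
from html.entities import html5  # named references, with and without ";"

_CANDIDATE = re.compile(r"&([A-Za-z0-9]+;?)")

def decode_named(text: str, in_attribute: bool) -> str:
    """Decode named character references (hypothetical helper, simplified)."""
    out = []
    i = 0
    while i < len(text):
        match = _CANDIDATE.match(text, i)
        if not match:
            out.append(text[i])
            i += 1
            continue
        candidate = match.group(1)
        # Find the longest prefix of the candidate that is a known reference.
        for end in range(len(candidate), 0, -1):
            name = candidate[:end]
            if name in html5:
                break
        else:
            out.append("&")  # no reference matched: keep the ampersand
            i += 1
            continue
        following = text[i + 1 + end : i + 2 + end]
        legacy = not name.endswith(";")
        blocks = following == "=" or (following.isascii() and following.isalnum())
        if in_attribute and legacy and blocks:
            out.append("&")  # historical exception for attribute values
            i += 1
            continue
        out.append(html5[name])
        i += 1 + end
    return "".join(out)

link = "foo?bar&copy=bar"
print(decode_named(link, in_attribute=True))   # href value
print(decode_named(link, in_attribute=False))  # link text
```

With the ampersand written as `&amp;`, both calls would produce the same output, which is the behavior the always-escape recommendation is after.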

Dec 12 '25 21:12 tabatkins