commonmark-spec Raw HTML does not support HTML5 attribute names

Raw HTML does not support HTML5 attribute names

Open srawlins opened this issue 8 years ago • 6 comments

HTML5 attribute names are specified as:

Attribute names must consist of one or more characters other than the space characters, U+0000 NULL, U+0022 QUOTATION MARK ("), U+0027 APOSTROPHE ('), ">" (U+003E), "/" (U+002F), and "=" (U+003D) characters, the control characters, and any characters that are not defined by Unicode.

This differs from CommonMark's text, already noted in 0.22:

An attribute name consists of an ASCII letter, _, or :, followed by zero or more ASCII letters, digits, _, ., :, or -. (Note: This is the XML specification restricted to ASCII. HTML5 is laxer.)

The one real-world use case of this that I am aware of is the upcoming Angular 2. Examples look like:

<input #todotext>
<button (click)="addTodo(todotext.value)">Add Todo</button>
<li *ng-for="#todo of todos"></li>

(from their User Input docs)
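To make the difference between the two rules concrete, here is a minimal Python sketch that checks the attribute names from the examples above against both quoted definitions. The function names are hypothetical, and the HTML5 check is only an approximation of the quoted wording: it assumes "space characters" means space, tab, LF, FF, and CR, and it treats the Unicode general categories Cc and Cn as "control characters" and "characters not defined by Unicode".

```python
import re
import unicodedata

# CommonMark rule as quoted: an ASCII letter, "_" or ":", followed by
# zero or more ASCII letters, digits, "_", ".", ":" or "-".
COMMONMARK_ATTR_NAME = re.compile(r"[A-Za-z_:][A-Za-z0-9_.:-]*")

# Characters excluded by the quoted HTML5 rule (assumption: "space
# characters" are space, tab, LF, FF, CR), plus NULL, quotes, ">", "/", "=".
HTML5_FORBIDDEN = set(" \t\n\f\r\x00\"'>/=")


def is_commonmark_attr_name(name: str) -> bool:
    """True if the whole name matches the CommonMark attribute-name rule."""
    return COMMONMARK_ATTR_NAME.fullmatch(name) is not None


def is_html5_attr_name(name: str) -> bool:
    """Approximate check of the quoted HTML5 attribute-name rule."""
    if not name:
        return False
    for ch in name:
        if ch in HTML5_FORBIDDEN:
            return False
        # Cc = control characters, Cn = unassigned code points (approximation).
        if unicodedata.category(ch) in ("Cc", "Cn"):
            return False
    return True


for name in ["#todotext", "(click)", "*ng-for", "class"]:
    print(name, is_html5_attr_name(name), is_commonmark_attr_name(name))
```

Under these assumptions, the three Angular names pass the HTML5 check but fail the CommonMark one, while an ordinary name such as class passes both.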

Sep 06 '15 23:09 srawlins

Holy s**t, they use # as a name start character? Or what is this "#todotext" string there? I know that HTML5 proper is neither based on SGML nor on XML, but this is really strange: the # character has been the "reserved name indicator" character since forever (since ISO 8879:1986 at least, when SGML became official). You can see it today (in XML DTDs) in the #PCDATA spelling.

[You can probably tell that I pretty much ignored HTML5 so far ;-)]

Nov 05 '15 00:11 tin-pot

@tin-pot that's not an HTML5 thing, it's an angular.js thing.

Jan 01 '16 17:01 sethdill

@sethdill:

that's not an HTML5 thing, it's an angular.js thing.

Thanks for the hint!

Given the syntax rules for attribute names in HTML5 (as quoted by the OP above) however, it seems that HTML5 just does not make use of these "improved" rules (at least not so far?), but HTML5 does certainly allow an attribute name to start with "#" (U+0023 NUMBER SIGN): I even looked it up, and can confirm that this is actually the case.

This is shocking and creepy enough for my taste, the "peculiar" choices for angular.js names notwithstanding. One of these days I really have to take a deep breath and work my way through the HTML5 specification, and I'm really afraid that many more such "surprises" are hidden therein …

Does anybody know how an attribute name like "#todotext" is supposed to be used in XHTML 5? Did "they" redefine XML for that purpose, too?

Or is "their" answer just "bad luck, we never promised that HTML5 can be faithfully transformed into (any form of) XML, so why don't you just use our nice, new and improved HTML5 syntax?" — Yuck!

And how is one supposed to parse such abominations if not with an XML parser? Because you certainly can't use an XML parser for that, as every such parser is required to reject documents containing "attribute names" like "#todotext" as non-well-formed. And you better not even try to use an SGML parser, I'm pretty sure that "WebSGML" extensions won't help much there either.
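The point about XML parsers can be checked with a short Python sketch using the standard library's xml.etree.ElementTree. The helper name is hypothetical, and the first snippet gives the attribute an empty value because XML has no valueless attributes, so that the only remaining problem is the name itself.

```python
import xml.etree.ElementTree as ET


def is_well_formed_xml(snippet: str) -> bool:
    """Return True if the snippet parses as a well-formed XML document."""
    try:
        ET.fromstring(snippet)
        return True
    except ET.ParseError:
        return False


snippets = [
    '<input #todotext=""/>',
    '<button (click)="addTodo(todotext.value)">Add Todo</button>',
    '<li *ng-for="#todo of todos"></li>',
    '<input todotext=""/>',  # control case with an XML-legal attribute name
]

for snippet in snippets:
    print(is_well_formed_xml(snippet), snippet)
```

Only the control case is accepted; the three snippets with Angular-style attribute names are rejected as not well-formed.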

I'm just completely baffled and appalled by these kinds of "improvements" that HTML5 is quite casually bringing into the world.

Jan 01 '16 21:01 tin-pot

I hear you. It's good to be smarter than that whole group. It's not like anything has come of HTML5.

;-)

Who wants an app-like web experience anyway? Yuck. We should go back to the days of browser-makers doing things their own way and competing instead of cooperating. I remember how much fun they were!

Sorry, I'm just teasing. I do find the idea that an attribute name can start with a # to be a little weird, but it's being used in the wild already and keep in mind that unrecognized stuff is just ignored.

Jan 01 '16 21:01 sethdill

Well, I certainly admit that good things have come out of HTML5—the rapid decline into obsolescence of Adobe Flash alone was probably worth it!

And I have nothing against adding new and useful elements like the whole range of media-related stuff, or "simple" things like <section>.

But what really puzzles me is the decision to throw out the syntactic foundation of both XML and HTML 4, and to instead introduce a whole new, third, independently defined "HTML5 syntax" which seems to only superficially resemble what we (and parsers) understood (X)HTML to be in the past.

But as I have said, I have yet to make my way through the HTML5 specification and rationale, so I'm writing a bit "tongue in cheek" (to use a distorted metaphor).

My question about ways to parse an "HTML5" document which contains attribute names like "#todotext" was, however, not meant purely as polemic: I just would not know how to proceed with it, for the reasons I stated (and I do recognize that HTML5 itself does not use such attribute names).

Would this not have an effect similar to "going back to the days of browser-makers doing things their own way", just that this time it's rather "parser" or "tool-chain" makers?

Up to now, one can pretty easily transform (valid) HTML 4.01 (and even more easily the stricter ISO 15445:2000 variant) into XML (or just parse it with an SGML parser), and then go on processing it in any way one likes—even in spite of the different basic syntaxes [namely the variant SGML reference concrete syntax declared in the "SGML declaration for HTML" for the former, and "XML syntax" for the latter (the XML syntax can be seen as a special variant or mode of SGML syntax, too)].

But it looks as if, with the new, third "HTML5 syntax",

neither XML nor SGML nor "traditional" HTML parsers can reliably parse "general" HTML5, and
even a "native" HTML5 parser can't create an equivalent XHTML document from "general" HTML5 input.

These are huge drawbacks regarding interoperability from my point of view, bringing us right back to "competing instead of cooperating", as one could say. And I fail to see the advantage of the "HTML5 syntax" which presumably should outweigh and compensate for these drawbacks.

Or am I missing something important here? (Other than the advice that in this light it is probably unwise to use element and attribute names which are illegal in both XML and "traditional" HTML.)

Jan 01 '16 22:01 tin-pot

@sethdill

It's good to be smarter than that whole group. It's not like anything has come of HTML5.

This reminds me of a rather funny comment made by Norman Walsh, regarding proposals to remove processing instructions from the XML syntax:

«I am totally exasperated by the folks that want to remove PIs from XML. "Here's your Swiss Army Knife, norm, oh, but we broke off the small blade (the internal subset) and we've removed the tweezers (PIs), because you don't really need those. And for good measure we welded the corkscrew open (thou shalt always put elements in a namespace). Is there anything else we can do to help you?"» – N. Walsh, 2003-01-24

What has this to do with HTML5? Well, the HTML5 syntax has a special "state" with the cute name bogus comment state, and lo and behold: this is where an HTML5 parser ends up when a PI is encountered, be it a valid HTML 4 PI or a valid XML PI (which are different).

So—no: not even a group like the W3C or WHATWG should be exempt from any questioning.

But then again, I still might find satisfying answers using the old-fashioned technique of reading the f***ing specification and rationale for HTML5.

I'll be back as soon as I have done this …

Jan 01 '16 23:01 tin-pot
