HTML API: Lower-case HTML tag names in `get_qualified_tag_name()`.
Trac ticket: Core-61576.
Since this method is meant for printing and display, a more expected return value would be the lower-case variant of a given HTML tag name.
This patch changes the behavior accordingly. No tests are impacted by this change.
Diff best viewed ignoring whitespace changes.
Follow-up to [58867].
Core Committers: Use this line as a base for the props when committing in SVN:
Props dmsnell, jonsurrell.
@sirreal I don't feel strongly about this, but I do think that if we want to consider the change it'd be best to do before 6.7 is released, as after that it would be a backwards-compatibility break. I've expanded the docblock with a comparison to get_tag().
I've thought about this some more and I don't think we should make this change.
> Since this method is meant for printing and display, a more expected return value would be the lower-case variant of a given HTML tag name.
I'm not convinced this is the purpose of the method, although it depends what is meant by "printing and display."
Primarily, HTML API is for working with HTML input and HTML output. In HTML, the casing of tag names is irrelevant. There's no reason for svg tags to be "correctly" cased (<altGlyph> instead of <altglyph>), just like there's no reason to use upper or lower or any particular casing for HTML tag names. This method handles the correct casing for SVG element tag names, but that's not important for serializing and printing the svg tags in an HTML document.
It doesn't seem more correct to me to lowercase HTML tag names and MathML tag names, and then use mixed casing on SVG tag names when the element name differs.
Here's my take on this method.
This method applies the rules from the specification on parsing foreign content:
> If the adjusted current node is an element in the SVG namespace, and the token's tag name is one of the ones in the first column of the following table, change the tag name to the name given in the corresponding cell in the second column. (This fixes the case of SVG elements that are not all lowercase.)
>
> | Tag name | Element name |
> |---|---|
> | altglyph | altGlyph |
> | altglyphdef | altGlyphDef |
> | … | … |
This seems to adjust for a difference between an HTML tag name (case insensitive) and an SVG element name. This adjustment is important as a consideration of tree construction where HTML tokens are transformed into elements in a tree.
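To make the adjustment concrete, here is a minimal sketch of that lookup. The function name `example_adjust_svg_tag_name()` is hypothetical and not part of the HTML API, and only a handful of rows from the specification's table are reproduced; the real table is longer.

```php
<?php
/**
 * Hypothetical helper (not part of the HTML API): maps a tag name to its
 * SVG element name. Only a few rows of the specification's table appear
 * here; names without an entry are simply lower-cased.
 */
function example_adjust_svg_tag_name( string $tag_name ): string {
	static $adjustments = array(
		'altglyph'       => 'altGlyph',
		'altglyphdef'    => 'altGlyphDef',
		'clippath'       => 'clipPath',
		'foreignobject'  => 'foreignObject',
		'lineargradient' => 'linearGradient',
	);

	// HTML tag names are case-insensitive, so fold the case before the lookup.
	$lower = strtolower( $tag_name );

	return $adjustments[ $lower ] ?? $lower;
}

echo example_adjust_svg_tag_name( 'FOREIGNOBJECT' ), "\n";
echo example_adjust_svg_tag_name( 'PATH' ), "\n";
```

The lookup only makes sense for elements in the SVG namespace; HTML and MathML tag names would not go through it.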
At the moment, this roughly corresponds to Node.nodeName for elements (and Element.tagName). If we make this change, then it becomes an arbitrary decision.
And to state the obvious, it should be trivial for consuming code to lowercase tag names if desired.
> if we want to consider the change it'd be best to do before 6.7 is released, as after that it would be a backwards-compatibility break
Definitely.
> It doesn't seem more correct to me to lowercase HTML tag names
It's less about being correct and more about expectations. I think if you survey a bunch of people and ask them how they feel about turning <p><a><span> into <P><A><SPAN> they will have opinions about that. Also, we can survey HTML in the wild and see what global expectations might be, judging by how prevalent each casing style is.
Here is a survey from my list of ~300k HTML pages.
| Type of tag | Count | Percent |
|---|---|---|
| ALL UPPER | 307,499 | 2.4% |
| all lower | 12,388,675 | 96% |
| Mixed Case | 221,307 | 1.7% |
My point in sharing these numbers isn't to say they dictate what we do; just noting that the overwhelming majority of HTML out there is using lower-case tag names and people have grown accustomed to them.
In the case of normalization, this is the default behavior, which is why I care about it.
Going back in time, though, the reason I remember for introducing these functions was just to ensure that the html5lib tests pass, since they check against the adjusted foreign content tag names and attribute names. I don't feel these play a central role in spec compliance.
Your point is sound: it's trivial for calling code to lower-case-fold the tag names. Except, then they also have to remember to only do that for elements in the HTML namespace and not to do so for foreign elements. That leaves calling code calling this function and then immediately asking if it's an HTML element and then lower-casing.
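```php
// Lower-case only HTML elements; foreign (SVG/MathML) names keep their adjusted casing.
$tag_name = $this->get_qualified_tag_name();
if ( 'html' === $this->get_namespace() ) {
	$tag_name = strtolower( $tag_name );
}
```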
Maybe I had a gut reaction since get_qualified_tag_name() just made the exact same namespace check before returning, and it felt like spreading the same semantics across the inside and outside of that method.
I was going to close this, but I think I'll leave it open at least a little longer to continue pondering.
For general interest: here is the list of all-caps and mixed-case tag names from my survey. The list includes tag closers, and I didn't attempt to check if the closer casing matched the opening casing. Obvious HTML errors are evident, especially in the list of once-seen tags.
The list
ALL UPPER TAGS
65,179: A
30,621: BR
28,545: TD
23,392: FONT
18,875: P
15,786: IMG
15,154: B
13,723: TR
13,142: LI
10,102: OPTION
6,578: META
5,069: TABLE
4,701: HEAD
4,699: HTML
4,658: TITLE
4,576: BODY
4,137: I
4,008: ID
3,945: EM
3,476: H1
3,167: CENTER
2,532: HR
2,351: SPAN
1,654: INPUT
1,529: STRONG
1,515: UL
1,098: AREA
990: SCRIPT
853: H2
844: O:P
792: H3
785: WBR
777: DIV
740: LINK
738: TH
661: MENU
620: BLOCKQUOTE
610: TBODY
418: U
400: FORM
379: H4
318: SMALL
270: STYLE
266: FRAME
200: PRE
186: ADDRESS
174: BIG
168: MAP
166: PARAM
151: SELECT
142: FRAMESET
124: TT
94: SPACER
92: ABBR
88: CITE
81: SUP
78: NOBR
72: NOSCRIPT
63: H6
56: BASE
52: H5
51: COL
49: IFRAME
36: COLGROUP
35: DD
34: NOFRAMES
33: DT
30: OBJECT
28: EMBED
27: SUMMARY
25: MARQUEE
25: SUB
22: HEADER
22: LAYER
20: EC
19: BASEFONT
19: OL
15: FIGURE
14: NAV
13: DFN
12: V:F
12: ARTICLE
12: ILAYER
11: SECTION
11: X-CLARIS-WINDOW
11: X-CLARIS-TAGVIEW
11: CAPTION
11: BUTTON
9: THEAD
9: FIGCAPTION
8: O
8: ZBLINK
8: LABEL
7: CODE
7: DIR
7: LH
7: AUDIO
6: BLINK
6: STYLE='MSO-BIDI-FONT-WEIGHT:
5: X-SAS-WINDOW
5: BGSOUND
4: FOOTER
4: TEXTAREA
4: ALIGN=LEFT
4: X-CLARIS-REMOTESAVE
4: NOINDEX
3: APPLET
3: LEFT
3: BOLD
3: KBD
3: SP
2: Y
2: E=
2: C
2: NÍ
2: MARK
2: STRIKE
2: WSJ
2: NOLAYER
2: ALIGN="LEFT"
2: INSERT_COUNT*
2: BD
2: NOFRAME
2: INS
2: LNAME
2: FNAME
2: URL
2: ACRONYM
Once-seen tags: A,, AAREA, ALIGN="RIGHT", ALIGN=CENTER, ASIDE, B<P, BGCOLOR="#000000", BOTTOM, BR<B, BRïï, CENTRE, CFINCLUDE, CLEAR, COLDEF, COLDEFS, CONNECTED,PREFERRED, CSACTION, CSACTIONDICT, CSACTIONITEM, CSACTIONS, CSSCRIPTDICT, CUFON, CUFONCANVAS, CUFONTEXT, DL, DOC, DOCTYPE, EF_B_RED, FIELDSET, GCSE:SEARCH, GCSE:SEARCHBOX-ONLY, H, HEADS_TAG, HRNOSHADE, HTM, HTML!, IF_ERRORPARAM, IF_ERRORSTR, IF_ERRORTYPE, INSERTFLASHHEAD, INSTITUTE, JSON, LEGEND, LI,<A, MAJOR, METANAME="DESCRIPTION", METANAME="KEYWORDS", NAME, NOAUTOLINK, NOF, O:LOCK, OCCUPATION, ONTOLOGY, ROWS, SAMP, SPOUSE, T, TAIL, TILTE, TIME, U7:P, UNION_TAG_INDEX_FOOTER, UNION_TAG_INDEX_HEADER_1, UNION_TAG_INDEX_HEADER_2, UNION_TAG_INDEX_TITLE, UP-21, V:FORMULAS, V:PATH, V:SHAPETYPE, V:STROKE
Mixed Tags
32,047: Key
32,041: Contents
32,041: LastModified
32,041: ETag
32,041: Size
28,041: StorageClass
4,007: Owner
4,004: DisplayName
4,000: Generation
4,000: MetaGeneration
1,578: feColorMatrix
1,466: feComposite
1,390: feComponentTransfer
1,390: feFuncA
1,387: feFuncR
1,387: feFuncG
1,387: feFuncB
807: linearGradient
680: clipPath
459: Error
459: Code
459: Message
380: RequestId
379: Option
373: HostId
155: feGaussianBlur
125: feBlend
122: feOffset
111: bR
91: Br
80: tD
79: feFlood
75: Td
74: feMergeNode
71: Font
64: o:SmartTagType
49: textPath
48: st1:State
45: Meta
43: radialGradient
41: tR
38: animateTransform
38: ListBucketResult
38: Name
38: Prefix
38: Marker
38: IsTruncated
37: feMerge
37: st1:City
34: MaxKeys
33: Tr
31: Center
31: Table
30: rdf:RDF
30: asp:ListItem
29: Img
28: feMorphology
26: Strong
26: foaf:givenName
26: foaf:familyName
25: QueryParameterName
25: QueryParameterValue
25: Reason
24: st1:PlaceName
23: AccountName
23: cc:Work
21: Script
21: Title
20: RecommendDoc
19: st1:PlaceType
19: class="Text"
16: Li
13: noScript
13: psi:contextVar
13: contBox-x
12: Input
11: u51:SmartTagType
10: Details
10: Body
10: ItemTemplate
10: u46:SmartTagType
10: u48:SmartTagType
9: Dd
9: foreignObject
9: Th
9: psi:queryVar
9: u26:SmartTagType
9: u40:SmartTagType
9: u45:SmartTagType
9: u52:SmartTagType
9: u53:SmartTagType
9: st1:Street
8: u28:SmartTagType
8: u29:SmartTagType
8: u31:SmartTagType
8: u42:SmartTagType
8: u47:SmartTagType
8: u49:SmartTagType
8: u54:SmartTagType
8: u55:SmartTagType
8: Button
8: asp:RequiredFieldValidator
8: asp:TextBox
7: color="#CC0000"
7: MainOrArchivePage
7: rdf:Description
7: u23:SmartTagType
7: u24:SmartTagType
7: u27:SmartTagType
7: u33:SmartTagType
7: u36:SmartTagType
7: u37:SmartTagType
7: u43:SmartTagType
7: Head
7: color=#fFee00
6: Event-Card-Open-Close-Toggle
6: HFBusiness
6: u1:SmartTagType
6: u4:SmartTagType
6: u25:SmartTagType
6: u30:SmartTagType
6: u34:SmartTagType
6: u35:SmartTagType
6: u38:SmartTagType
6: u39:SmartTagType
6: u41:SmartTagType
6: u44:SmartTagType
6: u50:SmartTagType
6: u57:SmartTagType
6: u58:SmartTagType
6: u60:SmartTagType
6: u62:SmartTagType
6: u63:SmartTagType
6: u64:SmartTagType
6: u65:SmartTagType
6: u66:SmartTagType
6: u67:SmartTagType
6: u68:SmartTagType
6: u69:SmartTagType
6: iconSm-x
6: Select
6: asp:Panel
5: psi:sessionVar
5: u32:SmartTagType
5: u61:SmartTagType
5: st1:PostalCode
5: liNK
4: InLineReplace
4: Style
4: psi:sortOp
4: u7:SmartTagType
4: u8:SmartTagType
4: u9:SmartTagType
4: u10:SmartTagType
4: u11:SmartTagType
4: u12:SmartTagType
4: u17:SmartTagType
4: u21:SmartTagType
4: u22:SmartTagType
4: u56:SmartTagType
4: u59:SmartTagType
4: NextMarker
4: httpStatusCode
4: Form
3: Resource
3: ListAllMyBucketsResult
3: Buckets
3: Ul
3: u5:SmartTagType
3: u6:SmartTagType
3: u13:SmartTagType
3: u14:SmartTagType
3: u15:SmartTagType
3: u16:SmartTagType
3: u73:SmartTagType
3: u72:SmartTagType
3: u74:SmartTagType
3: toggleSection
3: Initial-scale=1.0"
3: Ozelliklerimiz<
3: BucketName
3: asp:Label
3: invalidTag
3: String
2: hR
2: xmpMM:DerivedFrom
2: QlÃIPq¼I3J]ߢ*5×¾¢GC
2: Wo|¥´bFôÈ®D:ýx3¨j8V~ùs¸xÑ,4P[\ô÷sDóÃ1#ð£y)F?|ù
2: k*H)ϼzÐâ5U%Oý
2: Z.¶seiÁ%<Aù¯~õÀZÇv¸¼ºXBË
är9KÇãào¼KNôT·2 ÛÁcÚFáÌú¾èMJ`.ôSÕûUÐÀ¡Õ7»H$2³Èhe¤þPEçûP°IBZ)R]
8R°ÊÆ2dd5æ
2: sÓ?©JhJ¢ÉéëcZzÖAÝ6âd£5èIå
2: H©õj&[زê³èsúòÆýÑ
2: NY´Ø&bTÉ9C¢q¼Vº
2: JÏSÖ!Ðr¼K9äy´è(J±Ä¤8xõ L&c1Ç
Ga [Yj*Ëf«:þ¹DWW0¨ÜÑ(HÔ
2: W°éÿÛ=ek¾6¹Æ^Äu
2: hX¼Wy2ûºfzÐ[*vÀq°«
2: KMx¢ÛY75)-¾=¤vw3ê;¿î=-Sç\»Ñ7Ëu\2J^S[¤C&CC-'ððÒ©ä}'é¯XLjÒ ðXuÂAÔdd5æ
2: Rñú9w1÷ñÿ\²þd=¡ùºÞ¿òel¿Ô³¤ø¥ãôrîcïcþ¹eüÈp~n·¯ü[
2: y+x¬û¤óéI¡*Y1"J+Å
2: Q·öÝm§o}Ý)eóVQú=QNY$GIËZÄ¥,LÌkÐPXh¾±ÂA#ÌÿUp$<3,$ûGë@Ü9@32e¯C@@æ8t<ÿhàÿy¿ÍÏú#´âÞþ
2: pñçîð²Ä²Zóöèh5ßÏê¶
2: i!?Æ
%µÀl[N&åoÖHA%$&G
2: WK¼bøÒ²ÿ³«úO=tãóÒî2ãm%ü[6*
=
2: ré*Fu,Ü0,¯g!j4VÉTKÝËÜ4LKgN?ØÎ¤§gѼ«o"(äFöEo¡÷»ã4Hî×·@ËNºPê"áGÓ5È¡[Lq
2: Zâè:P%ÞвXS1ó<e@ÊKôw,*ËË1Òs{¹¥
2: kXu¦!.\
2: AùCd»´{×ýÚà"CÁÂÒ0ë÷NG»
2: o'©SVVp
2: YÆIeUVë¦X«+M¦oòÖÞEöY{¸²Ö"ïb)¯Üo´4YZË
2: Bþ®J¼½+˹»EUÅUððV2¾UiÜcK¶PA(P*ÑE#
2: j«Ct³èîÖ{]òZ¦Ûk]Ï%%ËzT~
³îËÖÒÁ®Ú×µó÷ybmMqjê©»±²pÐÕBp_ß<À}õ)iBËNØ(ã4àÉÆêìl¤Ô¤êÛÑR¸VJ¬Ý[ëgHáÍÊë;hNqXb¢ÖUùM¿&K8ªuiMaf¶¼[A{ö"ÐÛÕÝ}Y
dës\ajwjêö¾u.¤F¥(CxR²JN
2: p»Jk<ð§=¡+üã£h²Ò8ÖäõM¼êú©*V¬ÆZX·qñ+,½ÿ5e5xHö+«ÙöÑX[Æ¿ºú
2: n£ØbðlÄÜÄCilüì]aûÛ÷ªÆâb·XZ°%À
jH*
2: Pmj¼Ï'Þ²dõCUÙz¢ª©ªèJ¦«@j
2: G£øg@]($÷GTJó V¼§øP}ãà(â÷iÊ
tñtRá\YtÝP*2
2: Aù¯våÀXpÙ~%*qà°à
2: F%{þöéyàO4ôéEs#¢=ó^G:²
2: bgcolor="#FFFFFF"
2: Valign="middle"
2: tdOCTYPE
2: u18:SmartTagType
2: u19:SmartTagType
2: u20:SmartTagType
2: u70:SmartTagType
2: u71:SmartTagType
2: AHREF="http:
2: Address
2: socketType
2: i:pgfRef
2: ServerTime
2: TraceId
2: Transition
2: font-face="Times
2: Werewolf
2: Civilian
2: feTurbulence
2: feDisplacementMap
2: Area
2: Html
2: headHTML
2: asp:HyperLink
2: asp:DropDownList
2: asp:CheckBox
Once-seen tags: AAddress, AHREF="Collections.html", AHREF="mailto:[email protected]", Basilica, BlockQuote, Bstyle="color:black;background-color:#ffff66", Bucket, CENTEeR, Content-Type:, Endpoint, Footer, GallerySection, Gblockquote, HREF="http:, Hostip, HostipLookupResultSet, Iframe, In, InstragramWidget, Left, Link, Mmeta, Numbertemplate, P,Here, Page, Rosa, Rotten, SCRILongDateAGE="JavaS0riSunday,, SetModTime, Span, Special, TABle, UdeM_menu, UnknownOperationException, Xlink, YX*~³;]<úæ
¹ÒÕ^ö¬ÖÕ'ÿÔwNJcZo^«èk\E£òâ»P\éäD~Ñ4Þ÷ä97
«ÆÂ ý, alt="El, aname="OLE_LINK2", aria-label="Instagram", aria-label="YouTube", asp:Button, asp:CustomValidator, asp:LinkButton, asp:Localize, asp:RegularExpressionValidator, asp:ValidationSummary, bDopo, clientConfig, cmLogo, com:Text, contPad, countryAbbrev, countryName, customHeaders, displayName, displayShortName, dnn:DnnCssInclude, emailProvider, fOnt, face="Verdana", font="C60000, f~ÐÙSBøh_ÝXÄÄÁ÷GV~uc¦ác{3÷Övñ:B¹1°$æ1·¸°BÛ?YçÁkUHY6~ðå,GVð, gml:Null, gml:Point, gml:boundedBy, gml:featureMember, gml:pointProperty, google-site-verification=OgY2mih_AxAZzi7f8b33QTOGHoScolbNOE6aTlqold0, httpProtocol, incomingServer, ipLocation, jA|ó+§ðü©cÈå,¼²e[WId2ùSÂë#ã§Rxáô¾~æBµÕÞôqm#ºÃ×ÐÒÂ
§¡¶*ó7Qh2Ré, link="#0000FF", mes:Error, mes:ErrorMessage, meta name="description" content="Evo, meta name="keywords" content="Bike, meta property="og:description" content="Rent, navBarCenter, ns0:City, ns0:PlaceName, ns0:PlaceType, ns0:State, onMouseover="over_effect(event,'outset')", outgoingServer, pYes,, psi:ZOTERO_COinS, psi:iktList, psi:sortOptions, quicklinkComp, skippedTag, title="Fairfield, titleB, toggleCont, togglePad, xáP¾~æ$
Found 12,917,481 tags
of those:
ALL UPPER: 307,499 (2.38%)
all lower: 12,388,675 (95.906%)
Mixed Case: 221,307 (1.713%)
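For anyone wanting to reproduce the buckets, the following is a rough sketch of how a tag name could be classified, plus the percentage arithmetic for the totals above. The function `example_classify_tag_casing()` is hypothetical and only looks at ASCII letters; the script actually used for the survey may have handled edge cases differently.

```php
<?php
/**
 * Hypothetical classifier: buckets a tag name by the casing of its ASCII
 * letters. Names without any ASCII letter fall into "all lower", which is
 * an arbitrary choice made for this sketch.
 */
function example_classify_tag_casing( string $tag_name ): string {
	$has_upper = 1 === preg_match( '/[A-Z]/', $tag_name );
	$has_lower = 1 === preg_match( '/[a-z]/', $tag_name );

	if ( $has_upper && $has_lower ) {
		return 'Mixed Case';
	}

	return $has_upper ? 'ALL UPPER' : 'all lower';
}

// Counts copied from the survey summary above.
$counts = array(
	'ALL UPPER'  => 307499,
	'all lower'  => 12388675,
	'Mixed Case' => 221307,
);
$total  = array_sum( $counts );

foreach ( $counts as $label => $count ) {
	printf( "%s: %s (%.2f%%)\n", $label, number_format( $count ), 100 * $count / $total );
}
```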
@sirreal want to examine this again and consider it?
I still don’t have strong feelings about it, but the more I do with constructive uses of the HTML API, the more I like having functions like this available for things like serialize_token() and for doing things like closing the stack of open elements.
It’s mostly nice for the SVG/MathML elements requiring mixed case, but it’s also convenient not to have to call strtolower() everywhere, especially when that gets mixed with a stack of conditionals checking a bunch of properties of the element to determine whether it is special and mixed-case.
I've been going back and forth. I want to make a coherent decision here.
As implemented, this would print lower-case tag names for HTML and MathML elements, but for SVG tags it will print lower or camel case (e.g. path and foreignObject), as described in the specification.
In trunk, the method roughly corresponds to Node.nodeName for elements (and Element.tagName).
With this change, it corresponds to Element.localName. I noticed that the Chrome devtools use this scheme for printing tag names in the "Elements panel" and discovered it uses localName. I do like that this corresponds to an existing concept of localName and it's not an arbitrary mixed casing decision particular to the HTML API.
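The difference between the two schemes can be sketched as follows. Both function names are hypothetical, the namespace strings are assumed to be 'html', 'svg' and 'math', and `$local_name` is assumed to be the name as the parser stored it: lower-case for HTML and MathML, case-adjusted for SVG.

```php
<?php
/**
 * Hypothetical: a name in the style of Node.nodeName / Element.tagName.
 * HTML elements report an upper-cased name; foreign elements keep theirs.
 */
function example_node_name( string $namespace, string $local_name ): string {
	return 'html' === $namespace ? strtoupper( $local_name ) : $local_name;
}

/**
 * Hypothetical: a name in the style of Element.localName.
 * The stored name is reported unchanged in every namespace.
 */
function example_local_name( string $namespace, string $local_name ): string {
	return $local_name;
}

$elements = array(
	array( 'html', 'div' ),
	array( 'math', 'mi' ),
	array( 'svg', 'foreignObject' ),
);

foreach ( $elements as list( $namespace, $name ) ) {
	printf(
		"%-4s nodeName-like: %-13s localName-like: %s\n",
		$namespace,
		example_node_name( $namespace, $name ),
		example_local_name( $namespace, $name )
	);
}
```

Only the HTML row differs between the two columns, which is the whole of the behavior change under discussion.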
I'm not opposed to this; however, I'm not sure what benefit there is to printing a few SVG tags in camel case. They're not treated any differently, and I suspect the vast majority of web developers would lower-case the SVG tags as well.
My big question is, why not always use strtolower( $processor->get_tag() ) for printing? It's probably closer to what developers expect.
> what benefit there is to printing a few SVG tags in camel case … why not always use strtolower( $processor->get_tag() ) for printing?
This probably goes back to XML being case-sensitive whereas HTML is not, but here we have something that’s mostly like XML being embedded within HTML. Also, the tag names in the XHTML and MathML namespaces are all lowercase, so SVG remains unique.
It makes me wonder if there’s relevance here with safe SVG handling, or export to XML. None of this is particularly decisive, but I tried an experiment with SVG tag names and also with attribute names, thinking along the same lines. Below are the source and a render from Safari of proper and lower casing. We can see that when embedded it has no bearing, but when provided as an external document or when enclosed as a data URI it does impact the render.
I count 39 mixed-case tag names and 53 mixed-case attribute names. I wonder if it would be worth incorporating these. If we really don’t like them here we can put it all inside of serialize_token(), but then again I kind of like that function remaining focused on what it does.
@sirreal ~what if we thought of a new method called something like get_local_name() and get_local_attribute_names_with_prefix() which would mirror the existing two functions, but which wouldn’t auto-complete as the default thing when looking for tag or name? I would love it if we can retain the aspect of the HTML API which is that it pushes for upper-cased tag names in source code (which makes it somewhat easier to identify and search for code working with them).~ On more review this is kind of exactly what the get_qualified_tag_name() and get_qualified_attribute_name() are for, we just aren’t exporting the lower-cased variants for HTML and MathML, which we might expect.
I now think that it could be appropriate here to only lower-case normative HTML elements. Custom elements likely should have their casing preserved. Though I don’t know what to do about unknown HTML elements that are not custom elements, like <nonexistent>. Technically that’s neither a custom element nor an HTML element. It looks like HTML, so maybe lower-case it? (leaving room for expansion of the HTML tag set in the future).
> Though I don’t know what to do about unknown HTML elements that are not custom elements, like <nonexistent>. Technically that’s neither a custom element nor an HTML element. It looks like HTML, so maybe lower-case it?
Wouldn't it be an HTML element? (Aren't custom elements also technically HTML elements?)
It would be handled by the "any other start/end tag" rules, the start rule is to:
> Insert an HTML element for the token.
If we inspect in the browser, the element's namespaceURI is http://www.w3.org/1999/xhtml.
Unless, of course, it's being parsed in foreign content, in which case the unknown element seems to inherit the namespace.
I think all the namespacing is handled correctly by the HTML Processor so I don't think unknown or custom elements should require special handling.
> Wouldn't it be an HTML element?
Yes of course but I wasn’t talking about namespaces. I was contrasting the fact that we have distinct custom elements which behave differently than the set of tags defined as HTML elements. We could say that something like <nonexistent> is a potential future defined HTML element in the same way that <selectedcontent> was not defined but now is.
So I’m not trying to delineate in which namespace these belong but whether we should be applying the rules for custom elements to them since they are indeed custom.
In writing this it seems strange to do anything other than lowercase them. I don’t know why I thought we should handle them differently.
I’ll get back to this soon and fully review everything that was previously uncertain.