nokogiri
nokogiri copied to clipboard
refactor: relink_namespaces handles attributes properly
What problem is this PR intended to solve?
See #1790 for context.
Skip relinking namespaces in HTML4/5 documents
Also fix both implementations of relink_namespaces to not skip attributes when the node itself doesn't have a namespace.
Finally, the private method XML::Node#add_child_node_and_reparent_attrs is no longer needed after fixing relink_namespaces, so this commit essentially reverts 9fd03d8.
Have you included adequate test coverage?
I've added tests for HTML reparenting to ensure namespaces aren't linked.
Does this change affect the behavior of either the C or the Java implementations?
Both implementations are kept in sync with each other here.
I've added a commit that introduces test coverage for the HTML5 behavior from 7b3c972. This currently fails, will make it pass shortly.
Should be green now, I'm curious what you think @stevecheckoway
Hey @flavorjones, first let me apologize if I've missed other PRs or issues in which you've mentioned me. I need to get my GitHub settings squared away. Every time one of my students creates a repository (so once per assignment), GitHub adds it to my watch list. I'm apparently watching 1300 repos right now! The emails are overwhelming.
Namespaces in HTML documents is an area that I'm really pretty weak on. I found the HTML standard confusing and I haven't taken the time to dig in to what browsers do. HTML documents have nodes that live in one of six namespaces. HTML elements live in the HTML namespace, MathML elements live in the MathML namespace, and SVG elements live in the SVG namespace. The XLink, XML, and XMLNS namespaces are for attributes.
During parsing, math and svg start tags cause math and svg elements to be inserted in the MathML and SVG namespaces, respectively. Several places require attributes on foreign (i.e., not HTML) elements to adjust attributes such that some end up in a namespace. Normally, HTML attributes should not have namespaces. (I have no idea if we handle that correctly or not currently.)
Here's an example showing what I think should happen.
<!DOCTYPE html>
<svg viewBox="0 0 200 100" xmlns="http://www.w3.org/2000/svg">
<title><span xml:lang="en-US">title</span></title>
<text xml:lang="en-US" foo:bar="blah">This is some English text</text>
</svg>
<p xml:lang="en-US">More text.
- The (parser-inserted)
html,head, andbodyelements are in the HTML namespace with null prefixes. - The
svgelement is in the SVG namespace with null prefix. - The
viewBoxattribute has no namespace. - The
xmlnsattribute is in the XMLNS namespace with null prefix. - The
titleelement is in the SVG namespace with null prefix. - The
spanelement is in the HTML namespace with null prefix - The
span'sxml:langattribute has local namexml:langand no namespace. - The
textelement is in the SVG namespace with null prefix. - The
text'sxml:langattribute is in the XML namespace with prefixxmland local namelang. - The
text'sfoo:barattribute has local namefoo:barand no namespace. - The
pelement is in the HTML namespace with null prefix. - The
p'sxml:langattribute has local namexml:langand no namespace.
I've bolded the parts that might be surprising.
As you can see, it's quite complicated. Some foreign elements, and only in some circumstances (namely when one of these conditions is satisfied), can contain HTML elements whose attributes never have namespaces and can contain colons. And only some attributes with colons in the name in foreign elements end up with namespaces.
(It's probably worth noting that this impacts XPath and XSLT.)
So the question becomes what behavior should the Nokogiri API have when adding nodes whose attributes contain colons?
As a point of comparison, I played around with JavaScript a little bit. If I create a 'svg' element, it puts it in the HTML namespace unless I explicitly tell it the namespace to use. If I create an attribute with the name xml:lang, it treats it as an attribute with no namespace. If I add that attribute to the svg element in the SVG namespace, it doesn't adjust the attributes.
element = document.createElementNS("http://www.w3.org/2000/svg", "svg");
element.namespaceURI; // "http://www.w3.org/2000/svg"
attr = document.createAttribute("xml:lang");
attr.localName; // "xml:lang"
attr.prefix; // null
attr.namespaceURI; // null
element.setAttributeNode(attr)
element.attributes[0].localName; // "xml:lang"
element.attributes[0].prefix; // null
element.attributes[0].namespaceURI; // null
I'd be inclined to follow that behavior. If you want an element/attribute in a namespace, you gotta construct it in that namespace.
@stevecheckoway Thanks for the context dump, and no need to apologize for notification fatigue.
I'm going to take some time to read everything you wrote and make sure I understand it, because my lack of knowledge of HTML5 is showing.
reviewed
Rebased onto currrent main