nokogiri icon indicating copy to clipboard operation
nokogiri copied to clipboard

refactor: relink_namespaces handles attributes properly

Open flavorjones opened this issue 4 years ago • 6 comments
trafficstars

What problem is this PR intended to solve?

See #1790 for context.

Skip relinking namespaces in HTML4/5 documents

Also fix both implementations of relink_namespaces to not skip attributes when the node itself doesn't have a namespace.

Finally, the private method XML::Node#add_child_node_and_reparent_attrs is no longer needed after fixing relink_namespaces, so this commit essentially reverts 9fd03d8.

Have you included adequate test coverage?

I've added tests for HTML reparenting to ensure namespaces aren't linked.

Does this change affect the behavior of either the C or the Java implementations?

Both implementations are kept in sync with each other here.

flavorjones avatar Aug 14 '21 16:08 flavorjones

I've added a commit that introduces test coverage for the HTML5 behavior from 7b3c972. This currently fails, will make it pass shortly.

flavorjones avatar Aug 14 '21 16:08 flavorjones

Should be green now, I'm curious what you think @stevecheckoway

flavorjones avatar Aug 14 '21 16:08 flavorjones

Hey @flavorjones, first let me apologize if I've missed other PRs or issues in which you've mentioned me. I need to get my GitHub settings squared away. Every time one of my students creates a repository (so once per assignment), GitHub adds it to my watch list. I'm apparently watching 1300 repos right now! The emails are overwhelming.

Namespaces in HTML documents is an area that I'm really pretty weak on. I found the HTML standard confusing and I haven't taken the time to dig in to what browsers do. HTML documents have nodes that live in one of six namespaces. HTML elements live in the HTML namespace, MathML elements live in the MathML namespace, and SVG elements live in the SVG namespace. The XLink, XML, and XMLNS namespaces are for attributes.

During parsing, math and svg start tags cause math and svg elements to be inserted in the MathML and SVG namespaces, respectively. Several places require attributes on foreign (i.e., not HTML) elements to adjust attributes such that some end up in a namespace. Normally, HTML attributes should not have namespaces. (I have no idea if we handle that correctly or not currently.)

Here's an example showing what I think should happen.

<!DOCTYPE html>
<svg viewBox="0 0 200 100" xmlns="http://www.w3.org/2000/svg">
  <title><span xml:lang="en-US">title</span></title>
  <text xml:lang="en-US" foo:bar="blah">This is some English text</text>
</svg>
<p xml:lang="en-US">More text.
  • The (parser-inserted) html, head, and body elements are in the HTML namespace with null prefixes.
  • The svg element is in the SVG namespace with null prefix.
  • The viewBox attribute has no namespace.
  • The xmlns attribute is in the XMLNS namespace with null prefix.
  • The title element is in the SVG namespace with null prefix.
  • The span element is in the HTML namespace with null prefix
  • The span's xml:lang attribute has local name xml:lang and no namespace.
  • The text element is in the SVG namespace with null prefix.
  • The text's xml:lang attribute is in the XML namespace with prefix xml and local name lang.
  • The text's foo:bar attribute has local name foo:bar and no namespace.
  • The p element is in the HTML namespace with null prefix.
  • The p's xml:lang attribute has local name xml:lang and no namespace.

I've bolded the parts that might be surprising.

As you can see, it's quite complicated. Some foreign elements, and only in some circumstances (namely when one of these conditions is satisfied), can contain HTML elements whose attributes never have namespaces and can contain colons. And only some attributes with colons in the name in foreign elements end up with namespaces.

(It's probably worth noting that this impacts XPath and XSLT.)

So the question becomes what behavior should the Nokogiri API have when adding nodes whose attributes contain colons?

As a point of comparison, I played around with JavaScript a little bit. If I create a 'svg' element, it puts it in the HTML namespace unless I explicitly tell it the namespace to use. If I create an attribute with the name xml:lang, it treats it as an attribute with no namespace. If I add that attribute to the svg element in the SVG namespace, it doesn't adjust the attributes.

element = document.createElementNS("http://www.w3.org/2000/svg", "svg");
element.namespaceURI; // "http://www.w3.org/2000/svg"
attr = document.createAttribute("xml:lang");
attr.localName; // "xml:lang"
attr.prefix; // null
attr.namespaceURI; // null
element.setAttributeNode(attr)
element.attributes[0].localName; // "xml:lang"
element.attributes[0].prefix; // null
element.attributes[0].namespaceURI; // null

I'd be inclined to follow that behavior. If you want an element/attribute in a namespace, you gotta construct it in that namespace.

stevecheckoway avatar Aug 14 '21 21:08 stevecheckoway

@stevecheckoway Thanks for the context dump, and no need to apologize for notification fatigue.

I'm going to take some time to read everything you wrote and make sure I understand it, because my lack of knowledge of HTML5 is showing.

flavorjones avatar Aug 16 '21 12:08 flavorjones

reviewed

aditiagarw4722 avatar Feb 03 '22 16:02 aditiagarw4722

Rebased onto currrent main

flavorjones avatar Aug 30 '22 16:08 flavorjones