nokogiri
nokogiri copied to clipboard
Namespace behavior during reparenting should match XSLT's `copy-of` semantics
Not all the namespaces referenced inside a XML node are carried over by Node#add_child
.
Reproducible testcase: https://gist.github.com/gioele/2c88ac73f4f28f79fbc6#file-add_child_ns-rb
Output:
== Original document
<?xml version="1.0"?>
<wrapper xmlns="ns" xmlns:extra="extra">
<record xml:id="r1">
<field>aaa</field>
<field extra:type="second">bbb</field>
</record>
</wrapper>
== New document
== The attribute `type` should be in namespace `extra`, but it is not
<?xml version="1.0"?>
<summary xmlns="summary">
<default:record xmlns:default="ns" xmlns="ns" xml:id="r1">
<field>aaa</field>
<field type="second">bbb</field>
</default:record>
</summary>
Tested with Ruby 2.1.2 and Nokogiri 1.6.3.1
BTW, I discovered this while searching for ways to create a standalone document from a XML node.
Hello Gioele,
Have you tried cloning the object before you called add_child
?
Does this fix it for you?
https://gist.github.com/kbrock/468f9f5e987e3b60155b/revisions
--Keenan
Hi Keenan, thanks for the tip. Yes, cloning the node before adding it fixed the problem. Was the problem my fault ("nodes are supposed to be cloned before added to other documents", but I cannot find any reference to this in the Nokogiri documentation) or a bug in Nokogiri?
What about the depth of the cloning? When using the default depth is not enough?
Hi @gioele, I think libxml2 has trouble with a single node being in 2 documents. Each document has a different namespace objects, even though the objects happen to have the same prefix and uri.
Personally, I was adding a node with children. The namespace of the parent node said default, but the child nodes were all correct. This was fixed with dup
or clone
. I don't think you need to be concerned with depth.
Best of luck, --Keenan
@kbrock , thank you very much for your reply. It resolves the same issue that I had. But issue is much worse because even adding nodes from same document can be broken if namespace is not defined in a parent element where node is being added. Sounds confusing. I modified original gist to create a reproducer: https://gist.github.com/akostadinov/367ce3855289cdb9bd3d83896f843b0f
It would be very nice if somebody can file an upstream libxml2
bug so that this is fixed. Ideally it should be handles properly on the lower level. Can't imagine anybody to want current behavior.
wrt nokogiri:
- it needs to be documented somewhere - that
dup
is safer unless one knows what he/she is doing - if possible detect with nokogiri that all namespaces would be defined when adding a node as a child and dup it if necessary
I'd like very much if @kbrock or @akostadinov would write a failing test case for the issue being described. A script isn't much use if a) I don't know what output you actually saw, nor b) what output you expected.
Please write a failing test to help me understand and fix the issue.
@flavorjones I converted my gist to use a comparison. Does this work for you?
#!/usr/bin/env ruby
require 'nokogiri'
src =<<EOD
<wrapper xmlns="ns" xmlns:extra="extra">
<record xml:id="r1">
<field>aaa</field>
<field extra:type="second">bbb</field>
</record>
</wrapper>
EOD
tgt =<<EOD
<?xml version="1.0"?>
<summary xmlns="summary">
<record xmlns="ns" xmlns:extra="extra" xml:id="r1">
<field>aaa</field>
<field extra:type="second">bbb</field>
</record>
</summary>
EOD
src_doc = Nokogiri::XML(src)
record = src_doc.at('//base:record', {'base' => "ns"})
dest_doc = Nokogiri::XML("<summary xmlns='summary'/>")
# FIX: dest_doc.root.add_child(record.clone)
dest_doc.root.add_child(record)
puts "oh no" if dest_doc.to_xml != tgt
@kbrock Great! Thank you for providing executable code. Now we can have a conversation!
Your test expects that the record
element, which in the original document only references namespaces, will have namespace definitions when it's reparented into a new document. In particular, the existence of a definition for xmlns:extra
is surprising to me. I'd like to poke at that assumption.
If I made similar node moves within the same document, what would your expectation be? Specifically, this code:
#!/usr/bin/env ruby
require 'nokogiri'
src =<<EOD
<root>
<wrapper xmlns="ns" xmlns:extra="extra">
<record xml:id="r1">
<field>aaa</field>
<field extra:type="second">bbb</field>
</record>
</wrapper>
<summary xmlns="summary">
</summary>
</root>
EOD
doc = Nokogiri::XML(src)
record = doc.at('//base:record', {'base' => "ns"})
dest_node = doc.at('//summary:summary', {'summary' => "summary"})
dest_node.add_child(record)
puts doc.to_xml
What do you expect the record
element to look like when it's reparented into summary
?
Here's what Nokogiri (libxml2) does as of 1.7.0.1 / 2.9.4:
<?xml version="1.0"?>
<root>
<wrapper xmlns="ns" xmlns:extra="extra">
</wrapper>
<summary xmlns="summary">
<record xml:id="r1">
<field>aaa</field>
<field extra:type="second">bbb</field>
</record></summary>
</root>
That is, the reparented node (originally using its parent node's default namespace) inherits the parent node's default namespace.
puts record.inspect
#<Nokogiri::XML::Element:0xb9193c name="record" namespace=#<Nokogiri::XML::Namespace:0xb8c220 href="summary"> attributes=[#<Nokogiri::XML::Attr:0xb9175c name="id" namespace=#<Nokogiri::XML::Namespace:0xb910cc prefix="xml" href="http://www.w3.org/XML/1998/namespace"> value="r1">] children=[#<Nokogiri::XML::Text:0xb90780 "\n ">, #<Nokogiri::XML::Element:0xb90500 name="field" namespace=#<Nokogiri::XML::Namespace:0xb8c220 href="summary"> children=[#<Nokogiri::XML::Text:0xb8fe5c "aaa">]>, #<Nokogiri::XML::Text:0xb8fad8 "\n ">, #<Nokogiri::XML::Element:0xb8f9ac name="field" namespace=#<Nokogiri::XML::Namespace:0xb8c220 href="summary"> attributes=[#<Nokogiri::XML::Attr:0xb8f8a8 name="type" namespace=#<Nokogiri::XML::Namespace:0xb8f268 prefix="extra" href="extra"> value="second">] children=[#<Nokogiri::XML::Text:0xb8e930 "bbb">]>, #<Nokogiri::XML::Text:0xb8e4e4 "\n ">]>
This seems correct to me, do you agree? I think it would be surprising behavior to define a namespace where it was only referencing one in its initial position in the document.
If you disagree, I'd like to understand what you think the appropriate behavior should be.
If you agree, then I'd like to understand why this behavior should be different when the node is moved to a new, separate document.
Also, please keep in mind that the XML spec is pretty much silent on the topic of moving nodes around -- especially so about moving nodes between documents -- so we're all really fumbling around in the dark. I would like the behavior we implement to be consistent, so please do let me know if you feel Nokogiri (or libxml2) is behaving inconsistently.
@flavorjones , I (possibly we) expect when node is added, to not be at the same time removed from it's original place. This is one thing that I understand might be hard to change at this state of the project without breaking too much existing code. It can be documented and would be mostly fine in my opinion.
But if we add node from another doc to this one (or move node to a different place in same doc), but namespace is not defined in the new place (or defined to something else), then obviously the node changed by the move operation. A different namespace of element is actually a different element. For this I certainly can't imagine a valid use case and can't imagine anybody being happy with or expecting it. Moving or adding, I believe everybody would expect the node at it's new home to have all necessary namespaces defines such that same xpath for example can find it.
@akostadinov We'll have to agree to disagree.
Project owners are the masters obviously. It would be interesting to know though what valid use cases you have for moving a node while namespaces are not retained or changed in an undefined way?
redoing this:
@flavorjones lets focus on namespaces, not on whether it was removed from the source.
I expect the namespaces to stay the same. expect extra
to be as is and not lost, and field/record to be in the ns
namespace.
Would you have expected the clone
to act differently in this case?
@kbrock Again, revisiting the difference between referencing a namespace and defining a namespace, the record
node referenced the namespace that is defined in the wrapper node. It has no meaning in the new context, particularly given that its own namespace is the default namespace from its parent. My point, really, is that any rationalization either way is not going to be based in anything in the XML spec. I picked one way; you would have picked the other. (Though I will note that the way you indicate makes sense to you would require Nokogiri to define a new namespace where one did not exist before, which makes it slightly more invasive.)
I'm not sure at this point whether continuing this conversation is at all constructive. We adopted some conventions around how reparenting nodes should work years ago; we tried to make it as consistent as we could; we may have failed. And, as I said, the XML spec is silent on how reparenting nodes should work, particularly when moving between documents.
I have reasons for why it works the way it does; I'm not sure it's constructive to go into those reasons, given we all seem pretty far apart on this issue.
And, as I said, the XML spec is silent on how reparenting nodes should work, particularly when moving between documents.
Maybe we could follow the XSLT lead, where node always carry all their namespace information available to them (definitions, references, defaults) regardless of where they are copied to with <copy>
and <copy-of/>
. XSLT processors take care of renaming xmlns prefixes so that the resulting document fragment is 100% equivalent to the original (even though it may have a different serialized form).
For reference: https://www.w3.org/TR/xslt20/#element-copy-of
The new element will also have namespace nodes copied from the original element node, unless they are excluded by specifying
copy-namespaces="no"
.
I believe this is what we're doing, to the best of the ability of the underlying parsers.
On Feb 14, 2017 4:08 AM, "Gioele" [email protected] wrote:
And, as I said, the XML spec is silent on how reparenting nodes should work, particularly when moving between documents.
Maybe we could follow the XSLT lead, where node always carry all their namespace information available to them (definitions, references, defaults) regardless of where they are copied to with
and . XSLT processors take care of renaming xmlns prefixes so that the resulting document fragment is 100% equivalent to the original (even though it may have a different serialized form). For reference: https://www.w3.org/TR/xslt20/#element-copy-of
The new element will also have namespace nodes copied from the original element node, unless they are excluded by specifying copy-namespaces="no".
— You are receiving this because you modified the open/close state. Reply to this email directly, view it on GitHub https://github.com/sparklemotion/nokogiri/issues/1200#issuecomment-279648755, or mute the thread https://github.com/notifications/unsubscribe-auth/AAAgDyPCHq_oDBYRpt6jIjyvKcQq0O3iks5rcW72gaJpZM4DAkBC .
Actually: I'd really like to understand what happens to namespaces if we use xslt to move the node in the original example. That might be a good precedent to look to. I'm definitely open to behavior changes here.
On Feb 14, 2017 7:13 AM, "Mike Dalessio" [email protected] wrote:
I believe this is what we're doing, to the best of the ability of the underlying parsers.
On Feb 14, 2017 4:08 AM, "Gioele" [email protected] wrote:
And, as I said, the XML spec is silent on how reparenting nodes should work, particularly when moving between documents.
Maybe we could follow the XSLT lead, where node always carry all their namespace information available to them (definitions, references, defaults) regardless of where they are copied to with
and . XSLT processors take care of renaming xmlns prefixes so that the resulting document fragment is 100% equivalent to the original (even though it may have a different serialized form). For reference: https://www.w3.org/TR/xslt20/#element-copy-of
The new element will also have namespace nodes copied from the original element node, unless they are excluded by specifying copy-namespaces="no" .
— You are receiving this because you modified the open/close state. Reply to this email directly, view it on GitHub https://github.com/sparklemotion/nokogiri/issues/1200#issuecomment-279648755, or mute the thread https://github.com/notifications/unsubscribe-auth/AAAgDyPCHq_oDBYRpt6jIjyvKcQq0O3iks5rcW72gaJpZM4DAkBC .
This is what add_child
+ clone
does, not what add_child
alone does.
As I user, I'd expect the add_child
to preserve all the namespace information. So add_child
should should act as add_child
+ clone
in case the node comes from another document. I see it as an implementation detail. Maybe I am I spoiled user. ;)
I think this issue should be changed only once a change has been done: either a change in the implementation of add_child
or a change in the documentation to add a note about this (for me) unexpected behavior.
Here's what XLST does, and I'll preface this by saying that I had no idea this was the behavior ...
#!/usr/bin/env ruby
require 'nokogiri'
xml = <<EOX
<root xmlns:root="http://common.parent.ns">
<orig-parent xmlns="orig-parent.default.ns"
xmlns:foo="http://orig-parent.foo.ns"
xmlns:bar="http://orig-parent.bar.ns"
xmlns:unused="http://orig-parent.unused.ns"
>
<root:child>
ns is common.parent.us
<bar:grandchild/>
</root:child>
<child>
ns is orig-parent.default.ns
<bar:grandchild/>
</child>
<foo:child>
ns is orig-parent.foo.ns
<bar:grandchild/>
</foo:child>
<bar:child>
ns is orig-parent.bar.ns
<bar:grandchild/>
</bar:child>
</orig-parent>
<final-parent-with-no-default-ns>
<replace-me/>
</final-parent-with-no-default-ns>
<final-parent-with-same-default-ns xmlns="orig-parent.default.ns">
<replace-me/>
</final-parent-with-same-default-ns>
<final-parent-with-different-default-ns xmlns="http://final.parent.default.ns">
<replace-me/>
</final-parent-with-different-default-ns>
</root>
EOX
xslt1 = <<-EOX
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:root="http://common.parent.ns"
xmlns:orig-parent="orig-parent.default.ns"
xmlns:foo="http://orig-parent.foo.ns"
xmlns:bar="http://orig-parent.bar.ns"
xmlns:unused="http://orig-parent.unused.ns"
xmlns:final-parent="http://final.parent.default.ns"
>
<xsl:output omit-xml-declaration="yes" indent="yes"/>
<xsl:strip-space elements="*"/>
<xsl:template match="node()|@*" name="identity">
<xsl:copy>
<xsl:apply-templates select="node()|@*"/>
</xsl:copy>
</xsl:template>
<xsl:template match="//final-parent-with-no-default-ns/replace-me">
<xsl:copy-of select="//*[local-name()='child']"/>
</xsl:template>
<xsl:template match="//orig-parent:final-parent-with-same-default-ns/orig-parent:replace-me">
<xsl:copy-of select="//*[local-name()='child']"/>
</xsl:template>
<xsl:template match="//final-parent:final-parent-with-different-default-ns/final-parent:replace-me">
<xsl:copy-of select="//*[local-name()='child']"/>
</xsl:template>
</xsl:stylesheet>
EOX
orig_doc = Nokogiri::XML xml
ss = Nokogiri::XSLT xslt1
final_doc = ss.transform orig_doc
puts final_doc.to_xml
output:
<?xml version="1.0"?>
<root xmlns:root="http://common.parent.ns">
<orig-parent xmlns="orig-parent.default.ns" xmlns:foo="http://orig-parent.foo.ns" xmlns:bar="http://orig-parent.bar.ns" xmlns:unused="http://orig-parent.unused.ns">
<root:child>
ns is common.parent.us
<bar:grandchild/></root:child>
<child>
ns is orig-parent.default.ns
<bar:grandchild/></child>
<foo:child>
ns is orig-parent.foo.ns
<bar:grandchild/></foo:child>
<bar:child>
ns is orig-parent.bar.ns
<bar:grandchild/></bar:child>
</orig-parent>
<final-parent-with-no-default-ns>
<root:child xmlns="orig-parent.default.ns" xmlns:foo="http://orig-parent.foo.ns" xmlns:bar="http://orig-parent.bar.ns" xmlns:unused="http://orig-parent.unused.ns">
ns is common.parent.us
<bar:grandchild/></root:child>
<child xmlns="orig-parent.default.ns" xmlns:foo="http://orig-parent.foo.ns" xmlns:bar="http://orig-parent.bar.ns" xmlns:unused="http://orig-parent.unused.ns">
ns is orig-parent.default.ns
<bar:grandchild/></child>
<foo:child xmlns="orig-parent.default.ns" xmlns:foo="http://orig-parent.foo.ns" xmlns:bar="http://orig-parent.bar.ns" xmlns:unused="http://orig-parent.unused.ns">
ns is orig-parent.foo.ns
<bar:grandchild/></foo:child>
<bar:child xmlns="orig-parent.default.ns" xmlns:foo="http://orig-parent.foo.ns" xmlns:bar="http://orig-parent.bar.ns" xmlns:unused="http://orig-parent.unused.ns">
ns is orig-parent.bar.ns
<bar:grandchild/></bar:child>
</final-parent-with-no-default-ns>
<final-parent-with-same-default-ns xmlns="orig-parent.default.ns">
<root:child xmlns:foo="http://orig-parent.foo.ns" xmlns:bar="http://orig-parent.bar.ns" xmlns:unused="http://orig-parent.unused.ns">
ns is common.parent.us
<bar:grandchild/></root:child>
<child xmlns:foo="http://orig-parent.foo.ns" xmlns:bar="http://orig-parent.bar.ns" xmlns:unused="http://orig-parent.unused.ns">
ns is orig-parent.default.ns
<bar:grandchild/></child>
<foo:child xmlns:foo="http://orig-parent.foo.ns" xmlns:bar="http://orig-parent.bar.ns" xmlns:unused="http://orig-parent.unused.ns">
ns is orig-parent.foo.ns
<bar:grandchild/></foo:child>
<bar:child xmlns:foo="http://orig-parent.foo.ns" xmlns:bar="http://orig-parent.bar.ns" xmlns:unused="http://orig-parent.unused.ns">
ns is orig-parent.bar.ns
<bar:grandchild/></bar:child>
</final-parent-with-same-default-ns>
<final-parent-with-different-default-ns xmlns="http://final.parent.default.ns">
<root:child xmlns="orig-parent.default.ns" xmlns:foo="http://orig-parent.foo.ns" xmlns:bar="http://orig-parent.bar.ns" xmlns:unused="http://orig-parent.unused.ns">
ns is common.parent.us
<bar:grandchild/></root:child>
<child xmlns="orig-parent.default.ns" xmlns:foo="http://orig-parent.foo.ns" xmlns:bar="http://orig-parent.bar.ns" xmlns:unused="http://orig-parent.unused.ns">
ns is orig-parent.default.ns
<bar:grandchild/></child>
<foo:child xmlns="orig-parent.default.ns" xmlns:foo="http://orig-parent.foo.ns" xmlns:bar="http://orig-parent.bar.ns" xmlns:unused="http://orig-parent.unused.ns">
ns is orig-parent.foo.ns
<bar:grandchild/></foo:child>
<bar:child xmlns="orig-parent.default.ns" xmlns:foo="http://orig-parent.foo.ns" xmlns:bar="http://orig-parent.bar.ns" xmlns:unused="http://orig-parent.unused.ns">
ns is orig-parent.bar.ns
<bar:grandchild/></bar:child>
</final-parent-with-different-default-ns>
</root>
I think this behavior makes sense, and I'm open to adjusting Nokogiri's DOM node replacement behavior to match (though I worry that proper namespace-handling may be expensive, we'll have to see).
Thank you @flavorjones for the insightful and thorough example. I also wonder how expensive that would be to implement.