nokogiri icon indicating copy to clipboard operation
nokogiri copied to clipboard

Namespace behavior during reparenting should match XSLT's `copy-of` semantics

Open gioele opened this issue 9 years ago • 18 comments

Not all the namespaces referenced inside a XML node are carried over by Node#add_child.

Reproducible testcase: https://gist.github.com/gioele/2c88ac73f4f28f79fbc6#file-add_child_ns-rb

Output:

== Original document
<?xml version="1.0"?>
<wrapper xmlns="ns" xmlns:extra="extra">
    <record xml:id="r1">
        <field>aaa</field>
        <field extra:type="second">bbb</field>
    </record>
</wrapper>

== New document
== The attribute `type` should be in namespace `extra`, but it is not
<?xml version="1.0"?>
<summary xmlns="summary">
  <default:record xmlns:default="ns" xmlns="ns" xml:id="r1">
        <field>aaa</field>
        <field type="second">bbb</field>
    </default:record>
</summary>

Tested with Ruby 2.1.2 and Nokogiri 1.6.3.1

BTW, I discovered this while searching for ways to create a standalone document from a XML node.

gioele avatar Nov 25 '14 18:11 gioele

Hello Gioele,

Have you tried cloning the object before you called add_child?

Does this fix it for you?

https://gist.github.com/kbrock/468f9f5e987e3b60155b/revisions

--Keenan

kbrock avatar Dec 10 '14 17:12 kbrock

Hi Keenan, thanks for the tip. Yes, cloning the node before adding it fixed the problem. Was the problem my fault ("nodes are supposed to be cloned before added to other documents", but I cannot find any reference to this in the Nokogiri documentation) or a bug in Nokogiri?

What about the depth of the cloning? When using the default depth is not enough?

gioele avatar Dec 11 '14 10:12 gioele

Hi @gioele, I think libxml2 has trouble with a single node being in 2 documents. Each document has a different namespace objects, even though the objects happen to have the same prefix and uri.

Personally, I was adding a node with children. The namespace of the parent node said default, but the child nodes were all correct. This was fixed with dup or clone. I don't think you need to be concerned with depth.

Best of luck, --Keenan

kbrock avatar Dec 11 '14 13:12 kbrock

@kbrock , thank you very much for your reply. It resolves the same issue that I had. But issue is much worse because even adding nodes from same document can be broken if namespace is not defined in a parent element where node is being added. Sounds confusing. I modified original gist to create a reproducer: https://gist.github.com/akostadinov/367ce3855289cdb9bd3d83896f843b0f

It would be very nice if somebody can file an upstream libxml2 bug so that this is fixed. Ideally it should be handles properly on the lower level. Can't imagine anybody to want current behavior.

wrt nokogiri:

  • it needs to be documented somewhere - that dup is safer unless one knows what he/she is doing
  • if possible detect with nokogiri that all namespaces would be defined when adding a node as a child and dup it if necessary

akostadinov avatar Jun 09 '16 09:06 akostadinov

I'd like very much if @kbrock or @akostadinov would write a failing test case for the issue being described. A script isn't much use if a) I don't know what output you actually saw, nor b) what output you expected.

Please write a failing test to help me understand and fix the issue.

flavorjones avatar Feb 10 '17 10:02 flavorjones

@flavorjones I converted my gist to use a comparison. Does this work for you?

#!/usr/bin/env ruby

require 'nokogiri'

src =<<EOD
<wrapper xmlns="ns" xmlns:extra="extra">
    <record xml:id="r1">
        <field>aaa</field>
        <field extra:type="second">bbb</field>
    </record>
</wrapper>
EOD

tgt =<<EOD
<?xml version="1.0"?>
<summary xmlns="summary">
  <record xmlns="ns" xmlns:extra="extra" xml:id="r1">
        <field>aaa</field>
        <field extra:type="second">bbb</field>
    </record>
</summary>
EOD

src_doc = Nokogiri::XML(src)
record = src_doc.at('//base:record', {'base' => "ns"})
dest_doc = Nokogiri::XML("<summary xmlns='summary'/>")
# FIX: dest_doc.root.add_child(record.clone)
dest_doc.root.add_child(record)

puts "oh no" if dest_doc.to_xml != tgt

kbrock avatar Feb 10 '17 20:02 kbrock

@kbrock Great! Thank you for providing executable code. Now we can have a conversation!

Your test expects that the record element, which in the original document only references namespaces, will have namespace definitions when it's reparented into a new document. In particular, the existence of a definition for xmlns:extra is surprising to me. I'd like to poke at that assumption.

If I made similar node moves within the same document, what would your expectation be? Specifically, this code:

#!/usr/bin/env ruby

require 'nokogiri'

src =<<EOD
<root>
  <wrapper xmlns="ns" xmlns:extra="extra">
    <record xml:id="r1">
      <field>aaa</field>
      <field extra:type="second">bbb</field>
    </record>
  </wrapper>
  <summary xmlns="summary">
  </summary>
</root>
EOD

doc = Nokogiri::XML(src)
record = doc.at('//base:record', {'base' => "ns"})
dest_node = doc.at('//summary:summary', {'summary' => "summary"})
dest_node.add_child(record)

puts doc.to_xml

What do you expect the record element to look like when it's reparented into summary?

Here's what Nokogiri (libxml2) does as of 1.7.0.1 / 2.9.4:

<?xml version="1.0"?>
<root>
  <wrapper xmlns="ns" xmlns:extra="extra">
    
  </wrapper>
  <summary xmlns="summary">
  <record xml:id="r1">
      <field>aaa</field>
      <field extra:type="second">bbb</field>
    </record></summary>
</root>

That is, the reparented node (originally using its parent node's default namespace) inherits the parent node's default namespace.

puts record.inspect
#<Nokogiri::XML::Element:0xb9193c name="record" namespace=#<Nokogiri::XML::Namespace:0xb8c220 href="summary"> attributes=[#<Nokogiri::XML::Attr:0xb9175c name="id" namespace=#<Nokogiri::XML::Namespace:0xb910cc prefix="xml" href="http://www.w3.org/XML/1998/namespace"> value="r1">] children=[#<Nokogiri::XML::Text:0xb90780 "\n      ">, #<Nokogiri::XML::Element:0xb90500 name="field" namespace=#<Nokogiri::XML::Namespace:0xb8c220 href="summary"> children=[#<Nokogiri::XML::Text:0xb8fe5c "aaa">]>, #<Nokogiri::XML::Text:0xb8fad8 "\n      ">, #<Nokogiri::XML::Element:0xb8f9ac name="field" namespace=#<Nokogiri::XML::Namespace:0xb8c220 href="summary"> attributes=[#<Nokogiri::XML::Attr:0xb8f8a8 name="type" namespace=#<Nokogiri::XML::Namespace:0xb8f268 prefix="extra" href="extra"> value="second">] children=[#<Nokogiri::XML::Text:0xb8e930 "bbb">]>, #<Nokogiri::XML::Text:0xb8e4e4 "\n    ">]>

This seems correct to me, do you agree? I think it would be surprising behavior to define a namespace where it was only referencing one in its initial position in the document.

If you disagree, I'd like to understand what you think the appropriate behavior should be.

If you agree, then I'd like to understand why this behavior should be different when the node is moved to a new, separate document.

Also, please keep in mind that the XML spec is pretty much silent on the topic of moving nodes around -- especially so about moving nodes between documents -- so we're all really fumbling around in the dark. I would like the behavior we implement to be consistent, so please do let me know if you feel Nokogiri (or libxml2) is behaving inconsistently.

flavorjones avatar Feb 13 '17 06:02 flavorjones

@flavorjones , I (possibly we) expect when node is added, to not be at the same time removed from it's original place. This is one thing that I understand might be hard to change at this state of the project without breaking too much existing code. It can be documented and would be mostly fine in my opinion.

But if we add node from another doc to this one (or move node to a different place in same doc), but namespace is not defined in the new place (or defined to something else), then obviously the node changed by the move operation. A different namespace of element is actually a different element. For this I certainly can't imagine a valid use case and can't imagine anybody being happy with or expecting it. Moving or adding, I believe everybody would expect the node at it's new home to have all necessary namespaces defines such that same xpath for example can find it.

akostadinov avatar Feb 13 '17 14:02 akostadinov

@akostadinov We'll have to agree to disagree.

flavorjones avatar Feb 13 '17 14:02 flavorjones

Project owners are the masters obviously. It would be interesting to know though what valid use cases you have for moving a node while namespaces are not retained or changed in an undefined way?

akostadinov avatar Feb 13 '17 14:02 akostadinov

redoing this:

@flavorjones lets focus on namespaces, not on whether it was removed from the source.

I expect the namespaces to stay the same. expect extra to be as is and not lost, and field/record to be in the ns namespace.

Would you have expected the clone to act differently in this case?

kbrock avatar Feb 13 '17 16:02 kbrock

@kbrock Again, revisiting the difference between referencing a namespace and defining a namespace, the record node referenced the namespace that is defined in the wrapper node. It has no meaning in the new context, particularly given that its own namespace is the default namespace from its parent. My point, really, is that any rationalization either way is not going to be based in anything in the XML spec. I picked one way; you would have picked the other. (Though I will note that the way you indicate makes sense to you would require Nokogiri to define a new namespace where one did not exist before, which makes it slightly more invasive.)

I'm not sure at this point whether continuing this conversation is at all constructive. We adopted some conventions around how reparenting nodes should work years ago; we tried to make it as consistent as we could; we may have failed. And, as I said, the XML spec is silent on how reparenting nodes should work, particularly when moving between documents.

I have reasons for why it works the way it does; I'm not sure it's constructive to go into those reasons, given we all seem pretty far apart on this issue.

flavorjones avatar Feb 14 '17 02:02 flavorjones

And, as I said, the XML spec is silent on how reparenting nodes should work, particularly when moving between documents.

Maybe we could follow the XSLT lead, where node always carry all their namespace information available to them (definitions, references, defaults) regardless of where they are copied to with <copy> and <copy-of/>. XSLT processors take care of renaming xmlns prefixes so that the resulting document fragment is 100% equivalent to the original (even though it may have a different serialized form).

For reference: https://www.w3.org/TR/xslt20/#element-copy-of

The new element will also have namespace nodes copied from the original element node, unless they are excluded by specifying copy-namespaces="no".

gioele avatar Feb 14 '17 09:02 gioele

I believe this is what we're doing, to the best of the ability of the underlying parsers.

On Feb 14, 2017 4:08 AM, "Gioele" [email protected] wrote:

And, as I said, the XML spec is silent on how reparenting nodes should work, particularly when moving between documents.

Maybe we could follow the XSLT lead, where node always carry all their namespace information available to them (definitions, references, defaults) regardless of where they are copied to with and . XSLT processors take care of renaming xmlns prefixes so that the resulting document fragment is 100% equivalent to the original (even though it may have a different serialized form).

For reference: https://www.w3.org/TR/xslt20/#element-copy-of

The new element will also have namespace nodes copied from the original element node, unless they are excluded by specifying copy-namespaces="no".

— You are receiving this because you modified the open/close state. Reply to this email directly, view it on GitHub https://github.com/sparklemotion/nokogiri/issues/1200#issuecomment-279648755, or mute the thread https://github.com/notifications/unsubscribe-auth/AAAgDyPCHq_oDBYRpt6jIjyvKcQq0O3iks5rcW72gaJpZM4DAkBC .

flavorjones avatar Feb 14 '17 12:02 flavorjones

Actually: I'd really like to understand what happens to namespaces if we use xslt to move the node in the original example. That might be a good precedent to look to. I'm definitely open to behavior changes here.

On Feb 14, 2017 7:13 AM, "Mike Dalessio" [email protected] wrote:

I believe this is what we're doing, to the best of the ability of the underlying parsers.

On Feb 14, 2017 4:08 AM, "Gioele" [email protected] wrote:

And, as I said, the XML spec is silent on how reparenting nodes should work, particularly when moving between documents.

Maybe we could follow the XSLT lead, where node always carry all their namespace information available to them (definitions, references, defaults) regardless of where they are copied to with and . XSLT processors take care of renaming xmlns prefixes so that the resulting document fragment is 100% equivalent to the original (even though it may have a different serialized form).

For reference: https://www.w3.org/TR/xslt20/#element-copy-of

The new element will also have namespace nodes copied from the original element node, unless they are excluded by specifying copy-namespaces="no" .

— You are receiving this because you modified the open/close state. Reply to this email directly, view it on GitHub https://github.com/sparklemotion/nokogiri/issues/1200#issuecomment-279648755, or mute the thread https://github.com/notifications/unsubscribe-auth/AAAgDyPCHq_oDBYRpt6jIjyvKcQq0O3iks5rcW72gaJpZM4DAkBC .

flavorjones avatar Feb 14 '17 12:02 flavorjones

This is what add_child + clone does, not what add_child alone does.

As I user, I'd expect the add_child to preserve all the namespace information. So add_child should should act as add_child + clone in case the node comes from another document. I see it as an implementation detail. Maybe I am I spoiled user. ;)

I think this issue should be changed only once a change has been done: either a change in the implementation of add_child or a change in the documentation to add a note about this (for me) unexpected behavior.

gioele avatar Feb 14 '17 12:02 gioele

Here's what XLST does, and I'll preface this by saying that I had no idea this was the behavior ...

#!/usr/bin/env ruby

require 'nokogiri'

xml = <<EOX
<root xmlns:root="http://common.parent.ns">
  <orig-parent xmlns="orig-parent.default.ns"
               xmlns:foo="http://orig-parent.foo.ns"
               xmlns:bar="http://orig-parent.bar.ns"
               xmlns:unused="http://orig-parent.unused.ns"
    >
    <root:child>
      ns is common.parent.us
      <bar:grandchild/>
    </root:child>

    <child>
      ns is orig-parent.default.ns
      <bar:grandchild/>
    </child>

    <foo:child>
      ns is orig-parent.foo.ns
      <bar:grandchild/>
    </foo:child>

    <bar:child>
      ns is orig-parent.bar.ns
      <bar:grandchild/>
    </bar:child>
  </orig-parent>

  <final-parent-with-no-default-ns>
    <replace-me/>
  </final-parent-with-no-default-ns>

  <final-parent-with-same-default-ns xmlns="orig-parent.default.ns">
    <replace-me/>
  </final-parent-with-same-default-ns>

  <final-parent-with-different-default-ns xmlns="http://final.parent.default.ns">
    <replace-me/>
  </final-parent-with-different-default-ns>
</root>
EOX

xslt1 = <<-EOX
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform"

                xmlns:root="http://common.parent.ns"
                xmlns:orig-parent="orig-parent.default.ns"
                xmlns:foo="http://orig-parent.foo.ns"
                xmlns:bar="http://orig-parent.bar.ns"
                xmlns:unused="http://orig-parent.unused.ns"
                xmlns:final-parent="http://final.parent.default.ns"
 >
  <xsl:output omit-xml-declaration="yes" indent="yes"/>
  <xsl:strip-space elements="*"/>

  <xsl:template match="node()|@*" name="identity">
    <xsl:copy>
      <xsl:apply-templates select="node()|@*"/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="//final-parent-with-no-default-ns/replace-me">
    <xsl:copy-of select="//*[local-name()='child']"/>
  </xsl:template>

  <xsl:template match="//orig-parent:final-parent-with-same-default-ns/orig-parent:replace-me">
    <xsl:copy-of select="//*[local-name()='child']"/>
  </xsl:template>

  <xsl:template match="//final-parent:final-parent-with-different-default-ns/final-parent:replace-me">
    <xsl:copy-of select="//*[local-name()='child']"/>
  </xsl:template>

</xsl:stylesheet>
EOX

orig_doc = Nokogiri::XML xml
ss = Nokogiri::XSLT xslt1
final_doc = ss.transform orig_doc

puts final_doc.to_xml

output:

<?xml version="1.0"?>
<root xmlns:root="http://common.parent.ns">
  <orig-parent xmlns="orig-parent.default.ns" xmlns:foo="http://orig-parent.foo.ns" xmlns:bar="http://orig-parent.bar.ns" xmlns:unused="http://orig-parent.unused.ns">
    <root:child>
      ns is common.parent.us
      <bar:grandchild/></root:child>
    <child>
      ns is orig-parent.default.ns
      <bar:grandchild/></child>
    <foo:child>
      ns is orig-parent.foo.ns
      <bar:grandchild/></foo:child>
    <bar:child>
      ns is orig-parent.bar.ns
      <bar:grandchild/></bar:child>
  </orig-parent>
  <final-parent-with-no-default-ns>
    <root:child xmlns="orig-parent.default.ns" xmlns:foo="http://orig-parent.foo.ns" xmlns:bar="http://orig-parent.bar.ns" xmlns:unused="http://orig-parent.unused.ns">
      ns is common.parent.us
      <bar:grandchild/></root:child>
    <child xmlns="orig-parent.default.ns" xmlns:foo="http://orig-parent.foo.ns" xmlns:bar="http://orig-parent.bar.ns" xmlns:unused="http://orig-parent.unused.ns">
      ns is orig-parent.default.ns
      <bar:grandchild/></child>
    <foo:child xmlns="orig-parent.default.ns" xmlns:foo="http://orig-parent.foo.ns" xmlns:bar="http://orig-parent.bar.ns" xmlns:unused="http://orig-parent.unused.ns">
      ns is orig-parent.foo.ns
      <bar:grandchild/></foo:child>
    <bar:child xmlns="orig-parent.default.ns" xmlns:foo="http://orig-parent.foo.ns" xmlns:bar="http://orig-parent.bar.ns" xmlns:unused="http://orig-parent.unused.ns">
      ns is orig-parent.bar.ns
      <bar:grandchild/></bar:child>
  </final-parent-with-no-default-ns>
  <final-parent-with-same-default-ns xmlns="orig-parent.default.ns">
    <root:child xmlns:foo="http://orig-parent.foo.ns" xmlns:bar="http://orig-parent.bar.ns" xmlns:unused="http://orig-parent.unused.ns">
      ns is common.parent.us
      <bar:grandchild/></root:child>
    <child xmlns:foo="http://orig-parent.foo.ns" xmlns:bar="http://orig-parent.bar.ns" xmlns:unused="http://orig-parent.unused.ns">
      ns is orig-parent.default.ns
      <bar:grandchild/></child>
    <foo:child xmlns:foo="http://orig-parent.foo.ns" xmlns:bar="http://orig-parent.bar.ns" xmlns:unused="http://orig-parent.unused.ns">
      ns is orig-parent.foo.ns
      <bar:grandchild/></foo:child>
    <bar:child xmlns:foo="http://orig-parent.foo.ns" xmlns:bar="http://orig-parent.bar.ns" xmlns:unused="http://orig-parent.unused.ns">
      ns is orig-parent.bar.ns
      <bar:grandchild/></bar:child>
  </final-parent-with-same-default-ns>
  <final-parent-with-different-default-ns xmlns="http://final.parent.default.ns">
    <root:child xmlns="orig-parent.default.ns" xmlns:foo="http://orig-parent.foo.ns" xmlns:bar="http://orig-parent.bar.ns" xmlns:unused="http://orig-parent.unused.ns">
      ns is common.parent.us
      <bar:grandchild/></root:child>
    <child xmlns="orig-parent.default.ns" xmlns:foo="http://orig-parent.foo.ns" xmlns:bar="http://orig-parent.bar.ns" xmlns:unused="http://orig-parent.unused.ns">
      ns is orig-parent.default.ns
      <bar:grandchild/></child>
    <foo:child xmlns="orig-parent.default.ns" xmlns:foo="http://orig-parent.foo.ns" xmlns:bar="http://orig-parent.bar.ns" xmlns:unused="http://orig-parent.unused.ns">
      ns is orig-parent.foo.ns
      <bar:grandchild/></foo:child>
    <bar:child xmlns="orig-parent.default.ns" xmlns:foo="http://orig-parent.foo.ns" xmlns:bar="http://orig-parent.bar.ns" xmlns:unused="http://orig-parent.unused.ns">
      ns is orig-parent.bar.ns
      <bar:grandchild/></bar:child>
  </final-parent-with-different-default-ns>
</root>

I think this behavior makes sense, and I'm open to adjusting Nokogiri's DOM node replacement behavior to match (though I worry that proper namespace-handling may be expensive, we'll have to see).

flavorjones avatar Feb 14 '17 14:02 flavorjones

Thank you @flavorjones for the insightful and thorough example. I also wonder how expensive that would be to implement.

kbrock avatar Feb 14 '17 17:02 kbrock