nokogiri icon indicating copy to clipboard operation
nokogiri copied to clipboard

Attribute inconsistency caused by `remove_namespaces!`

Open DavidEGrayson opened this issue 11 years ago • 4 comments

Hello. I'd like to report a bug in Nokogiri that is exhibited by the following code:

require 'nokogiri'
doc = Nokogiri::XML('<a b:c="" b:d=""/>')
doc.remove_namespaces!
node = doc.xpath('//a').first
p node
p node['d']
p node.attributes['d'].value

I would expect the last two lines to both print "" to the console, because I would think that they are equivalent ways to access the "d" attribute.

When I run the code in JRuby ( jruby 1.7.13 (1.9.3p392) 2014-06-24 43f133c on Java HotSpot(TM) Client VM 1.7.0_55-b14 [Windows 8-x86] ), I get the following output:

#<Nokogiri::XML::Element:0x832 name="a" attributes=[#<Nokogiri::XML::Attr:0x82e name="d">, #<Nokogiri::XML::Attr:0x830 name="c">]>
nil
""

When I run the code in MRI ( ruby 2.1.2p95 (2014-05-08 revision 45877) [i686-linux] ) I get the following output:

#<Nokogiri::XML::Element:0x..fdc44715c name="a" attributes=[#<Nokogiri::XML::Attr:0x..fdc4470d0 name="b:c">, #<Nokogiri::XML::Attr:0x..fdc4470c6 name="b:d">]>
nil
/c/Users/David/Documents/rpicstatic/test2.rb:7:in `<main>': undefined method `value' for nil:NilClass (NoMethodError)

I think that the remove_namespaces! method is buggy because the problem seems to go away if I stop calling that and refer to everything by its full name.

In both cases, I was using Nokogiri version 1.6.3.1.

DavidEGrayson avatar Sep 09 '14 03:09 DavidEGrayson

Hi,

Thanks for reporting this. I agree that this is probably a bug in how #remove_namespaces! is hacking away at the DOM. Thanks for providing a test case; we'll look into fixing for 1.6.6.

flavorjones avatar Dec 02 '14 20:12 flavorjones

OK, this is a weird one, and there are a couple of things being conflated. Let me try to break them out one by one, and then we'll address the inconsistent behavior.

But first, let me point out that the only bug here is inconsistent behavior, as we invented #remove_namespaces!, e.g. it's not defined in the XML specification.

First, let's examine the MRI behavior for your example, but let me assert on the setup:

doc = Nokogiri::XML <<EOXML
  <a b:c="thing-c" b:d="thing-d"/>
EOXML
puts "errors: #{doc.errors}"

node = doc.xpath('//a').first
puts node.inspect
puts node['d'].inspect
puts node.attributes['d'].inspect

doc.remove_namespaces!
puts "---"

puts node.inspect
puts node['d'].inspect
puts node.attributes['d'].inspect

MRI outputs:

errors: [#<Nokogiri::XML::SyntaxError: Namespace prefix b for c on a is not defined>, #<Nokogiri::XML::SyntaxError: Namespace prefix b for d on a is not defined>]
#<Nokogiri::XML::Element:0xff2ee0 name="a" attributes=[#<Nokogiri::XML::Attr:0xff2ddc name="b:c" value="thing-c">, #<Nokogiri::XML::Attr:0xff2dc8 name="b:d" value="thing-d">]>
nil
nil
---
#<Nokogiri::XML::Element:0xff2ee0 name="a" attributes=[#<Nokogiri::XML::Attr:0xff2ddc name="b:c" value="thing-c">, #<Nokogiri::XML::Attr:0xff2dc8 name="b:d" value="thing-d">]>
nil
nil

So you should notice two things:

  1. there are namespace errors when parsing the document, and
  2. the state of the document is not changed by calling #remove_namespaces!

Now let's see the JRuby output:

errors: [#<Nokogiri::XML::SyntaxError: The prefix "b" for attribute "b:c" associated with an element type "a" is not bound.>, #<Nokogiri::XML::SyntaxError: The prefix "b" for attribute "b:d" associated with an element type "a" is not bound.>]
#<Nokogiri::XML::Element:0x7d4 name="a" attributes=[#<Nokogiri::XML::Attr:0x7d0 name="c" value="thing-c">, #<Nokogiri::XML::Attr:0x7d2 name="d" value="thing-d">]>
nil
#<Nokogiri::XML::Attr:0x7d2 name="d" value="thing-d">
---
#<Nokogiri::XML::Element:0x7d4 name="a" attributes=[#<Nokogiri::XML::Attr:0x7d2 name="d" value="thing-d">, #<Nokogiri::XML::Attr:0x7d0 name="c" value="thing-c">]>
nil
#<Nokogiri::XML::Attr:0x7d2 name="d" value="thing-d">

You'll notice that JRuby is inconsistent with MRI when parsing a document with incorrectly-namespaced entities. MRI/libxml2 will name the attribute "b:d", while JRuby/xerces will name it "d". There are other differences in how the parsers "fix up" broken markup, and this is just one example.

The historical convention on this project, which you may or may not agree with, is to treat inconsistencies in how broken markup is fixed as not-a-bug.

But let's fix up the markup to correctly declare namespaces, and examine the output from each port:

doc = Nokogiri::XML <<EOXML
<root xmlns:a="http://a.flavorjon.es/" xmlns:b="http://b.flavorjon.es/">
  <a b:c="thing-c" b:d="thing-d"/>
</root>
EOXML
puts "errors: #{doc.errors}"

node = doc.xpath('//a').first
puts node.inspect
puts node['d'].inspect
puts node.attributes['d'].inspect

doc.remove_namespaces!
puts "---"

puts node.inspect
puts node['d'].inspect
puts node.attributes['d'].inspect

MRI:

errors: []
#<Nokogiri::XML::Element:0x131d200 name="a" attributes=[#<Nokogiri::XML::Attr:0x131d19c name="c" namespace=#<Nokogiri::XML::Namespace:0x131d534 prefix="b" href="http://b.flavorjon.es/"> value="thing-c">, #<Nokogiri::XML::Attr:0x131d188 name="d" namespace=#<Nokogiri::XML::Namespace:0x131d534 prefix="b" href="http://b.flavorjon.es/"> value="thing-d">]>
nil
#<Nokogiri::XML::Attr:0x131d188 name="d" namespace=#<Nokogiri::XML::Namespace:0x131d534 prefix="b" href="http://b.flavorjon.es/"> value="thing-d">
---
#<Nokogiri::XML::Element:0x131d200 name="a" attributes=[#<Nokogiri::XML::Attr:0x131d19c name="c" value="thing-c">, #<Nokogiri::XML::Attr:0x131d188 name="d" value="thing-d">]>
"thing-d"
#<Nokogiri::XML::Attr:0x131d188 name="d" value="thing-d">

JRuby:

errors: []
#<Nokogiri::XML::Element:0x7d6 name="a"attributes=[#<Nokogiri::XML::Attr:0x7d2 name="c" namespace=#<Nokogiri::XML::Namespace:0x7d0 prefix="b" href="http://b.flavorjon.es/"> value="thing-c">, #<Nokogiri::XML::Attr:0x7d4 name="d" namespace=#<Nokogiri::XML::Namespace:0x7d0 prefix="b" href="http://b.flavorjon.es/"> value="thing-d">]>
nil
#<Nokogiri::XML::Attr:0x7d4 name="d" namespace=#<Nokogiri::XML::Namespace:0x7d0 prefix="b" href="http://b.flavorjon.es/"> value="thing-d">
---
#<Nokogiri::XML::Element:0x7d6 name="a" attributes=[#<Nokogiri::XML::Attr:0x7d4 name="d" value="thing-d">, #<Nokogiri::XML::Attr:0x7d2 name="c" value="thing-c">]>
nil
#<Nokogiri::XML::Attr:0x7d4 name="d" value="thing-d">

You'll notice that the behavior is somewhat inconsistent, in that after #remove_namespaces the structures match between MRI and JRuby. However, the output of node['d'] differs in that MRI returns "thing-d" and JRuby return nil. This is a bug.

flavorjones avatar Jan 21 '15 01:01 flavorjones

Thanks for looking into it and identifying what the bug really is.

DavidEGrayson avatar Mar 09 '15 15:03 DavidEGrayson

I suspect this is related to the issue described in #1012

flavorjones avatar Jul 02 '24 18:07 flavorjones