nokogiri NodeSet does not follow ruby conventions for enumerable methods

trafficstars

Using ~/.rbenv/versions/2.4.1/lib/ruby/gems/2.4.0/gems/nokogiri-1.8.1

In general, the ruby enumerable methods, like Array.select, return a new Array, leaving the original untouched. For example,

x = [*1..10]
#=> [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = x.select {|i| i % 2 == 0 }
#=> [2, 4, 6, 8, 10]
x
#=> [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

Similarly, another enumerable operation does not touch the original:

x = [*1..10]
#=> [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = x.reject {|i| i % 2 == 0 }
=> [1, 3, 5, 7, 9]
x
#=> [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

It's reasonable to expect the same behavior of enumerable methods on the NodeSet, right? Let's see.

Create two sets of XML, with a duplicate REC element in each (WOS:A2):

recordsA = <<-XML_A
        <records>
          <REC><UID>WOS:A1</UID></REC>
          <REC><UID>WOS:A2</UID></REC>
        </records>
XML_A
recordsB = <<-XML_B
        <records>
          <REC><UID>WOS:A2</UID></REC>
          <REC><UID>WOS:B2</UID></REC>
        </records>
XML_B

docA = Nokogiri::XML(recordsA) { |config| config.strict.noblanks }
docB = Nokogiri::XML(recordsB) { |config| config.strict.noblanks }

rec_nodesA = docA.search('REC')
rec_nodesB = docB.search('REC')
rec_nodesA.class
#=> Nokogiri::XML::NodeSet

uidsA = rec_nodesA.map { |rec| rec.search('UID').first.text }
#=> ["WOS:A1", "WOS:A2"]
uidsB = rec_nodesB.map { |rec| rec.search('UID').first.text }
#=> ["WOS:A2", "WOS:B2"]
uidsA.class
#=> Array  # note that it doesn't return a NodeSet
uid_intersection = uidsA.to_set.intersection(uidsB.to_set)
#=> #<Set: {"WOS:A2"}>   # OK, no problem here

Now lets try to gather the duplicate records into a new NodeSet.

duplicates = rec_nodesB.select { |rec| uid_intersection.include?(rec.search('UID').first.text) }
#=> [#(Element:0x2ab559c1d548 { name = "REC", children = [ #(Element:0x2ab559c21634 { name = "UID", children = [ #(Text "WOS:A2")] })] })]

dup_nodes = Nokogiri::XML::NodeSet.new(Nokogiri::XML::Document.new, duplicates)
#=> [#<Nokogiri::XML::Element:0x2ab559c1d548 name="REC" children=[#<Nokogiri::XML::Element:0x2ab559c21634 name="UID" children=[#<Nokogiri::XML::Text:0x2ab559c25428 "WOS:A2">]>]>]
dup_nodes.class
#=> Nokogiri::XML::NodeSet

# seems to be OK, right?

# BUT HERE COMES THE FUN BIT....

dup_nodes.xpath('//UID')
#=> [#<Nokogiri::XML::Element:0x2ab559c21634 name="UID" 
#  children=[#<Nokogiri::XML::Text:0x2ab559c25428 "WOS:A2">]>,
#  #<Nokogiri::XML::Element:0x2ab559c2c0d4 name="UID" 
#  children=[#<Nokogiri::XML::Text:0x2ab559c2e6e0 "WOS:B2">]>]

# Yea, fun, right?  But wait, there's more!

dup_nodes.count
#=> 1
dup_nodes.map { |rec| rec.search('UID').first.text }
#=> ["WOS:A2"]
dup_nodes.xpath('//UID').count
#=> 2
dup_nodes.xpath('//UID').to_s
#=> "<UID>WOS:A2</UID><UID>WOS:B2</UID>"

The dup_nodes.count => 1 and the first line above shows that it has the UID=‘WOS:A2’. So, how can an xpath('//UID') return two different UID values from this NodeSet that seems to only contain one UID?

The only way that dup_nodes.xpath('//UID') could return two results with WOS:A2 and WOS:B2 is if the dup_nodes is somehow still connected to the parent doc that it was derived from, despite a NodeSet.select operation that should not touch the original node set and should return a new object. Although I haven't yet proved it in a similar small example, I believe similar problems beset the NodeSet.dup method.

Oct 03 '17 04:10 dazza-codes

Change your assignment line to:

dup_nodes = Nokogiri::XML::NodeSet.new(Nokogiri::XML::Document.new, duplicates.to_xml)

And I suspect you get the desired behavior.

Also, you can cut out some middle steps by applying .xpath:

uidsA = docA.xpath('//REC/UID').map(&:text)

and Array already has an intersection operator in Ruby, so you don't need to convert to Set:

uid_intersection = uidsA & uidsB
=> ["WOS:A2"]

Oct 05 '17 00:10 atz

Hi, thanks for telling us about your experiences using Nokogiri.

You're performing some interesting and unexpected operations, and so I'm curious if Nokogiri's behavior is really getting in your way, or if you're just being curious/pedantic. To help me understand, could you tell me more about what you're trying to do with Nokogiri?

By this question, I mean that it's more interesting to me to get users from point A to point B efficiently than to figure out how to prevent users from breaking some expectations that Nokogiri has that may not be obvious.

Feb 05 '18 04:02 flavorjones

To step back a bit, I wanted to merge one set of <REC> records into another set of <REC> records, without any duplicate <REC> elements. The unique identifier for each <REC> is in the <REC><UID> element text and that should be used to identify duplicates.

I was trying to work with a Nokogiri document as an immutable instance. I was using .dup or .deep_copy and some enumerable methods. I experimented with some details and noticed some puzzling behavior. I found some confidence with deserializing the content with .to_xml to get it out of one Nokogiri instance and into another. On reflection, I didn't try to apply .freeze to anything, maybe that would make a Nokogiri instance immutable.

There are some details at

https://github.com/sul-dlss/sul_pub/blob/master/lib/web_of_science/records.rb#L70-L111
https://github.com/sul-dlss/sul_pub/blob/master/spec/lib/web_of_science/records_spec.rb#L12-L29
https://github.com/sul-dlss/sul_pub/blob/master/spec/lib/web_of_science/records_spec.rb#L108-L176

Feb 10 '18 18:02 dazza-codes

I'm sorry, I'm not totally following. I looked at the code, and the specs, and I understand what you're trying to do ... do these tests pass? If so, can you help me understand what Nokogiri is doing that's unexpected by writing a failing test?

Feb 17 '18 02:02 flavorjones

nokogiri nokogiri copied to clipboard

NodeSet does not follow ruby conventions for enumerable methods

nokogiri
nokogiri copied to clipboard