nokogiri
allow XPath within XmlReader
From an email from Devlin Daley to nokogiri-talk:
If I had any C extension fu I would add what I think is an awesome approach to Nokogiri for parsing large xml files. The bummer about going to SAX or a reader is that you lose xpath and css selectors. The compromise is to restrict yourself to only forward-looking xpath expressions and register those xpaths. Combine that with a Reader or pull-parser, and you just ask for the next matching element. This is explained by Dare Obsajango: http://msdn.microsoft.com/en-us/library/ms950778.aspx
On a similar note, libxml2 has a method called expand(), explained in the section titled "Mixing the reader and tree or XPath operations" at the bottom of this page: when you find the start element of the node you're looking for, you can call expand to get that subtree as a DOM and run selectors on just the interesting subset of the document. http://xmlsoft.org/xmlreader.html
The other ruby libxml library exposes this method in their Reader, but whenever I tried to use it I ran into memory problems and crashes. http://libxml.rubyforge.org/rdoc/classes/LibXML/XML/Reader.html#M000353
I am running into this very difficulty at the moment. It would be SOOOO much more convenient if the Reader object had an xpath method that essentially created a mini xml document out of the object and allowed XPath and CSS selectors within that subtree.
I definitely vote in favor of this. This would make rendering large XML schemas much faster.
+1! :)
I agree.
+1
+1
+1
+1
+1
OK. We hear you. Scheduling work on this for the 1.5 branch.
Just looking in to this, and I have some thoughts:
This looks like an incredibly tricky feature to add, and here's why. From the xmlTextReaderExpand documentation:

> Returns a node pointer valid until the next xmlTextReaderRead()
Imagine we had some ruby code like this:
```ruby
nodes = []
reader = Nokogiri::XML.Reader(some_xml)
reader.each do |r|
  # ...
  nodes << r.expand.xpath('.//whatever')
end
```
Any nodes that are exposed in the subtree returned from the expand call will be invalid pointers on the next iteration of the reader block. Thus our nodes list will contain a boatload of bad pointers.
If we're going to add this feature, we need to figure out a way to sandbox the entire subtree inside the iteration block. Otherwise, people are going to crash left and right.
I've pushed a branch with a commit that integrates the expand method. If you pull the branch and run the tests, you'll see it crash and burn:
http://github.com/tenderlove/nokogiri/tree/expand
Q. How, exactly, does what XPath returns differ from an XML document? Is there no way of wrapping a pair of pseudo-root tags around it and treating the result as an XML document?
Agree with @tenderlove. I've tried hacking his branch to:
- dup and root the subtree (with xmlDocCopyNode) to try to make it persistent
- create a new document, and copy the subtree to that new document
and both are crashing and burning.
To work around this, we'll need to spend some time understanding how memory is shared among Reader, Document and Node within libxml; and even once we understand it, I'm not sure we'll be able to hack a workaround together inside Nokogiri.
It's late, and I'm tired. I'll look again with fresh eyes later.
I think Reader#outer_xml is a workaround for expand(), but it's probably not as efficient, since the string has to be re-parsed into a doc (after the reader already parsed it to provide the outer_xml).
The project at http://libxml.rubyforge.org/ seems to have found a fix. See closed issue 20117. However, there may be a memory leak (issue 26297).
Awesome. I don't know how you found it (serious googlechaeology?) but here are the deep links:
- http://rubyforge.org/pipermail/libxml-devel/2008-July/000823.html
- http://rubyforge.org/tracker/index.php?func=detail&aid=26297&group_id=494&atid=1971
I'll take a look.
I think what libxml has for this is XmlPattern. From the Perl bindings:

```perl
use XML::LibXML;
my $pattern = XML::LibXML::Pattern->new('/x:html/x:body//x:div', { 'x' => 'http://www.w3.org/1999/xhtml' });

# test a match on an XML::LibXML::Node $node
if ($pattern->matchesNode($node)) { ... }

# or on an XML::LibXML::Reader
if ($reader->matchesPattern($pattern)) { ... }

# or skip reading all nodes that do not match
print $reader->nodePath while $reader->nextPatternMatch($pattern);

$pattern = XML::LibXML::Pattern->new( pattern, { prefix => namespace_URI, ... } );
$bool = $pattern->matchesNode($node);
```
So if we can get XML::LibXML::Pattern-style support, then we can continue to use the reader to quickly get where we want via a subset of XPath, and then read from there.
FYI some performance numbers:
Parsing through a 4 GB XML file and expanding 40,000 nodes takes around 450 seconds and 280 MB of RAM using Nokogiri when creating a new doc from the outer XML, and around 95 seconds and 205 MB of RAM using libxml-ruby with reader.expand. So indeed xmlTextReaderExpand is much more efficient.
Maybe a way to discourage the usage of the expanded node outside the current iteration would be to use a block API:

```ruby
Nokogiri::XML::Reader(file).each do |n|
  if n.depth == 2 && n.node_type == Nokogiri::XML::Reader::TYPE_ELEMENT && n.name == 'Product'
    # doc = Nokogiri::XML(n.outer_xml)
    n.expand do |doc|
      # do something with doc
    end
  end
end
```
There's probably no efficient way to prevent people from using the document outside the current iteration. The only thing I can think of is to wrap each document, node, etc. that is accessed inside the block in something that raises an exception when accessed outside of the block.
I still think this feature would be worthwhile to have, since it's very useful for batch processing of large XML files where all the logic for extracting information can be handled inside a single read operation.
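The wrap-and-raise idea can be sketched in plain Ruby with a proxy that is invalidated once the block returns. This is a hypothetical design sketch, not Nokogiri API; `ExpiringProxy` and `with_expanded` are invented names, and it doesn't address the underlying C pointer lifetime, only the Ruby-level access:

```ruby
# Proxy that forwards calls to a target object until it is expired,
# after which any call raises instead of touching the (stale) target.
class ExpiringProxy
  def initialize(target)
    @target = target
    @valid  = true
  end

  def expire!
    @valid = false
  end

  def method_missing(name, *args, &blk)
    raise "node used outside its Reader iteration" unless @valid
    @target.public_send(name, *args, &blk)
  end

  def respond_to_missing?(name, include_private = false)
    @target.respond_to?(name, include_private)
  end
end

# Yield a proxy for the expanded doc; expire it when the block exits,
# so a leaked reference raises instead of dereferencing freed memory.
def with_expanded(doc)
  proxy = ExpiringProxy.new(doc)
  yield proxy
ensure
  proxy.expire!
end
```

A leaked proxy would then fail loudly on the next iteration rather than segfaulting, which is at least a debuggable failure mode.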
Another approach would be to call xmlTextReaderPreserve during expand and xmlTextReaderCurrentDoc before freeing the reader, but I'm not sure how well that would interact with garbage collection.
i'd still love this