nokogiri
allow XPath within XmlReader
From an email from Devlin Daley to nokogiri-talk:
If I had any C extension fu I would add what I think is an awesome approach to Nokogiri for parsing large xml files. The bummer about going to SAX or a reader is that you lose xpath and css selectors. The compromise is to restrict yourself to only forward-looking xpath expressions and register those xpaths. Combine that with a Reader or pull-parser, and you just ask for the next matching element. This is explained by Dare Obsajango: http://msdn.microsoft.com/en-us/library/ms950778.aspx
On a similar note, libxml2 has a method called expand(), explained in the section titled "Mixing the reader and tree or XPath operations" at the bottom of this page: when you find the start element of the node you're looking for, you can call expand to get that subtree as a DOM and run selectors on just the interesting subset of the document. http://xmlsoft.org/xmlreader.html
The other ruby libxml library exposes this method in their Reader, but whenever I tried to use it I ran into memory problems and crashes. http://libxml.rubyforge.org/rdoc/classes/LibXML/XML/Reader.html#M000353
I am running into this very difficulty at the moment. It would be SOOOO much more convenient if the Reader object had an xpath method that essentially created a mini xml document out of the object and allowed XPath and CSS selectors within that subtree.
I definitely vote in favor of this. This would make rendering large XML schemas much faster.
+1! :)
I agree.
+1
+1
+1
+1
+1
OK. We hear you. Scheduling work on this for the 1.5 branch.
Just looking in to this, and I have some thoughts:
This looks like an incredibly tricky feature to add, and here's why. From the xmlTextReaderExpand documentation:

> Returns a node pointer valid until the next xmlTextReaderRead()
Imagine we had some ruby code like this:
```ruby
nodes = []
reader = Nokogiri::XML.Reader(some_xml)
reader.each do |r|
  # ...
  nodes << r.expand.xpath('.//whatever')
end
```
Any nodes that are exposed in the subtree returned from the expand call will be invalid pointers on the next iteration of the reader block. Thus our nodes list will contain a boatload of bad pointers.
If we're going to add this feature, we need to figure out a way to sandbox the entire subtree inside the iteration block. Otherwise, people are going to crash left and right.
I've pushed a branch with a commit that integrates the expand method. If you pull the branch and run the tests, you'll see it crash and burn:
http://github.com/tenderlove/nokogiri/tree/expand
Q. How, exactly, does what XPath returns differ from an XML document? Is there no way of wrapping a pair of pseudo-root tags around it and treating the result as an XML document?
Agree with @tenderlove. I've tried hacking his branch to:
- dup and root the subtree (with xmlDocCopyNode) to try to make it persistent
- create a new document, and copy the subtree to that new document
and both are crashing and burning.
To work around this, we'll need to spend some time understanding how memory is shared among Reader, Document and Node within libxml; and even once we understand it, I'm not sure we'll be able to hack a workaround together inside Nokogiri.
It's late, and I'm tired. I'll look again with fresh eyes later.
I think Reader#outer_xml is a workaround for expand(), but it's probably not as efficient, since the string has to be re-parsed into a doc (after the reader already parsed it to provide the outer_xml).
The project at http://libxml.rubyforge.org/ seems to have found a fix. See closed issue 20117. However, there may be a memory leak (issue 26297).
Awesome. I don't know how you found it (serious googlechaeology?) but here are the deep links:
- http://rubyforge.org/pipermail/libxml-devel/2008-July/000823.html
- http://rubyforge.org/tracker/index.php?func=detail&aid=26297&group_id=494&atid=1971
I'll take a look.
I think what libxml has for this is XmlPattern. From the Perl bindings:

```perl
use XML::LibXML;
my $pattern = XML::LibXML::Pattern->new('/x:html/x:body//x:div', { 'x' => 'http://www.w3.org/1999/xhtml' });

# test a match on an XML::LibXML::Node $node
if ($pattern->matchesNode($node)) { ... }

# or on an XML::LibXML::Reader
if ($reader->matchesPattern($pattern)) { ... }

# or skip reading all nodes that do not match
print $reader->nodePath while $reader->nextPatternMatch($pattern);

$pattern = XML::LibXML::Pattern->new( pattern, { prefix => namespace_URI, ... } );
$bool = $pattern->matchesNode($node);
```
So if we can get XML::LibXML::Pattern-style support, then we can continue to use the reader to quickly get where we want via a subset of XPath, and then read from there.
FYI some performance numbers:
Parsing through a 4 GB XML file and expanding 40,000 nodes takes around 450 seconds and 280 MB of RAM using Nokogiri when creating a new doc from the outer XML, and around 95 seconds and 205 MB of RAM using libxml-ruby with reader.expand. So indeed xmlTextReaderExpand is much more efficient.
Maybe a way to discourage the usage of the expanded node outside the current iteration would be to use a block API:

```ruby
Nokogiri::XML::Reader(file).each do |n|
  if n.depth == 2 && n.node_type == Nokogiri::XML::Reader::TYPE_ELEMENT && n.name == 'Product'
    # doc = Nokogiri::XML(n.outer_xml)
    n.expand do |doc|
      # do something with doc
    end
  end
end
```
There's probably no efficient way to prevent people from using the document outside the current iteration. The only thing I can think of is to wrap each document, node, etc. that is accessed inside the block in something that raises an exception when accessed outside of the block.
I still think this feature would be worthwhile to have, since it's very useful for batch processing of large XML files where all the logic for extracting information can be handled inside a single read operation.
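The wrap-and-raise idea can be sketched in plain Ruby with a proxy that is invalidated once the block returns. This is a hypothetical design sketch, not Nokogiri API; `ExpiringProxy` and `with_expanded` are invented names, and it doesn't address the underlying C pointer lifetime, only the Ruby-level access:

```ruby
# Proxy that forwards calls to a target object until it is expired,
# after which any call raises instead of touching the (stale) target.
class ExpiringProxy
  def initialize(target)
    @target = target
    @valid  = true
  end

  def expire!
    @valid = false
  end

  def method_missing(name, *args, &blk)
    raise "node used outside its Reader iteration" unless @valid
    @target.public_send(name, *args, &blk)
  end

  def respond_to_missing?(name, include_private = false)
    @target.respond_to?(name, include_private)
  end
end

# Yield a proxy for the expanded doc; expire it when the block exits,
# so a leaked reference raises instead of dereferencing freed memory.
def with_expanded(doc)
  proxy = ExpiringProxy.new(doc)
  yield proxy
ensure
  proxy.expire!
end
```

A leaked proxy would then fail loudly on the next iteration rather than segfaulting, which is at least a debuggable failure mode.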
Another approach would be to call xmlTextReaderPreserve during expand and xmlTextReaderCurrentDoc before freeing the reader, but I'm not sure how well that would interact with garbage collection.
i'd still love this