Nokogiri::XML::Reader | RuntimeError: Could not parse document
What problems are you experiencing? I'm having an issue parsing very large XML data. The XML is fetched from a RESTful API and then stored into a local variable, in this case called "vuln_data".
I have attempted many different variations to parse this large XML with no real success. I first tried a SAX parser, which ended up causing Ruby to segfault.
Then, after additional research, I attempted to use Nokogiri::XML::Reader. I've even attempted to use Nokogiri parse options such as HUGE, with no success. Below is the RuntimeError that is raised when attempting to read the very large XML data.
[10] pry(#<:api>)> vuln_data.bytesize
=> 3515502467
[11] pry(#<:api>)> reader = Nokogiri::XML::Reader(vuln_data)
RuntimeError: couldn't create a parser
from /Users/username/.rvm/gems/ruby-2.5.1/gems/nokogiri-1.8.2/lib/nokogiri/xml.rb:59:in `from_memory'
When attempting to get a more descriptive error, you can see the results below:
[13] pry(#<:api>)> begin
[13] pry(#<:api>)* p Nokogiri::XML(vuln_data){ |c| c.strict }
[13] pry(#<:api>)* rescue => err
[13] pry(#<:api>)* p err
[13] pry(#<:api>)* end
#<RuntimeError: Could not parse document>
=> #<RuntimeError: Could not parse document>
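For reference, the HUGE attempts mentioned above were along these lines; the exact calls aren't preserved here, so this is only a sketch of the usual way that option is enabled:

# enabling HUGE via the Reader factory (the block yields the parse options)
reader = Nokogiri::XML::Reader(vuln_data) { |config| config.huge }

# or via the DOM parser's config block
doc = Nokogiri::XML(vuln_data) { |config| config.huge.strict }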
What's the output from nokogiri -v?
Nokogiri (1.8.2)
---
warnings: []
nokogiri: 1.8.2
ruby:
  version: 2.5.1
  platform: x86_64-darwin17
  description: ruby 2.5.1p57 (2018-03-29 revision 63029) [x86_64-darwin17]
  engine: ruby
libxml:
  binding: extension
  source: packaged
  libxml2_path: "/Users/username/.rvm/gems/ruby-2.5.1/gems/nokogiri-1.8.2/ports/x86_64-apple-darwin17.4.0/libxml2/2.9.7"
  libxslt_path: "/Users/username/.rvm/gems/ruby-2.5.1/gems/nokogiri-1.8.2/ports/x86_64-apple-darwin17.4.0/libxslt/1.1.32"
  libxml2_patches: []
  libxslt_patches: []
  compiled: 2.9.7
  loaded: 2.9.7
Can you provide a self-contained script that reproduces what you're seeing? Unfortunately I cannot. The XML data we are working with is sensitive in nature and very large; the file is approximately 3.5 GB when stored as a flat file.
Any assistance with what I may be doing improperly, or with how I can effectively parse XML content this large, would be greatly appreciated.
Hi,
Thanks for opening this issue and asking this question. The error you're seeing is generated from this code in ext/nokogiri/xml_reader.c in the from_memory function:
reader = xmlReaderForMemory(
  StringValuePtr(rb_buffer),
  (int)RSTRING_LEN(rb_buffer),
  c_url,
  c_encoding,
  c_options
);

if(reader == NULL) {
  xmlFreeTextReader(reader);
  rb_raise(rb_eRuntimeError, "couldn't create a parser");
}
The documentation for xmlReaderForMemory is here, but the TL;DR is that in case of error, it returns NULL.
Let's look at that function to see why it might be failing ...
xmlTextReaderPtr
xmlReaderForMemory(const char *buffer, int size, const char *URL,
                   const char *encoding, int options)
{
    xmlTextReaderPtr reader;
    xmlParserInputBufferPtr buf;

    buf = xmlParserInputBufferCreateStatic(buffer, size,
                                           XML_CHAR_ENCODING_NONE);
    if (buf == NULL) {
        return (NULL);
    }
    reader = xmlNewTextReader(buf, URL);
    if (reader == NULL) {
        xmlFreeParserInputBuffer(buf);
        return (NULL);
    }
    reader->allocs |= XML_TEXTREADER_INPUT;
    xmlTextReaderSetup(reader, NULL, URL, encoding, options);
    return (reader);
}
Looking at this, the most likely culprit is xmlParserInputBufferCreateStatic, which eventually (via xmlBufCreateStatic) tries to xmlMalloc a buffer equal in length to the string, and which may be failing to allocate a buffer that big. Does that explanation resonate? Is it possible or likely that the process would not be able to allocate another 3.5 GB of memory?
If so, then I'd recommend doing one of two things (rough sketches of both follow this list):
- use the streaming method Nokogiri::XML::Reader.from_io, which operates on an IO object and should avoid allocating large chunks of memory under the hood
- revisit using the SAX parser, which is intended for exactly this scenario. I'd be happy to try to help diagnose where the segfault is occurring (because that shouldn't be happening)
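For illustration only, here are sketches of both approaches; the file path, handler class, and per-node handling are placeholders, not your actual code:

# 1. streaming with Nokogiri::XML::Reader.from_io -- the document is pulled
#    through the parser from an IO instead of a 3.5 GB in-memory String
File.open("/path/to/vulns.xml") do |io|            # hypothetical path
  Nokogiri::XML::Reader.from_io(io).each do |node|
    # handle each node as it streams past
  end
end

# 2. SAX: callbacks fire as the parser walks the document, so the whole
#    document is never materialized in memory
class VulnHandler < Nokogiri::XML::SAX::Document   # hypothetical handler
  def start_element(name, attrs = [])
    # react to each opening tag
  end

  def end_element(name)
    # react to each closing tag
  end
end

Nokogiri::XML::SAX::Parser.new(VulnHandler.new).parse_file("/path/to/vulns.xml")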
Thoughts?
Hey @flavorjones, thanks for the prompt reply here. I certainly thought it could be a memory issue with not being able to allocate another 3.5 GB, but that does not seem to be the root issue based on some additional testing.
I took your recommendation and tried to use Nokogiri::XML::Reader.from_io. My sample code is below:
Nokogiri::XML::Reader.from_io(vuln_data).each do |node|
  if node.name == 'ReportHost' and node.node_type == Nokogiri::XML::Reader::TYPE_ELEMENT
    host_xml = Nokogiri::XML(node.inner_xml)
    import_host(host_xml)
  end
end
Once this block is hit, it throws a different error, which seems to indicate it does not like the XML structure.
Traceback (most recent call last):
Nokogiri::XML::SyntaxError (1:1: FATAL: Extra content at the end of the document)
It is important to note that if we modify the code to remove the .from_io method (i.e. Nokogiri::XML::Reader(vuln_data)), the XML parses and the import continues without issue when vuln_data is a smaller string.
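(For reference, from_io is documented to operate on an IO object, i.e. something that responds to #read, so an in-memory String would typically be wrapped first. A minimal sketch, assuming vuln_data is a String as in the bytesize output above:)

require 'stringio'

# wrap the in-memory String so from_io receives an IO-like object
io = StringIO.new(vuln_data)
Nokogiri::XML::Reader.from_io(io).each do |node|
  # ... same per-node handling as in the block above ...
end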
Open to any additional insight, and if SAX is the way to go, happy to entertain that route again as well. Cheers.