Nokogiri::XML::Reader | RuntimeError: Could not parse document

What problems are you experiencing? I'm having an issue parsing very large XML data. The XML is fetched from a RESTful API and then stored in a local variable, called "vuln_data" in this case.

I have attempted many different approaches to parsing this large XML, with no real success. I first tried a SAX parser, which ended up causing Ruby to segfault.

Then, after additional research, I attempted to leverage Nokogiri::XML::Reader. Below is the RuntimeError raised when attempting to read the very large XML data. I've even attempted to use Nokogiri parse options such as HUGE (a sketch of that call follows the transcript below), with no success.

[10] pry(#<:api>)> vuln_data.bytesize
=> 3515502467
[11] pry(#<:api>)> reader = Nokogiri::XML::Reader(vuln_data)
RuntimeError: couldn't create a parser
from /Users/username/.rvm/gems/ruby-2.5.1/gems/nokogiri-1.8.2/lib/nokogiri/xml.rb:59:in `from_memory'
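
For reference, this is roughly how I was passing the HUGE option (a sketch of the shape of the call, not the exact line from my session):

reader = Nokogiri::XML::Reader(vuln_data) { |config| config.huge }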

Attempting to coax a more descriptive error out of the parser produces the results below:

[13] pry(#<:api>)> begin
[13] pry(#<:api>)*   p Nokogiri::XML(vuln_data){ |c| c.strict }  
[13] pry(#<:api>)* rescue => err  
[13] pry(#<:api>)*   p err  
[13] pry(#<:api>)* end  
#<RuntimeError: Could not parse document>
=> #<RuntimeError: Could not parse document>

What's the output from nokogiri -v?

Nokogiri (1.8.2)
---
warnings: []
nokogiri: 1.8.2
ruby:
  version: 2.5.1
  platform: x86_64-darwin17
  description: ruby 2.5.1p57 (2018-03-29 revision 63029) [x86_64-darwin17]
  engine: ruby
libxml:
  binding: extension
  source: packaged
  libxml2_path: "/Users/username/.rvm/gems/ruby-2.5.1/gems/nokogiri-1.8.2/ports/x86_64-apple-darwin17.4.0/libxml2/2.9.7"
  libxslt_path: "/Users/username/.rvm/gems/ruby-2.5.1/gems/nokogiri-1.8.2/ports/x86_64-apple-darwin17.4.0/libxslt/1.1.32"
  libxml2_patches: []
  libxslt_patches: []
  compiled: 2.9.7
  loaded: 2.9.7

Can you provide a self-contained script that reproduces what you're seeing? Unfortunately, I cannot. The XML data we are working with is sensitive in nature and very large; the file is approximately 3.5 GB when written to a flat file.

Any assistance on what I may be doing improperly, or on how to effectively parse XML content this large, would be greatly appreciated.

zeknox · May 21 '18 18:05

Hi,

Thanks for opening this issue and asking this question. The error you're seeing is generated from this code in ext/nokogiri/xml_reader.c in the from_memory function:

  reader = xmlReaderForMemory(
      StringValuePtr(rb_buffer),
      (int)RSTRING_LEN(rb_buffer),
      c_url,
      c_encoding,
      c_options
  );

  if(reader == NULL) {
    xmlFreeTextReader(reader);
    rb_raise(rb_eRuntimeError, "couldn't create a parser");
  }

Per the libxml2 documentation for xmlReaderForMemory, the TL;DR is that in case of error it returns NULL.

Let's look at that function to see why it might be failing ...

xmlTextReaderPtr
xmlReaderForMemory(const char *buffer, int size, const char *URL,
                   const char *encoding, int options)
{
    xmlTextReaderPtr reader;
    xmlParserInputBufferPtr buf;

    buf = xmlParserInputBufferCreateStatic(buffer, size,
                                      XML_CHAR_ENCODING_NONE);
    if (buf == NULL) {
        return (NULL);
    }
    reader = xmlNewTextReader(buf, URL);
    if (reader == NULL) {
        xmlFreeParserInputBuffer(buf);
        return (NULL);
    }
    reader->allocs |= XML_TEXTREADER_INPUT;
    xmlTextReaderSetup(reader, NULL, URL, encoding, options);
    return (reader);
}

Looking at this, the most likely culprit is xmlParserInputBufferCreateStatic: eventually (via xmlBufCreateStatic) it tries to xmlMalloc a buffer equal in length to the string, and an allocation that big may well fail. Does that explanation resonate? Is it possible or likely that the process would not be able to allocate another 3.5 GB of memory?
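
If you want to sanity-check that hypothesis, here's a minimal sketch: in the same process that holds vuln_data, try allocating a second string of the same size and see whether it succeeds.

size = vuln_data.bytesize
begin
  probe = "\0" * size  # forces another allocation of the same magnitude
  puts "allocated another #{probe.bytesize} bytes OK"
rescue NoMemoryError
  puts "could not allocate another #{size} bytes in this process"
end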

If so, then I'd recommend doing one of two things:

  • use the streaming method Nokogiri::XML::Reader.from_io, which operates on an IO object and should avoid allocating large chunks of memory under the hood (see the first sketch below)
  • revisit using the SAX parser, which is intended for exactly this scenario (see the second sketch below). I'd be happy to try to help diagnose where the segfault is occurring (because that shouldn't be happening)
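
To illustrate the first option, here's a minimal sketch of the streaming approach, assuming the document has been written out to a file first (the path "vulns.xml" is hypothetical) rather than held in a Ruby string:

require 'nokogiri'

File.open("vulns.xml") do |io|
  # from_io streams from the IO object instead of copying the whole
  # document into a single in-memory libxml2 buffer
  reader = Nokogiri::XML::Reader.from_io(io)
  reader.each do |node|
    next unless node.node_type == Nokogiri::XML::Reader::TYPE_ELEMENT
    puts node.name
  end
end

And here's a minimal sketch of the second option, using the standard Nokogiri::XML::SAX::Document callbacks (the handler class name is made up for illustration, and the same hypothetical file path is assumed):

require 'nokogiri'

class EachElementHandler < Nokogiri::XML::SAX::Document
  # called once for every opening tag as the parser streams the document
  def start_element(name, attrs = [])
    puts name
  end
end

parser = Nokogiri::XML::SAX::Parser.new(EachElementHandler.new)
File.open("vulns.xml") { |io| parser.parse(io) }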

Thoughts?

flavorjones · May 21 '18 20:05

Hey @flavorjones, thanks for the prompt reply here. I certainly thought it could be a memory issue (not being able to allocate another 3.5 GB), but based on some additional testing that does not seem to be the root cause.

I took your recommendation and tried to leverage Nokogiri::XML::Reader.from_io. My sample code is below:

Nokogiri::XML::Reader.from_io(vuln_data).each do |node|
  if node.name == 'ReportHost' and node.node_type == Nokogiri::XML::Reader::TYPE_ELEMENT
    host_xml = Nokogiri::XML(node.inner_xml)
    import_host(host_xml)
  end
end

Once this block is hit, it throws a different error, which suggests it does not like the XML structure:

Traceback (most recent call last):
Nokogiri::XML::SyntaxError (1:1: FATAL: Extra content at the end of the document)

It is important to note that if we modify the code to drop the .from_io method (i.e. Nokogiri::XML::Reader(vuln_data)), the XML parses and processing continues without issue when vuln_data is a smaller string.

Open to any additional insight, and if SAX is the way to go, happy to entertain that route again as well. Cheers.

zeknox · May 21 '18 21:05