nokogiri icon indicating copy to clipboard operation
nokogiri copied to clipboard

[help] Nokogiri::XML::Reader does not recover from parsing errors

Open mattmcf opened this issue 4 years ago • 7 comments

What problem are you trying to solve?

We are parsing an XML file containing financial data. The vendor producing this file does not appropriately escape special characters such as &.

We use the Nokogiri::XML::Reader.from_io(file) invocation to stream parsing of these files. When we try to parse a file that has un-escaped special characters, the reader aborts with the error FATAL: EntityRef: expecting ';'

I was expecting that the Nokogiri::XML::ParseOptions::RECOVER option would help us recover from this error. And it does seem to help when invoking an in memory XML parse, but it does not help while using the Reader. Is there a way Nokogiri can help us recover from these errors while also giving us the streaming parsing?

#! /usr/bin/env ruby

require 'nokogiri'

offending_content = '<content>Chapter 1 & 2</content>'

# Errors!
begin
  Nokogiri.XML(offending_content, nil, 'UTF-8', Nokogiri::XML::ParseOptions::STRICT)
rescue Nokogiri::XML::SyntaxError
  puts 'syntax error!'
end

# Recovers.
Nokogiri.XML(offending_content, nil, 'UTF-8', Nokogiri::XML::ParseOptions::RECOVER)
puts 'recovered!'
# []

# I would think that this could recover, but it does not
begin
  Nokogiri::XML::Reader.from_io(StringIO.new(offending_content), nil, 'UTF-8', Nokogiri::XML::ParseOptions::RECOVER).each {}
rescue Nokogiri::XML::SyntaxError
  puts 'syntax error from io!'
end 

This produces

syntax error!
recovered!
syntax error from io!

Environment

$ bundle exec nokogiri -v
# Nokogiri (1.11.3)
    ---
    warnings: []
    nokogiri:
      version: 1.11.3
      cppflags:
      - "-I/Users/mattmcf/.rbenv/versions/2.5.8/lib/ruby/gems/2.5.0/gems/nokogiri-1.11.3-x86_64-darwin/ext/nokogiri"
      - "-I/Users/mattmcf/.rbenv/versions/2.5.8/lib/ruby/gems/2.5.0/gems/nokogiri-1.11.3-x86_64-darwin/ext/nokogiri/include"
      - "-I/Users/mattmcf/.rbenv/versions/2.5.8/lib/ruby/gems/2.5.0/gems/nokogiri-1.11.3-x86_64-darwin/ext/nokogiri/include/libxml2"
      ldflags: []
    ruby:
      version: 2.5.8
      platform: x86_64-darwin20
      gem_platform: x86_64-darwin-20
      description: ruby 2.5.8p224 (2020-03-31 revision 67882) [x86_64-darwin20]
      engine: ruby
    libxml:
      source: packaged
      precompiled: true
      patches:
      - 0001-Revert-Do-not-URI-escape-in-server-side-includes.patch
      - 0002-Remove-script-macro-support.patch
      - 0003-Update-entities-to-remove-handling-of-ssi.patch
      - 0004-libxml2.la-is-in-top_builddir.patch
      - 0005-Fix-infinite-loop-in-xmlStringLenDecodeEntities.patch
      - 0006-htmlParseComment-treat-as-if-it-closed-the-comment.patch
      - 0007-use-new-htmlParseLookupCommentEnd-to-find-comment-en.patch
      - '0008-use-glibc-strlen.patch'
      - '0009-avoid-isnan-isinf.patch'
      - 0010-parser.c-shrink-the-input-buffer-when-appropriate.patch
      - 0011-update-automake-files-for-arm64.patch
      libxml2_path: "/Users/mattmcf/.rbenv/versions/2.5.8/lib/ruby/gems/2.5.0/gems/nokogiri-1.11.3-x86_64-darwin/ext/nokogiri"
      iconv_enabled: true
      compiled: 2.9.10
      loaded: 2.9.10
    libxslt:
      source: packaged
      precompiled: true
      patches:
      - 0001-update-automake-files-for-arm64.patch
      compiled: 1.1.34
      loaded: 1.1.34
    other_libraries:
      zlib: 1.2.11
      libiconv: '1.15'

mattmcf avatar May 05 '21 17:05 mattmcf

@mattmcf Thanks for opening this issue, and sorry that you're having problems.

The Reader parser is ... not well-maintained and hasn't received much love over the years. What you're seeing may be a bug in how Nokogiri is calling the libxml2 API, or it may be a bug (or a "feature") in libxml2 itself.

I need to do a bit of work to figure out what's going on here.

flavorjones avatar May 07 '21 11:05 flavorjones

@flavorjones Thank you for the sanity check! Please let me know if there's any additional legwork I can do to help out.

mattmcf avatar May 07 '21 14:05 mattmcf

@mattmcf Hey! So I understand what's going here, which is very similar to the bug fixed for HTML DOM parsing in #2130. Pretty easy to fix, but unfortunately the Java implementation really needs some love and so I'm going to take a bit of time to clean up rather than ship a quick fix. Hopefully over the next week I'll have a fix on main, though.

flavorjones avatar May 13 '21 12:05 flavorjones

@flavorjones Thank you! That's great to hear.

mattmcf avatar May 13 '21 15:05 mattmcf

@mattmcf Just an update here, I've dug in and libxml2's pull parser doesn't seem to be able to recover the way I expect it to. It's not clear to me right now if this is intentional, or a missing feature, or a bug.

I've found this email thread which indicates at least one other person in the universe besides us has noticed this.

But I've also found these docs for the Perl libxml2 wrapper which kind of imply that RECOVER works for the "reader" API.

All of which is to say, this is a bit messier than I originally anticipated. When I get a bit more time I'll dig back in.

flavorjones avatar May 18 '21 04:05 flavorjones

Additional find: this thread from 2012 talking about making the Reader's error handling a little less tragic; but it's clear from the libxml2 maintainer's response that the Pull Parser isn't intended to recover from errors.

I'm less confident that the xmlreader API offered by libxml2 can recover from parsing errors. My advice would be to investigate using the SAX parser which is indeed capable of recovering from parsing errors.

flavorjones avatar May 18 '21 04:05 flavorjones

@flavorjones Ah shucks. Thank you for pulling on this thread and diving deep into the rabbit hole. I think we will investigate an array of possible solutions including a SAX parser. Thank you for helping root cause this issue!

mattmcf avatar May 19 '21 15:05 mattmcf

The XML reader interface uses libxml2's push parser which doesn't support recovery.

nwellnhof avatar Dec 27 '23 13:12 nwellnhof

@nwellnhof Thanks for validating. I'll close this once I've updated the docs.

flavorjones avatar Dec 27 '23 17:12 flavorjones