nokogiri
nokogiri copied to clipboard
[help] Nokogiri::XML::Reader does not recover from parsing errors
What problem are you trying to solve?
We are parsing an XML file containing financial data. The vendor producing this file does not appropriately escape special characters such as &.
We use the Nokogiri::XML::Reader.from_io(file) invocation to stream parsing of these files. When we try to parse a file that has un-escaped special characters, the reader aborts with the error FATAL: EntityRef: expecting ';'
I was expecting that the Nokogiri::XML::ParseOptions::RECOVER option would help us recover from this error. And it does seem to help when invoking an in memory XML parse, but it does not help while using the Reader. Is there a way Nokogiri can help us recover from these errors while also giving us the streaming parsing?
#! /usr/bin/env ruby
require 'nokogiri'
offending_content = '<content>Chapter 1 & 2</content>'
# Errors!
begin
Nokogiri.XML(offending_content, nil, 'UTF-8', Nokogiri::XML::ParseOptions::STRICT)
rescue Nokogiri::XML::SyntaxError
puts 'syntax error!'
end
# Recovers.
Nokogiri.XML(offending_content, nil, 'UTF-8', Nokogiri::XML::ParseOptions::RECOVER)
puts 'recovered!'
# []
# I would think that this could recover, but it does not
begin
Nokogiri::XML::Reader.from_io(StringIO.new(offending_content), nil, 'UTF-8', Nokogiri::XML::ParseOptions::RECOVER).each {}
rescue Nokogiri::XML::SyntaxError
puts 'syntax error from io!'
end
This produces
syntax error!
recovered!
syntax error from io!
Environment
$ bundle exec nokogiri -v
# Nokogiri (1.11.3)
---
warnings: []
nokogiri:
version: 1.11.3
cppflags:
- "-I/Users/mattmcf/.rbenv/versions/2.5.8/lib/ruby/gems/2.5.0/gems/nokogiri-1.11.3-x86_64-darwin/ext/nokogiri"
- "-I/Users/mattmcf/.rbenv/versions/2.5.8/lib/ruby/gems/2.5.0/gems/nokogiri-1.11.3-x86_64-darwin/ext/nokogiri/include"
- "-I/Users/mattmcf/.rbenv/versions/2.5.8/lib/ruby/gems/2.5.0/gems/nokogiri-1.11.3-x86_64-darwin/ext/nokogiri/include/libxml2"
ldflags: []
ruby:
version: 2.5.8
platform: x86_64-darwin20
gem_platform: x86_64-darwin-20
description: ruby 2.5.8p224 (2020-03-31 revision 67882) [x86_64-darwin20]
engine: ruby
libxml:
source: packaged
precompiled: true
patches:
- 0001-Revert-Do-not-URI-escape-in-server-side-includes.patch
- 0002-Remove-script-macro-support.patch
- 0003-Update-entities-to-remove-handling-of-ssi.patch
- 0004-libxml2.la-is-in-top_builddir.patch
- 0005-Fix-infinite-loop-in-xmlStringLenDecodeEntities.patch
- 0006-htmlParseComment-treat-as-if-it-closed-the-comment.patch
- 0007-use-new-htmlParseLookupCommentEnd-to-find-comment-en.patch
- '0008-use-glibc-strlen.patch'
- '0009-avoid-isnan-isinf.patch'
- 0010-parser.c-shrink-the-input-buffer-when-appropriate.patch
- 0011-update-automake-files-for-arm64.patch
libxml2_path: "/Users/mattmcf/.rbenv/versions/2.5.8/lib/ruby/gems/2.5.0/gems/nokogiri-1.11.3-x86_64-darwin/ext/nokogiri"
iconv_enabled: true
compiled: 2.9.10
loaded: 2.9.10
libxslt:
source: packaged
precompiled: true
patches:
- 0001-update-automake-files-for-arm64.patch
compiled: 1.1.34
loaded: 1.1.34
other_libraries:
zlib: 1.2.11
libiconv: '1.15'
@mattmcf Thanks for opening this issue, and sorry that you're having problems.
The Reader parser is ... not well-maintained and hasn't received much love over the years. What you're seeing may be a bug in how Nokogiri is calling the libxml2 API, or it may be a bug (or a "feature") in libxml2 itself.
I need to do a bit of work to figure out what's going on here.
@flavorjones Thank you for the sanity check! Please let me know if there's any additional legwork I can do to help out.
@mattmcf Hey! So I understand what's going here, which is very similar to the bug fixed for HTML DOM parsing in #2130. Pretty easy to fix, but unfortunately the Java implementation really needs some love and so I'm going to take a bit of time to clean up rather than ship a quick fix. Hopefully over the next week I'll have a fix on main, though.
@flavorjones Thank you! That's great to hear.
@mattmcf Just an update here, I've dug in and libxml2's pull parser doesn't seem to be able to recover the way I expect it to. It's not clear to me right now if this is intentional, or a missing feature, or a bug.
I've found this email thread which indicates at least one other person in the universe besides us has noticed this.
But I've also found these docs for the Perl libxml2 wrapper which kind of imply that RECOVER works for the "reader" API.
All of which is to say, this is a bit messier than I originally anticipated. When I get a bit more time I'll dig back in.
Additional find: this thread from 2012 talking about making the Reader's error handling a little less tragic; but it's clear from the libxml2 maintainer's response that the Pull Parser isn't intended to recover from errors.
I'm less confident that the xmlreader API offered by libxml2 can recover from parsing errors. My advice would be to investigate using the SAX parser which is indeed capable of recovering from parsing errors.
@flavorjones Ah shucks. Thank you for pulling on this thread and diving deep into the rabbit hole. I think we will investigate an array of possible solutions including a SAX parser. Thank you for helping root cause this issue!
The XML reader interface uses libxml2's push parser which doesn't support recovery.
@nwellnhof Thanks for validating. I'll close this once I've updated the docs.