
wikipedia: increase REXML entity expansion limit during XML parsing

Open otegami opened this issue 6 months ago • 2 comments

Calling Datasets::Wikipedia#each raises "entity expansion has grown too large (RuntimeError)". This happens because REXML enforces an entity expansion limit, set in https://github.com/ruby/rexml/pull/187, and Datasets::Wikipedia#each exceeds that limit.

Raising the entity expansion limit is acceptable in Red Datasets because we need to handle large datasets. Therefore, we should temporarily increase the limit while parsing the XML.

require 'datasets'

wikipedia = Datasets::Wikipedia.new
wikipedia.each do |wiki|
  pp wiki
end

$ cd red-datasets && bundle && bundle exec ruby wiki
/home/otegami/.rbenv/versions/3.3.3/lib/ruby/gems/3.3.0/gems/rexml-3.3.4/lib/rexml/parsers/baseparser.rb:560:in `block in unnormalize': entity expansion has grown too large (RuntimeError)
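
The temporary increase proposed above could look roughly like the following. This is only a sketch, not the actual Red Datasets change: the helper name with_increased_entity_expansion_text_limit and the limit value are hypothetical. It assumes the error comes from REXML's global text expansion limit, which is read and written through REXML::Security.entity_expansion_text_limit. The setting is process-wide, so the original value is restored in an ensure clause.

```ruby
require "rexml/document"

# Hypothetical helper (not part of Red Datasets or REXML): raise REXML's
# global text expansion limit for the duration of the block, then restore
# the previous value even if the block raises.
def with_increased_entity_expansion_text_limit(limit)
  original = REXML::Security.entity_expansion_text_limit
  REXML::Security.entity_expansion_text_limit = limit
  begin
    yield
  ensure
    REXML::Security.entity_expansion_text_limit = original
  end
end

# Usage: the XML parsing would go inside the block.
# The value 1_000_000 is an arbitrary placeholder, not a recommended setting.
with_increased_entity_expansion_text_limit(1_000_000) do
  # parse the large XML here
end
```

Because the limit is a global setting, this approach affects every REXML parse running in the process while the block is active, which is worth keeping in mind for multi-threaded use.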

otegami, Aug 05 '24 13:08