boilerpipe
Limit HTML parsing depth to avoid out-of-memory situations
What steps will reproduce the problem?
(using version 1.2.0)
1. Parse the HTML at "http://worldwidescience.org/topicpages/s.html".
ArticleExtractor is sufficient to demonstrate the problem.
Even with 8 GB of JVM memory, this results in an out-of-memory error.
Attached is a patch that allows limiting the number of TextBlocks
created/appended by boilerpipe. Once that limit is reached, boilerpipe
ignores all further content from the parsed input.
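
The limiting behavior described above can be illustrated with a minimal, self-contained sketch. This is not the attached patch and does not use boilerpipe's classes: BoundedBlockCollector, maxBlocks, add, and isLimitReached are hypothetical names chosen for illustration, and plain strings stand in for TextBlocks. It only shows the general idea of dropping everything after a fixed number of blocks so that memory use stays bounded.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/**
 * Hypothetical stand-in for a block collector with an upper bound.
 * Not part of boilerpipe; strings are used in place of TextBlocks.
 */
final class BoundedBlockCollector {
    private final int maxBlocks;
    private final List<String> blocks = new ArrayList<>();
    private boolean limitReached = false;

    BoundedBlockCollector(int maxBlocks) {
        if (maxBlocks < 0) {
            throw new IllegalArgumentException("maxBlocks must be >= 0");
        }
        this.maxBlocks = maxBlocks;
    }

    /**
     * Stores the block if the limit has not been reached.
     * Returns true if the block was stored, false if it was ignored.
     */
    boolean add(String blockText) {
        if (blocks.size() >= maxBlocks) {
            // Limit reached: ignore this and all further content.
            limitReached = true;
            return false;
        }
        blocks.add(blockText);
        return true;
    }

    /** True once at least one block has been dropped because of the limit. */
    boolean isLimitReached() {
        return limitReached;
    }

    /** Read-only view of the blocks that were kept. */
    List<String> getBlocks() {
        return Collections.unmodifiableList(blocks);
    }
}
```

A parser callback could call add for each block it produces and stop doing further work once isLimitReached returns true. Where exactly such a check belongs inside boilerpipe is determined by the patch itself, not by this sketch.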
Original issue reported on code.google.com by [email protected]
on 25 Nov 2013 at 4:29
Attachments: