Limit the HTML parsing depth to avoid out-of-memory situations

Open. GoogleCodeExporter opened this issue 9 years ago • 1 comment

What steps will reproduce the problem?

(using version 1.2.0)
1. Parse the HTML at "http://worldwidescience.org/topicpages/s.html". ArticleExtractor is sufficient for demonstration purposes.

Even with 8 GB of JVM memory, this results in an out-of-memory exception.

Attached is a patch that allows limiting the number of TextBlocks created/appended by boilerpipe. Once that limit is reached, boilerpipe ignores all further content from the parsed input.
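
For readers without access to the attachment, here is a minimal sketch of the general idea, not the patch itself. It assumes a hypothetical `BoundedBlockCollector` class and a hypothetical `maxBlocks` setting, and uses plain strings in place of boilerpipe's TextBlock objects. None of these names come from boilerpipe.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/**
 * Hypothetical collector that stops accepting text blocks once a cap is reached.
 * Strings stand in for boilerpipe's TextBlock objects; this is an illustration only.
 */
final class BoundedBlockCollector {
    private final int maxBlocks; // hypothetical limit, not an existing boilerpipe option
    private final List<String> blocks = new ArrayList<>();
    private boolean limitReached = false;

    BoundedBlockCollector(int maxBlocks) {
        if (maxBlocks < 0) {
            throw new IllegalArgumentException("maxBlocks must be >= 0");
        }
        this.maxBlocks = maxBlocks;
    }

    /** Stores the block and returns true, or drops it and returns false once the cap is hit. */
    boolean append(String blockText) {
        if (blocks.size() >= maxBlocks) {
            limitReached = true; // remember that input was truncated
            return false;        // all further content is ignored
        }
        blocks.add(blockText);
        return true;
    }

    /** True if at least one block was dropped because of the cap. */
    boolean isLimitReached() {
        return limitReached;
    }

    /** Read-only view of the blocks that were kept. */
    List<String> getBlocks() {
        return Collections.unmodifiableList(blocks);
    }
}
```

With a cap of 2, the first two `append` calls are kept and every later call is dropped, so memory use stays bounded by the cap regardless of input size. Exposing `isLimitReached()` lets a caller tell truncated output from complete output.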

Original issue reported on code.google.com by [email protected] on 25 Nov 2013 at 4:29

Attachments:

GoogleCodeExporter, Mar 24 '15 10:03

Please change type to "enhancement"

Original comment by [email protected] on 26 Nov 2013 at 8:13

GoogleCodeExporter, Mar 24 '15 10:03