python-boilerpipe
java.lang.OutOfMemoryError: Java heap space after multiple getHTML calls
I need to extract article bodies from raw HTML documents. My code is as simple as:
    from boilerpipe.extract import Extractor

    for html in htmls:
        extractor = Extractor(extractor='ArticleExtractor', html=html)
        extractor.getHTML()
After calling this method many times (around 10K), I get a java.lang.OutOfMemoryError:
    Traceback (most recent call last):
      File "test.py", line 228, in <module>
        extractor.getHTML()
      File "/Users/macuser/.virtualenvs/bro/lib/python2.7/site-packages/boilerpipe/extract/__init__.py", line 70, in getHTML
        return highlighter.process(self.source, self.data)
    jpype._jexception.OutOfMemoryErrorPyRaisable: java.lang.OutOfMemoryError: Java heap space
I looked into the code, and it looks like creating BoilerpipeSAXInput, HTMLHighlighter, and other Java instances causes this problem. Is there a way to fix this issue?
To reproduce this without 10K articles, simply reduce the heap size in boilerpipe.__init__:
    MAX_JVM_HEAP_SIZE_MBYTES = 4
    if jpype.isJVMStarted() != True:
        jars = []
        for top, dirs, files in os.walk(imp.find_module('boilerpipe')[1]+'/data'):
            for nm in files:
                jars.append(os.path.join(top, nm))
        jvm_args = [
            '-Xmx%dM' % MAX_JVM_HEAP_SIZE_MBYTES,
            "-Djava.class.path=%s" % os.pathsep.join(jars)
        ]
        jpype.startJVM(jpype.getDefaultJVMPath(), *jvm_args)
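The reproduction above depends only on the `-Xmx` option passed to the JVM. As a rough sketch, the argument-building part can be pulled into a small function so the heap cap is easy to vary while testing. The name `build_jvm_args`, its parameters, and the 128 MB default are hypothetical and not part of python-boilerpipe. Raising the cap would presumably only postpone the error if objects really accumulate.

```python
import os


def build_jvm_args(jar_dir, max_heap_mbytes=128):
    """Return JVM options: a heap cap plus a class path of every file under jar_dir.

    Hypothetical helper (not part of python-boilerpipe); the 128 MB default
    is an arbitrary illustration value.
    """
    jars = []
    for top, _dirs, files in os.walk(jar_dir):
        for name in sorted(files):
            jars.append(os.path.join(top, name))
    return [
        '-Xmx%dM' % max_heap_mbytes,                      # maximum Java heap size
        '-Djava.class.path=%s' % os.pathsep.join(jars),   # jars visible to the JVM
    ]
```

The returned list could then be unpacked into `jpype.startJVM(jpype.getDefaultJVMPath(), *args)` exactly as in the snippet above.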
Same issue here, after about 5k extractions from raw HTML.
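One possible workaround, offered only as a sketch and not as a fix for the underlying leak: run the extraction in a worker process that is replaced after a fixed number of documents, so each worker's JVM (and its heap) is discarded when the worker exits. This uses the standard library's `multiprocessing.Pool` with `maxtasksperchild`. The function names and the `tasks_per_worker` value are hypothetical. It also assumes that boilerpipe has not been imported in the parent process, so that the JVM is started inside each worker.

```python
from multiprocessing import Pool


def extract_article_html(html):
    """Extract the article body from one raw HTML string.

    boilerpipe is imported inside the worker (an assumption of this sketch)
    so that the JVM starts in the child process, not in the parent.
    """
    from boilerpipe.extract import Extractor
    extractor = Extractor(extractor='ArticleExtractor', html=html)
    return extractor.getHTML()


def extract_all(htmls, extract=extract_article_html, tasks_per_worker=500):
    """Apply `extract` to every document using a single recycled worker.

    The worker process is replaced after `tasks_per_worker` documents;
    500 is an arbitrary illustration value.
    """
    pool = Pool(processes=1, maxtasksperchild=tasks_per_worker)
    try:
        # chunksize=1 makes each document count as one task for recycling.
        return pool.map(extract, htmls, chunksize=1)
    finally:
        pool.close()
        pool.join()
```

On platforms that start workers by spawning a fresh interpreter, the call to `extract_all(htmls)` should sit under an `if __name__ == "__main__":` guard.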