java.lang.OutOfMemoryError: Java heap space after multiple getHTML calls

Open alibozorgkhan opened this issue 11 years ago • 1 comments

I need to extract article bodies from raw htmls. My code is as simple as:

for html in htmls:
    extractor = Extractor(extractor='ArticleExtractor', html=article)
    extractor.getHTML()

After calling a method of it, e.g. 10K times, I get java.lang.OutOfMemoryError error:

Traceback (most recent call last):
  File "test.py", line 228, in <module>
    extractor.getHTML()
  File "/Users/macuser/.virtualenvs/bro/lib/python2.7/site-packages/boilerpipe/extract/__init__.py", line 70, in getHTML
    return highlighter.process(self.source, self.data)
jpype._jexception.OutOfMemoryErrorPyRaisable: java.lang.OutOfMemoryError: Java heap space

I looked into the code and it looks like creating BoilerpipeSAXInput, HTMLHighlighter and other java instances causes this problem. Is there a way to fix this issue?

To reproduce this without 10K articles, simply reduce the heap size in boilerpipe.__init__:

MAX_JVM_HEAP_SIZE_MBYTES = 4

if jpype.isJVMStarted() != True:
    jars = []
    for top, dirs, files in os.walk(imp.find_module('boilerpipe')[1]+'/data'):
        for nm in files:
            jars.append(os.path.join(top, nm))

    jvm_args = [
        '-Xmx%dM' % MAX_JVM_HEAP_SIZE_MBYTES,
        "-Djava.class.path=%s" % os.pathsep.join(jars)
    ]
    jpype.startJVM(jpype.getDefaultJVMPath(), *jvm_args)

Dec 31 '14 23:12 alibozorgkhan

Same issue here, about 5k extractions from raw htmls

Sep 23 '16 03:09 joelthchao