python-boilerpipe
some urls will not work with celery
Hi,
I have a rather urgent problem and I hope you can help me. I'm trying to parse URLs/HTML via boilerpipe and celery: straightforward stuff, handing a task to a celery worker. However, some links work and some don't. If I call call_txt_extr with the URL 'http://t.co/XIDUuUIjPi', nothing comes back and the task disappears into a "soft" followed by a "hard" timeout in celery. If I do the same thing with the URL 'http://www.rezmanagement.nl', it works perfectly.
code:
from celery import Celery
from boilerpipe.extract import Extractor
from harvest.celery import app

app.config_from_object('harvest.celeryconfig')


def call_txt_extr():
    Extract_Text.soft_time_limit = 10
    Extract_Text.time_limit = 15
    Extract_Text.apply_async()


@app.task
def Extract_Text():
    URL = 'http://t.co/XIDUuUIjPi'
    # URL = 'http://www.rezmanagement.nl/'
    extractorType = "DefaultExtractor"
    # Extractor(extractor=extractorType, url=URL)
    print Extractor(extractor=extractorType, url=URL).getText()
    return
I've tried everything short of editing the Java code and found the following:
- The task / boilerpipe stops working at line 70 or so in the Extractor (__init__.py), at "self.source = BoilerpipeSAXInput(InputSource(reader)).getTextDocument()": it simply never returns the parsed text and the task then times out.
- Please understand it works perfectly with some URLs within celery, while others time out. If I remove the celery decorator (thus no longer having the task executed by celery), it works perfectly, so the URL is fine (Extractor can deal with the HTML etc.).
- If I define a celery Task class, configure the task to inherit from that class, and run the Extractor call from the class, this works in celery. However, this is not the way to call the Extractor. Furthermore, since the Extractor needs input, I would be fetching the same URL at every function call, which is highly unwanted and not how it is supposed to work.
So: this works, but it is not good code and highly unwanted, I think:
import celery

class taskclass(celery.Task):
    URL = 'http://t.co/XIDUuUIjPi'
    # URL = 'http://www.rezmanagement.nl'
    extractorType = "DefaultExtractor"
    print Extractor(extractor=extractorType, url=URL).getText()


def call_txt_extr():
    Extract_Text.soft_time_limit = 10
    Extract_Text.time_limit = 15
    Extract_Text.apply_async()


@app.task(base=taskclass)
def Extract_Text():
    URL = 'http://t.co/XIDUuUIjPi'
    # URL = 'http://www.rezmanagement.nl/'
    extractorType = "DefaultExtractor"
    # Extractor(extractor=extractorType, url=URL)
    print Extractor(extractor=extractorType, url=URL).getText()
    return
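For completeness, what I would actually want is something like the sketch below, where the URL is passed to the task as an argument instead of being hard-coded in a base class. This is only a rough sketch; the task name fetch_text is a placeholder made up here, not code I have actually run:

from boilerpipe.extract import Extractor
from harvest.celery import app

# Sketch only: soft/hard limits set on the task, URL passed in per call.
@app.task(soft_time_limit=10, time_limit=15)
def fetch_text(url, extractor_type="DefaultExtractor"):
    # This is the call that hangs for some URLs when run inside a celery worker.
    return Extractor(extractor=extractor_type, url=url).getText()

# Caller side:
# fetch_text.apply_async(args=['http://t.co/XIDUuUIjPi'])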
- updated JPype1
- updated nekohtml
- I cannot find any other instance of this issue on the internet.
I hope you can help me,
Kindest regards,
Roland Zoet
I encountered an error (see below) on the same line. In my case I think the issue is due to a race condition somewhere in the Java code. Try running your celery worker with --concurrency=1 (see the example after the traceback below) and see if it works. I don't have a solution for this.
...
    extractor = Extractor(extractor='ArticleExtractor', html=html)
  File "/usr/local/lib/python2.7/site-packages/boilerpipe/extract/__init__.py", line 62, in __init__
    self.source = BoilerpipeSAXInput(InputSource(reader)).getTextDocument()
Exception: <jpype._jclass.java.lang.NoClassDefFoundError object at 0x126ebb250>
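For example, something along these lines; the -A harvest part is only a guess based on the harvest.celery import in the snippets above, so adjust it to however you normally start your worker:

celery -A harvest worker --concurrency=1 --loglevel=info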
@Zman67 Did you find any solutions?