
some urls will not work with celery

Open Zman67 opened this issue 10 years ago • 2 comments

Hi,

I have a rather urgent problem that I hope you can help me with. I'm trying to parse URLs/HTML via boilerpipe and Celery: straightforward stuff, handing a task to a Celery worker. However, some links work and some don't. If I call call_txt_extr with the URL 'http://t.co/XIDUuUIjPi', it does not work and disappears into a "soft" followed by a "hard" timeout in Celery. If I do the same thing with 'http://www.rezmanagement.nl', it works perfectly.

code:

from celery import Celery
from boilerpipe.extract import Extractor
from harvest.celery import app

app.config_from_object('harvest.celeryconfig')

def call_txt_extr():
    Extract_Text.soft_time_limit = 10
    Extract_Text.time_limit = 15
    Extract_Text.apply_async()

@app.task
def Extract_Text():
    URL = 'http://t.co/XIDUuUIjPi'
    # URL = 'http://www.rezmanagement.nl/'
    extractorType = "DefaultExtractor"
    # Extractor(extractor=extractorType, url=URL)
    print Extractor(extractor=extractorType, url=URL).getText()
    return

I've tried everything short of editing the Java code and found the following:

  1. The task/boilerpipe stops at around line 70 of the Extractor (`__init__.py`), at `self.source = BoilerpipeSAXInput(InputSource(reader)).getTextDocument()`. It simply never returns the parsed text, and then the task times out.

  2. Please understand: it works perfectly with some URLs within Celery, while others time out. If I remove the Celery decorator (so the task is no longer executed by Celery), it works perfectly, so the URL itself is fine (Extractor can deal with the HTML etc.).

  3. If I define a Celery task class, configure the task to inherit from it, and run the Extractor call from the class body, it works in Celery. However, this is not the way the Extractor is meant to be called. Furthermore, since the Extractor needs input, I would be polling the same URL at every function call, which is highly unwanted and not how it is supposed to work.

So: the following works, but it is not good code and highly unwanted, I think:

class taskclass(celery.Task):
    URL = 'http://t.co/XIDUuUIjPi'
    # URL = 'http://www.rezmanagement.nl'
    extractorType = "DefaultExtractor"
    print Extractor(extractor=extractorType, url=URL).getText()

def call_txt_extr():
    Extract_Text.soft_time_limit = 10
    Extract_Text.time_limit = 15
    Extract_Text.apply_async()

@app.task(base=taskclass)
def Extract_Text():
    URL = 'http://t.co/XIDUuUIjPi'
    # URL = 'http://www.rezmanagement.nl/'
    extractorType = "DefaultExtractor"
    # Extractor(extractor=extractorType, url=URL)
    print Extractor(extractor=extractorType, url=URL).getText()
    return
  1. updated JPype1
  2. updated nekohtml
  3. searched, but cannot find any other report of this issue on the internet.
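[Editor's note on point 3 above: the print in the taskclass body runs exactly once, when the class statement is executed (e.g. at import time in the worker), not once per task invocation; that is likely why this variant appears to "work" while being the wrong place for the call. A minimal stand-alone illustration, unrelated to Celery itself:]

```python
calls = []

class taskclass_demo(object):
    # Statements in a class body run exactly once, when the `class`
    # statement itself is executed -- not on every task invocation.
    calls.append('class body')

def run_task():
    calls.append('task run')

run_task()
run_task()

print(calls)  # ['class body', 'task run', 'task run']
```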

I hope you can help me,

Kindest regards,

Roland Zoet

Zman67 avatar Nov 24 '14 17:11 Zman67

I encountered an error (see below) on the same line. In my case I think the issue is due to a race condition somewhere in the Java code. Try running your Celery worker with --concurrency=1 and see if it works. I don't have a solution for this.

...
    extractor = Extractor(extractor='ArticleExtractor', html=html)
  File "/usr/local/lib/python2.7/site-packages/boilerpipe/extract/__init__.py", line 62, in __init__
    self.source = BoilerpipeSAXInput(InputSource(reader)).getTextDocument()
Exception: <jpype._jclass.java.lang.NoClassDefFoundError object at 0x126ebb250>

andreip avatar Nov 13 '15 13:11 andreip

@Zman67 Did you find any solutions?

korycins avatar Oct 26 '17 15:10 korycins