
runtime error at AWS

Open mbnik opened this issue 8 years ago • 5 comments

Hi,

I was able to run the following code on my own Linux machine without a problem:


from nutch.nutch import Nutch, SeedClient, Server, JobClient
import nutch

# Connect to the Nutch REST server and register the seed URLs
sv = Server('http://localhost:8081')
sc = SeedClient(sv)
seed_urls = ('http://espn.go.com', 'http://www.espn.com')
sd = sc.create('espn-seed', seed_urls)

# Set up a crawl with crawl ID 'test' and config ID 'default'
nt = Nutch('default')
jc = JobClient(sv, 'test', 'default')
cc = nt.Crawl(sd, sc, jc)
while True:
    job = cc.progress()  # gets the current job if no progress, else iterates and makes progress
    if job is None:
        break

However, when I run the same code on AWS (Ubuntu 14.04), it fails with a runtime error. Here is the client-side log:


nutch.py: Response status: 200
nutch.py: Response JSON: {u'crawlId': u'test', u'args': {u'url_dir': u'/tmp/1456875353316-0'}, u'state': u'IDLE', u'result': None, u'msg': u'idle', u'type': u'GENERATE', u'id': u'test-default-GENERATE-1140031758', u'confId': u'default'}
nutch.py: GET Endpoint: /job/test-default-GENERATE-1140031758
nutch.py: GET Request data: {}
nutch.py: GET Request headers: {'Accept': 'application/json'}
nutch.py: Response headers: {'Date': 'Tue, 01 Mar 2016 23:36:35 GMT', 'Content-Length': '0', 'Server': 'Jetty(8.1.15.v20140411)'}
nutch.py: Response status: 204
Traceback (most recent call last):
  File "main.py", line 22, in <module>
    job = cc.progress() # gets the current job if no progress, else iterates and makes progress
  File "/usr/local/lib/python2.7/dist-packages/nutch/nutch.py", line 531, in progress
    jobInfo = currentJob.info()
  File "/usr/local/lib/python2.7/dist-packages/nutch/nutch.py", line 201, in info
    return self.server.call('get', '/job/' + self.id)
  File "/usr/local/lib/python2.7/dist-packages/nutch/nutch.py", line 160, in call
    raise error
nutch.nutch.NutchException: Unexpected server response: 204

To run the Python code, I started Nutch with /bin/nutch startserver. Here is the server-side log from that run:

Injector: starting at 2016-03-01 23:35:53
Injector: crawlDb: test/crawldb
Injector: urlDir: /tmp/1456875353316-0
Injector: Converting injected urls to crawl db entries.
Injector: overwrite: false
Injector: update: false
Injector: Total number of urls rejected by filters: 0
Injector: Total number of urls after normalization: 2
Injector: Total new urls injected: 2
Injector: finished at 2016-03-01 23:36:34, elapsed: 00:00:40
Generator: starting at 2016-03-01 23:36:35
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: running in local mode, generating exactly one partition.
Generator: Partitioning selected urls for politeness.
Generator: segment: test/segments/20160301233638
Generator: finished at 2016-03-01 23:36:40, elapsed: 00:00:05
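
One thing I noticed: the 204 response carries the date 23:36:35, which is the same second the Generator job reports starting, and the Generator then finishes at 23:36:40. That might mean the job lookup returned an empty body only momentarily, but I have not confirmed this. If that guess is right, a retry wrapper along these lines might work around it. This is only a sketch: call_with_retry and is_204 are hypothetical helpers of mine, not part of nutch-python, and it assumes both that the 204 is transient and that it is safe to call cc.progress() again after it has raised.

```python
import time


def call_with_retry(step, is_transient, retries=5, delay=1.0, sleep=time.sleep):
    """Call step(); if it raises an error that is_transient() accepts,
    wait `delay` seconds and try again, up to `retries` extra attempts."""
    for attempt in range(retries + 1):
        try:
            return step()
        except Exception as err:
            # Re-raise non-transient errors, or any error once retries run out.
            if attempt == retries or not is_transient(err):
                raise
            sleep(delay)


def is_204(err):
    # Assumption: the exception text contains the message seen in the
    # traceback, "Unexpected server response: 204".
    return "Unexpected server response: 204" in str(err)
```

In the crawl loop, the call job = cc.progress() would then become job = call_with_retry(cc.progress, is_204), with the rest of the loop unchanged.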


I would appreciate it if you could help.

Thanks

mbnik • Mar 01 '16 23:03