nutch-python
runtime error at AWS
Hi,
I was able to run the following code on my own Linux machine without a problem:
from nutch.nutch import Nutch
from nutch.nutch import SeedClient
from nutch.nutch import Server
from nutch.nutch import JobClient
import nutch
sv=Server('http://localhost:8081')
sc=SeedClient(sv)
seed_urls=('http://espn.go.com','http://www.espn.com')
sd= sc.create('espn-seed',seed_urls)
nt = Nutch('default')
jc = JobClient(sv, 'test', 'default')
cc = nt.Crawl(sd, sc, jc)
while True:
    job = cc.progress()  # gets the current job if no progress, else iterates and makes progress
    if job is None:
        break
However, when I run the same code on AWS (Ubuntu 14.04), it fails with a runtime error. Here is the runtime log:
nutch.py: Response status: 200
nutch.py: Response JSON: {u'crawlId': u'test', u'args': {u'url_dir': u'/tmp/1456875353316-0'}, u'state': u'IDLE', u'result': None, u'msg': u'idle', u'type': u'GENERATE', u'id': u'test-default-GENERATE-1140031758', u'confId': u'default'}
nutch.py: GET Endpoint: /job/test-default-GENERATE-1140031758
nutch.py: GET Request data: {}
nutch.py: GET Request headers: {'Accept': 'application/json'}
nutch.py: Response headers: {'Date': 'Tue, 01 Mar 2016 23:36:35 GMT', 'Content-Length': '0', 'Server': 'Jetty(8.1.15.v20140411)'}
nutch.py: Response status: 204
Traceback (most recent call last):
  File "main.py", line 22, in <module>
nutch.nutch.NutchException: Unexpected server response: 204
To run the Python code, I started Nutch with /bin/nutch startserver. Here is the server output from the run:
Injector: starting at 2016-03-01 23:35:53
Injector: crawlDb: test/crawldb
Injector: urlDir: /tmp/1456875353316-0
Injector: Converting injected urls to crawl db entries.
Injector: overwrite: false
Injector: update: false
Injector: Total number of urls rejected by filters: 0
Injector: Total number of urls after normalization: 2
Injector: Total new urls injected: 2
Injector: finished at 2016-03-01 23:36:34, elapsed: 00:00:40
Generator: starting at 2016-03-01 23:36:35
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: running in local mode, generating exactly one partition.
Generator: Partitioning selected urls for politeness.
Generator: segment: test/segments/20160301233638
Generator: finished at 2016-03-01 23:36:40, elapsed: 00:00:05
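One detail in the two logs: the 204 response is stamped 23:36:35 GMT, the same second the Generator started, and the Generator did not finish until 23:36:40. So the empty response may be transient, arriving while the job was still starting up. Below is a minimal sketch of a workaround under that assumption, not a confirmed fix. `call_with_retry` is a hypothetical helper that is not part of nutch-python; only `cc.progress()` and `nutch.nutch.NutchException` come from the code and traceback above.

```python
import time


def call_with_retry(fn, retry_on=Exception, attempts=5, delay=1.0, sleep=time.sleep):
    """Call fn(); if it raises `retry_on`, wait `delay` seconds and try again.

    Hypothetical helper (not part of nutch-python). Gives up and re-raises
    the last exception after `attempts` failed calls.
    """
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == attempts:
                raise  # out of retries: surface the original error
            sleep(delay)


# Assumed usage inside the polling loop from the script above
# (untested against a real Nutch server):
#
#   job = call_with_retry(cc.progress, retry_on=nutch.nutch.NutchException)
```

If the 204 persists across all retries, the exception is still raised, which would suggest the problem is not just timing.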
I would appreciate it if you could help.
Thanks