django-dynamic-scraper
Bug with Scrapyd/DDS
If you have configured the celery interval for checking for new jobs to be shorter than the time a spider job needs to finish, you will end up running the same spider job multiple times. The problem is here:
```python
def _pending_jobs(self, spider):
    # Omit scheduling new jobs if there are still pending jobs for the same spider
    resp = urllib.request.urlopen('http://localhost:6800/listjobs.json?project=default')
    data = json.loads(resp.read().decode('utf-8'))
    if 'pending' in data:
        for item in data['pending']:
            if item['spider'] == spider:
                return True
    return False
```
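For context, the timing conflict can be reproduced with a celery beat schedule along these lines. This is only an illustration: the task path `myproject.tasks.run_spiders` is a placeholder for whatever task triggers the DDS spider scheduling, and the interval is deliberately shorter than a typical spider run.

```python
# Illustrative only: the task name is a placeholder, not an actual DDS task path.
CELERYBEAT_SCHEDULE = {
    'check_for_new_scraper_jobs': {
        'task': 'myproject.tasks.run_spiders',
        'schedule': 60.0,  # checks every minute, while one spider run may take several minutes
    },
}
```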
I fixed it with:

```python
def _pending_jobs(self, spider):
    # Omit scheduling new jobs if there are still pending or running jobs for the same spider
    resp = urllib.request.urlopen('http://localhost:6800/listjobs.json?project=default')
    data = json.load(resp)
    # Check 'running' as well as 'pending'; use .get() so a missing key cannot raise a KeyError
    for item in data.get('pending', []):
        if item['spider'] == spider:
            return True
    for item in data.get('running', []):
        if item['spider'] == spider:
            return True
    return False
```
But a new problem arises. If there are free Scrapyd slots and many independent jobs for the same spider, and the spider job already running in one slot takes a long time, the remaining free slots cannot be used because the queue is blocked by that running spider. I think the scheduling logic should be reworked to take not only the spider name but also its args/kwargs into account.
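One possible direction, sketched below purely as an illustration (none of these helpers exist in DDS): when scheduling through Scrapyd's `schedule.json`, pass a deterministic `jobid` derived from the spider name and its arguments, and let the pending-jobs check compare that id instead of only the spider name. Scrapyd's `schedule.json` accepts a `jobid` parameter and `listjobs.json` reports it back as `id`, so identical spider/argument combinations can be deduplicated while different argument sets still get free slots.

```python
import hashlib
import json
import urllib.parse
import urllib.request

SCRAPYD = 'http://localhost:6800'  # same Scrapyd instance as above


def _job_id(spider, kwargs):
    # Hypothetical helper: derive a stable job id from the spider name and its
    # scheduling arguments, so "the same job" means name + args, not just the name.
    digest = hashlib.sha1(json.dumps(kwargs, sort_keys=True).encode('utf-8')).hexdigest()[:12]
    return '%s-%s' % (spider, digest)


def _job_is_queued(job_id):
    # True if a job with this exact id is already pending or running.
    resp = urllib.request.urlopen(SCRAPYD + '/listjobs.json?project=default')
    data = json.load(resp)
    for state in ('pending', 'running'):
        for item in data.get(state, []):
            if item.get('id') == job_id:
                return True
    return False


def schedule_spider(spider, **kwargs):
    # Schedule a run only if the same spider/args combination is not queued yet;
    # other argument combinations of the same spider can still use free slots.
    job_id = _job_id(spider, kwargs)
    if _job_is_queued(job_id):
        return None
    params = dict(project='default', spider=spider, jobid=job_id, **kwargs)
    body = urllib.parse.urlencode(params).encode('utf-8')
    return urllib.request.urlopen(SCRAPYD + '/schedule.json', data=body)
```

Whether the id comes from a hash or from DDS's own runtime tables is an implementation detail; the point is that the deduplication key should include the arguments, not only the spider name.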