
Bug with Scrapyd/DDS

Open · bezkos opened this issue · 0 comments

If you configure the Celery interval for checking for new jobs to be shorter than the time a spider job needs to finish, you will end up running the same spider job multiple times.
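For illustration, a beat schedule along these lines triggers the race whenever a single spider run outlasts the polling interval (the task path is a placeholder for a user-defined task, not DDS's actual module):

```python
from datetime import timedelta

# Hypothetical schedule: beat fires every 5 minutes, but a single spider
# run may take 20+ minutes. Because the duplicate check below ignores
# running jobs, every tick schedules a fresh copy of the same spider.
CELERYBEAT_SCHEDULE = {
    'run_spiders_every_5_min': {
        'task': 'myapp.tasks.run_spiders',  # placeholder task path
        'schedule': timedelta(minutes=5),
    },
}
```

The problem is in `_pending_jobs()`: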

```python
def _pending_jobs(self, spider):
    # Omit scheduling new jobs if there are still pending jobs for the same spider
    resp = urllib.request.urlopen('http://localhost:6800/listjobs.json?project=default')
    data = json.loads(resp.read().decode('utf-8'))
    if 'pending' in data:
        for item in data['pending']:
            if item['spider'] == spider:
                return True
    return False
```
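For context, Scrapyd's `listjobs.json` splits jobs into separate `pending`, `running`, and `finished` lists. An illustrative response might look roughly like this (ids and timestamps made up):

```json
{
    "status": "ok",
    "pending": [],
    "running": [{"id": "422e608f9f28cef127b3d5ef93fe9399", "spider": "article_spider",
                 "pid": 93956, "start_time": "2017-07-08 22:01:14.234"}],
    "finished": []
}
```

Since the check above only looks at `pending`, a job that has already moved to `running` is invisible to it, and a duplicate gets scheduled.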

I fixed it with:

```python
def _pending_jobs(self, spider):
    # Omit scheduling new jobs if there are still pending OR running jobs
    # for the same spider
    resp = urllib.request.urlopen('http://localhost:6800/listjobs.json?project=default')
    data = json.load(resp)
    for state in ('pending', 'running'):
        for item in data.get(state, []):
            if item['spider'] == spider:
                return True
    return False
```

But a new problem arise. If you have free scrapyd slots, many independent jobs of same spider and your spider job running in one slot is too long , u cant use rest free slots cause your queue is blocked from running spider. I think we must rework scheduling logic to take into account except spider name and args/kwargs too.
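This is only a sketch, not DDS code: it relies on Scrapyd's `schedule.json` endpoint accepting a caller-supplied `jobid` parameter, and the helper names (`_job_key`, `schedule_job`) are made up. The idea is to fold the spider's args into the job id at scheduling time, so the duplicate check only blocks an exact spider+args match:

```python
import hashlib
import json
import urllib.parse
import urllib.request

SCRAPYD = 'http://localhost:6800'  # same Scrapyd endpoint as above


def _job_key(spider, kwargs):
    # Stable identifier from spider name + arguments, so two runs of the
    # same spider with different args do not block each other.
    payload = json.dumps(kwargs, sort_keys=True)
    return '%s-%s' % (spider, hashlib.md5(payload.encode('utf-8')).hexdigest())


def schedule_job(spider, **kwargs):
    # Pass the composite key as Scrapyd's jobid so it comes back verbatim
    # in listjobs.json responses.
    params = {'project': 'default', 'spider': spider,
              'jobid': _job_key(spider, kwargs)}
    params.update(kwargs)
    data = urllib.parse.urlencode(params).encode('utf-8')
    return json.load(urllib.request.urlopen(SCRAPYD + '/schedule.json', data=data))


def _pending_jobs(spider, **kwargs):
    # Block only if a job with the same spider AND the same args is
    # already pending or running.
    key = _job_key(spider, kwargs)
    data = json.load(urllib.request.urlopen(SCRAPYD + '/listjobs.json?project=default'))
    for state in ('pending', 'running'):
        for item in data.get(state, []):
            if item.get('id') == key:
                return True
    return False
```

With something like that in place, two runs of the same spider with different arguments would get different job keys and could occupy free slots side by side, while a true duplicate would still be skipped.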

bezkos · Jul 08 '17 22:07