
URLs in start_urls are not affected

bezkos opened this issue 7 years ago · 2 comments

I have a spider that crawls only detail pages, and its requests are never skipped by this middleware.

bezkos avatar May 30 '17 18:05 bezkos

Good catch; we need to add a process_start_requests method as well.
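Roughly, an untested sketch of what that could look like on the spider middleware side (the class name here is illustrative, not the actual scrapy-crawl-once code):

# Untested sketch, not the actual scrapy-crawl-once implementation:
# a spider middleware hook that tags start requests with crawl_once
# so they can be deduplicated on later runs as well.
class CrawlOnceStartRequestsSketch:
    def process_start_requests(self, start_requests, spider):
        for request in start_requests:
            # Keep any explicit per-request choice the spider already made.
            request.meta.setdefault('crawl_once', True)
            yield request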

kmike avatar May 30 '17 19:05 kmike

@bezkos Are you using meta={'crawl_once': True}? I tested the middleware with this simple spider, and it works correctly.

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/tag/humor/',
    ]

    def start_requests(self):
        # Tag every start request so CrawlOnceMiddleware records it and
        # skips it on later runs; plain start_urls requests are not tagged.
        for url in self.start_urls:
            yield scrapy.Request(url, meta={'crawl_once': True})

    def parse(self, response):
        yield {
            'title': response.css('h1 a::text').extract_first(),
        }

First run - request sent.

{'crawl_once/initial': 0,
 'crawl_once/stored': 1,
 'downloader/request_bytes': 231,
 'downloader/request_count': 1}

Second run - request ignored.

{'crawl_once/ignored': 1,
 'crawl_once/initial': 1,
 'downloader/exception_count': 1,
 'downloader/exception_type_count/scrapy.exceptions.IgnoreRequest': 1}

Note: requests generated from start_urls do not have crawl_once in their meta dictionary by default. To add it, override the start_requests method, as in the spider above.
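Alternatively, if I remember the README correctly, there is a CRAWL_ONCE_DEFAULT setting that makes crawl_once the default for all requests, including those built from start_urls. A rough settings.py sketch (double-check the exact names and middleware orders against the README):

# settings.py sketch; orders taken from the README as I recall them.
SPIDER_MIDDLEWARES = {
    'scrapy_crawl_once.CrawlOnceMiddleware': 100,
}
DOWNLOADER_MIDDLEWARES = {
    'scrapy_crawl_once.CrawlOnceMiddleware': 50,
}
# Treat every request as crawl_once unless its meta sets it to False explicitly.
CRAWL_ONCE_DEFAULT = True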

Can you explain what problem you had?

Verz1Lka avatar Jun 05 '18 11:06 Verz1Lka