The engine doesn't wait for the Spider's generator parse to run before running process_spider_output

Open trungtin opened this issue 3 years ago • 3 comments

Description

According to the architecture documentation, the spider middleware functions should run in this order:

process_spider_input -> Spider's parse -> process_spider_output

but if parse is a generator function, the order becomes:

process_spider_input -> process_spider_output -> Spider's parse

and I don't see this mentioned anywhere in the docs.

Steps to Reproduce

This simple spider reproduces the result:

import scrapy
from scrapy import Request

class YcombinatorSpider(scrapy.Spider):
    name = 'ycombinator_news'
    allowed_domains = ['news.ycombinator.com']
    start_urls = ['http://news.ycombinator.com/']

    def parse(self, response):
        # Printed only once something starts iterating this generator.
        print('parse')
        # Select the href attribute so urljoin() receives a URL, not <a> markup.
        follow = response.xpath('//a[@class="titlelink"]/@href').getall()
        for url in follow:
            follow_url = response.urljoin(url)
            yield Request(url=follow_url, callback=self.populate_item)

    def populate_item(self, response):
        print('populate_item')

or this replit: https://replit.com/@trungtin1/scrapy-middleware-reproduce

Versions

scrapy==2.6.1

trungtin avatar Jul 01 '22 15:07 trungtin

Could you provide a complete example, spider middleware and log output included? (you can also target toscrape.com for testing purposes)

Gallaecio avatar Jul 01 '22 15:07 Gallaecio

You can go to this replit https://replit.com/@trungtin1/scrapy-middleware-reproduce, switch to the Shell tab, and run scrapy crawl toscrape. You will see that "process spider output" is logged first.

trungtin avatar Jul 01 '22 15:07 trungtin

I see.

Because the result that process_spider_output receives is that generator, process_spider_output can run code before iterating result, and parse does nothing until result is iterated. (your example with additional prints)
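The effect can be seen without Scrapy at all. Below is a minimal sketch in plain Python, assuming a simplified call order: `LoggingSpiderMiddleware`, `run_callback` and the `events` list are hypothetical names made up for illustration, and `run_callback` is only a stand-in for what the engine does, not Scrapy's actual code. Only the two method names and their signatures follow the spider middleware interface.

```python
events = []  # records the order in which each stage actually runs


def parse(response):
    # Generator function: calling it creates a generator object and
    # runs none of this body until the first item is requested.
    events.append("parse")
    yield {"url": response}


class LoggingSpiderMiddleware:
    # Hypothetical middleware; only the method names and signatures
    # mirror Scrapy's spider middleware interface.
    def process_spider_input(self, response, spider):
        events.append("process_spider_input")
        return None

    def process_spider_output(self, response, result, spider):
        # This line runs before anything has been pulled from `result`.
        events.append("process_spider_output")
        for item in result:  # the body of parse() starts executing here
            yield item


def run_callback(middleware, response, spider=None):
    # Simplified stand-in for the engine's call order (an assumption).
    middleware.process_spider_input(response, spider)
    result = parse(response)  # no parse() code has run yet
    output = middleware.process_spider_output(response, result, spider)
    return list(output)  # consuming the output finally drives parse()


items = run_callback(LoggingSpiderMiddleware(), "http://example.com/")
print(events)
# ['process_spider_input', 'process_spider_output', 'parse']
```

If parse returned a list instead of yielding, its body would run at the `parse(response)` call and "parse" would be recorded before "process_spider_output".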

I guess it makes sense to clarify this in the documentation of process_spider_output, and possibly in the architecture page.

Gallaecio avatar Jul 01 '22 16:07 Gallaecio