The engine doesn't wait for the Spider's generator-based parse before running process_spider_output
Description
According to the architecture documentation, the spider middleware methods and the spider callback should run in this order:
process_spider_input -> Spider's parse -> process_spider_output
but if parse is a generator function, the order becomes:
process_spider_input -> process_spider_output -> Spider's parse
and I don't see this behavior mentioned anywhere in the docs.
Steps to Reproduce
This simple spider reproduces the issue:
import scrapy
from scrapy import Request


class YcombinatorSpider(scrapy.Spider):
    name = 'ycombinator_news'
    allowed_domains = ['news.ycombinator.com']
    start_urls = ['http://news.ycombinator.com/']

    def parse(self, response):
        print('parse')
        # select the href attributes, not the full <a> elements
        follow = response.xpath('//a[@class="titlelink"]/@href').getall()
        for url in follow:
            follow_url = response.urljoin(url)
            yield Request(url=follow_url, callback=self.populate_item)

    def populate_item(self, response):
        print('populate_item')
or this replit: https://replit.com/@trungtin1/scrapy-middleware-reproduce
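For completeness, a minimal spider middleware that logs the call order might look like the sketch below (the class name and log messages are illustrative, not taken from the replit; it would be enabled via the SPIDER_MIDDLEWARES setting):

```python
import logging

logger = logging.getLogger(__name__)


class OrderLoggingMiddleware:
    """Illustrative spider middleware that logs when each hook runs."""

    def process_spider_input(self, response, spider):
        logger.info("process spider input")
        return None  # continue normal processing

    def process_spider_output(self, response, result, spider):
        # Because this method is itself a generator, this line runs as soon
        # as the engine starts iterating the middleware chain's output --
        # before the spider's generator-based parse body has executed.
        logger.info("process spider output")
        for item_or_request in result:
            yield item_or_request
```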
Versions
scrapy==2.6.1
Could you provide a complete example, spider middleware and log output included? (you can also target toscrape.com for testing purposes)
You can go to this replit: https://replit.com/@trungtin1/scrapy-middleware-reproduce
Switch to the shell tab and run scrapy crawl toscrape,
and you will see that "process spider output" is logged first.
I see.
Because the result that process_spider_output receives is that generator, process_spider_output can run code before iterating over the result, and until the result is iterated, parse does nothing. (See your example with the additional prints.)
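The plain-Python behavior being described can be sketched without Scrapy at all (names are illustrative): a generator function's body only runs when the generator is iterated, so a generator-based middleware method's code executes before the callback's code does.

```python
events = []


def parse():
    # Stands in for a generator-based spider callback.
    events.append("parse")
    yield "item"


def process_spider_output(result):
    # Stands in for a middleware hook that is also a generator function.
    events.append("process_spider_output")
    for x in result:
        yield x


# Calling the functions only creates generator objects; no body has run yet.
output = process_spider_output(parse())
assert events == []

# Iteration drives execution: the middleware body runs first, and parse's
# body only runs once the middleware iterates its input.
items = list(output)
assert events == ["process_spider_output", "parse"]
```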
I guess it makes sense to clarify this in the documentation of process_spider_output, and possibly on the architecture page.