
Exceptions in middleware don't return exit code 1 in `scrapy crawl` & `scrapy check`

Open dpfeif opened this issue 6 years ago • 5 comments

Description

If a middleware raises an exception, running `scrapy crawl` or `scrapy check` prints the traceback to the shell but exits with code 0 instead of the expected 1.

Steps to Reproduce

  1. Set up the tutorial up to here

  2. Create a minimal middleware that raises an exception in middlewares.py

class BreakingMiddleware:
    def __init__(self):
        raise Exception("uhoh")
  3. Add the middleware to the quotes spider and add a contract to the parse function
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    custom_settings = {
        "SPIDER_MIDDLEWARES" : {
            "tutorial.middlewares.BreakingMiddleware": 100,
        }
    }

    def start_requests(self):
        urls = [
            "http://quotes.toscrape.com/page/1/",
            "http://quotes.toscrape.com/page/2/",
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        """
        @url http://quotes.toscrape.com/page/1/
        @returns items 10 10
        @returns requests 10 10
        """

        page = response.url.split("/")[-2]
        filename = "quotes-%s.html" % page
        with open(filename, "wb") as f:
            f.write(response.body)
        self.log("Saved file %s" % filename)
  4. Execute scrapy check or scrapy crawl quotes

  5. Execute echo $?

Expected behavior: Exit code 1

Actual behavior: Exit code 0
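
For comparison, an unhandled exception in a plain Python process does propagate a non-zero exit code to the shell, which is what one would expect from scrapy crawl and scrapy check here (a minimal sketch, no Scrapy involved):

```shell
# An unhandled Python exception makes the interpreter exit with code 1;
# the reporter expects the scrapy commands above to behave the same way.
if python3 -c 'raise Exception("uhoh")'; then
    status=0
else
    status=$?
fi
echo "exit status: $status"   # prints: exit status: 1
```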

Reproduces how often: 100%

Versions

Scrapy       : 1.8.0
lxml         : 4.4.2.0
libxml2      : 2.9.4
cssselect    : 1.1.0
parsel       : 1.5.2
w3lib        : 1.21.0
Twisted      : 19.10.0
Python       : 3.8.0 (default, Nov 26 2019, 14:40:47) - [Clang 10.0.1 (clang-1001.0.46.4)]
pyOpenSSL    : 19.1.0 (OpenSSL 1.1.1d  10 Sep 2019)
cryptography : 2.8
Platform     : macOS-10.15.2-x86_64-i386-64bit

Additional context

scrapy check logs:

----------------------------------------------------------------------
Ran 0 contracts in 0.000s

OK
Unhandled error in Deferred:

Traceback (most recent call last):
  File "/Users/dpf/Public/break-scrapy-check/py3/lib/python3.8/site-packages/scrapy/crawler.py", line 184, in crawl
    return self._crawl(crawler, *args, **kwargs)
  File "/Users/dpf/Public/break-scrapy-check/py3/lib/python3.8/site-packages/scrapy/crawler.py", line 188, in _crawl
    d = crawler.crawl(*args, **kwargs)
  File "/Users/dpf/Public/break-scrapy-check/py3/lib/python3.8/site-packages/twisted/internet/defer.py", line 1613, in unwindGenerator
    return _cancellableInlineCallbacks(gen)
  File "/Users/dpf/Public/break-scrapy-check/py3/lib/python3.8/site-packages/twisted/internet/defer.py", line 1529, in _cancellableInlineCallbacks
    _inlineCallbacks(None, g, status)
--- <exception caught here> ---
  File "/Users/dpf/Public/break-scrapy-check/py3/lib/python3.8/site-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
    result = g.send(result)
  File "/Users/dpf/Public/break-scrapy-check/py3/lib/python3.8/site-packages/scrapy/crawler.py", line 86, in crawl
    self.engine = self._create_engine()
  File "/Users/dpf/Public/break-scrapy-check/py3/lib/python3.8/site-packages/scrapy/crawler.py", line 111, in _create_engine
    return ExecutionEngine(self, lambda _: self.stop())
  File "/Users/dpf/Public/break-scrapy-check/py3/lib/python3.8/site-packages/scrapy/core/engine.py", line 70, in __init__
    self.scraper = Scraper(crawler)
  File "/Users/dpf/Public/break-scrapy-check/py3/lib/python3.8/site-packages/scrapy/core/scraper.py", line 69, in __init__
    self.spidermw = SpiderMiddlewareManager.from_crawler(crawler)
  File "/Users/dpf/Public/break-scrapy-check/py3/lib/python3.8/site-packages/scrapy/middleware.py", line 53, in from_crawler
    return cls.from_settings(crawler.settings, crawler)
  File "/Users/dpf/Public/break-scrapy-check/py3/lib/python3.8/site-packages/scrapy/middleware.py", line 35, in from_settings
    mw = create_instance(mwcls, settings, crawler)
  File "/Users/dpf/Public/break-scrapy-check/py3/lib/python3.8/site-packages/scrapy/utils/misc.py", line 146, in create_instance
    return objcls(*args, **kwargs)
  File "/Users/dpf/Public/break-scrapy-check/tutorial/tutorial/middlewares.py", line 10, in __init__
    raise Exception("uhoh")
builtins.Exception: uhoh

scrapy crawl quotes logs:

2020-01-28 11:29:40 [scrapy.utils.log] INFO: Scrapy 1.8.0 started (bot: tutorial)
2020-01-28 11:29:40 [scrapy.utils.log] INFO: Versions: lxml 4.4.2.0, libxml2 2.9.4, cssselect 1.1.0, parsel 1.5.2, w3lib 1.21.0, Twisted 19.10.0, Python 3.8.0 (default, Nov 26 2019, 14:40:47) - [Clang 10.0.1 (clang-1001.0.46.4)], pyOpenSSL 19.1.0 (OpenSSL 1.1.1d  10 Sep 2019), cryptography 2.8, Platform macOS-10.15.2-x86_64-i386-64bit
2020-01-28 11:29:40 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'tutorial', 'NEWSPIDER_MODULE': 'tutorial.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['tutorial.spiders']}
2020-01-28 11:29:40 [scrapy.extensions.telnet] INFO: Telnet Password: c7073899ef38fd40
2020-01-28 11:29:40 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats']
2020-01-28 11:29:40 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
Unhandled error in Deferred:
2020-01-28 11:29:40 [twisted] CRITICAL: Unhandled error in Deferred:

Traceback (most recent call last):
  File "/Users/dpf/Public/break-scrapy-check/py3/lib/python3.8/site-packages/scrapy/crawler.py", line 184, in crawl
    return self._crawl(crawler, *args, **kwargs)
  File "/Users/dpf/Public/break-scrapy-check/py3/lib/python3.8/site-packages/scrapy/crawler.py", line 188, in _crawl
    d = crawler.crawl(*args, **kwargs)
  File "/Users/dpf/Public/break-scrapy-check/py3/lib/python3.8/site-packages/twisted/internet/defer.py", line 1613, in unwindGenerator
    return _cancellableInlineCallbacks(gen)
  File "/Users/dpf/Public/break-scrapy-check/py3/lib/python3.8/site-packages/twisted/internet/defer.py", line 1529, in _cancellableInlineCallbacks
    _inlineCallbacks(None, g, status)
--- <exception caught here> ---
  File "/Users/dpf/Public/break-scrapy-check/py3/lib/python3.8/site-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
    result = g.send(result)
  File "/Users/dpf/Public/break-scrapy-check/py3/lib/python3.8/site-packages/scrapy/crawler.py", line 86, in crawl
    self.engine = self._create_engine()
  File "/Users/dpf/Public/break-scrapy-check/py3/lib/python3.8/site-packages/scrapy/crawler.py", line 111, in _create_engine
    return ExecutionEngine(self, lambda _: self.stop())
  File "/Users/dpf/Public/break-scrapy-check/py3/lib/python3.8/site-packages/scrapy/core/engine.py", line 70, in __init__
    self.scraper = Scraper(crawler)
  File "/Users/dpf/Public/break-scrapy-check/py3/lib/python3.8/site-packages/scrapy/core/scraper.py", line 69, in __init__
    self.spidermw = SpiderMiddlewareManager.from_crawler(crawler)
  File "/Users/dpf/Public/break-scrapy-check/py3/lib/python3.8/site-packages/scrapy/middleware.py", line 53, in from_crawler
    return cls.from_settings(crawler.settings, crawler)
  File "/Users/dpf/Public/break-scrapy-check/py3/lib/python3.8/site-packages/scrapy/middleware.py", line 35, in from_settings
    mw = create_instance(mwcls, settings, crawler)
  File "/Users/dpf/Public/break-scrapy-check/py3/lib/python3.8/site-packages/scrapy/utils/misc.py", line 146, in create_instance
    return objcls(*args, **kwargs)
  File "/Users/dpf/Public/break-scrapy-check/tutorial/tutorial/middlewares.py", line 10, in __init__
    raise Exception("uhoh")
builtins.Exception: uhoh

2020-01-28 11:29:40 [twisted] CRITICAL:
Traceback (most recent call last):
  File "/Users/dpf/Public/break-scrapy-check/py3/lib/python3.8/site-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
    result = g.send(result)
  File "/Users/dpf/Public/break-scrapy-check/py3/lib/python3.8/site-packages/scrapy/crawler.py", line 86, in crawl
    self.engine = self._create_engine()
  File "/Users/dpf/Public/break-scrapy-check/py3/lib/python3.8/site-packages/scrapy/crawler.py", line 111, in _create_engine
    return ExecutionEngine(self, lambda _: self.stop())
  File "/Users/dpf/Public/break-scrapy-check/py3/lib/python3.8/site-packages/scrapy/core/engine.py", line 70, in __init__
    self.scraper = Scraper(crawler)
  File "/Users/dpf/Public/break-scrapy-check/py3/lib/python3.8/site-packages/scrapy/core/scraper.py", line 69, in __init__
    self.spidermw = SpiderMiddlewareManager.from_crawler(crawler)
  File "/Users/dpf/Public/break-scrapy-check/py3/lib/python3.8/site-packages/scrapy/middleware.py", line 53, in from_crawler
    return cls.from_settings(crawler.settings, crawler)
  File "/Users/dpf/Public/break-scrapy-check/py3/lib/python3.8/site-packages/scrapy/middleware.py", line 35, in from_settings
    mw = create_instance(mwcls, settings, crawler)
  File "/Users/dpf/Public/break-scrapy-check/py3/lib/python3.8/site-packages/scrapy/utils/misc.py", line 146, in create_instance
    return objcls(*args, **kwargs)
  File "/Users/dpf/Public/break-scrapy-check/tutorial/tutorial/middlewares.py", line 10, in __init__
    raise Exception("uhoh")
Exception: uhoh

dpfeif avatar Jan 28 '20 10:01 dpfeif

I think that's intended, because a crawl doesn't stop on these exceptions. That's the same as with exceptions in the request callbacks: they're logged, but the crawl continues.

kmike avatar Jan 31 '20 18:01 kmike

What about scrapy check returning 0 even if the check could not be performed?

dpfeif avatar Feb 10 '20 15:02 dpfeif

Seeing the same issue. scrapy check fails but the exit code is still 0, so when it is used in CI/CD (for us, Bitbucket Pipelines) the error goes unnoticed.
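
Until this is fixed, one workaround for a CI step is to scan the command's output for the failure signatures visible in the logs above and fail the step manually. A sketch (the grep patterns are assumptions based on this thread's logs; extend them for your setup):

```shell
# Wrap a command and fail when known failure markers appear in its
# combined output, even if the command itself exits 0.
run_checked() {
    out=$("$@" 2>&1)
    rc=$?
    printf '%s\n' "$out"
    if printf '%s\n' "$out" | grep -qE 'Unhandled error in Deferred|FAILED'; then
        return 1
    fi
    return $rc
}

# In the pipeline step:
# run_checked scrapy check || exit 1
```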

oizik avatar Apr 12 '21 09:04 oizik

Have you found a way to solve that?

jl00080 avatar Jul 05 '22 14:07 jl00080

Any update on this? I think Scrapy should at least be configurable to return exit code 1, since that would tell orchestrators like Airflow that something went wrong and therefore stop the dependent tasks.
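
Until there is a supported option, a wrapper script can enforce the exit code for the orchestrator. A sketch (the marker strings are assumptions taken from the logs above, not an official Scrapy contract):

```python
import subprocess
import sys

# Log signatures that indicate a failed run; extend as needed.
FAILURE_MARKERS = ("Unhandled error in Deferred", "CRITICAL:")

def run_with_exit_code(cmd):
    """Run cmd, echo its combined output, and return 1 when a failure
    marker appears in the output, even if the process itself exited 0."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    output = proc.stdout + proc.stderr
    sys.stdout.write(output)
    if any(marker in output for marker in FAILURE_MARKERS):
        return 1
    return proc.returncode

# In the orchestrated task, something like:
# sys.exit(run_with_exit_code(["scrapy", "crawl", "quotes"]))
```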

Jdiego-veritas avatar Oct 24 '23 23:10 Jdiego-veritas