Change extensions/spiders/settings initialisation order
This is an implementation of @chekunkov's suggested changes from #1305. I will upload the commits one by one over the next few minutes so we can see which tests break at which point.
1: Moving the extensions' `__init__` into `Crawler.crawl()`
Two tests fail because they essentially test the implementation that we just changed. The functionality they're testing (spider settings) should still be intact.
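For context, here's a minimal, self-contained sketch of the user-facing behaviour at stake: a setting declared on the spider (`CLOSESPIDER_TIMEOUT`, read by the CloseSpider extension) has to be applied before the extensions are built, which is what moving the extension initialisation into `crawl()` is meant to guarantee. The spider and URL below are made up for illustration.

```python
import scrapy
from scrapy.crawler import CrawlerProcess


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = ['http://quotes.toscrape.com']
    # Applied via Spider.update_settings(); the CloseSpider extension only
    # sees this value if the extensions are created after that call.
    custom_settings = {'CLOSESPIDER_TIMEOUT': 10}

    def parse(self, response):
        for text in response.css('span.text::text').getall():
            yield {'text': text}


if __name__ == '__main__':
    process = CrawlerProcess()
    process.crawl(QuotesSpider)
    process.start()
```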
2: Moving `Spider.update_settings()` into `Crawler.crawl()`
Three new tests fail because we're trying to close a `Downloader` that is not running. This could also be an issue with the new `ExecutionEngine.close()` method?
One of the previously failing tests now fails earlier because the spider settings are not applied in `__init__()`. It is no longer possible to change the `STATS_CLASS` from the spider at this point.
3: Initialise the spider before calling its `update_settings()` method
I have not yet made `update_settings()` an instance method.
No new failing tests.
4: Allow `update_settings()` to be either a class or an instance method
No new failing tests.
The tests that fail with "Tried to stop a LoopingCall that was not running." all have two calls to `Crawler.crawl()`.
I think the real problem is that we're trying to write to frozen settings (or that we try to freeze settings twice). The error then manifests because we hit the `except` clause, where the `self.engine.close()` call tries to close a pre-existing (and already finished) engine.
These three tests pass when I comment out the `self.settings.freeze()` line.
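For reference, writing to frozen settings raises `TypeError` via the public `Settings` API, which is the kind of error that would send us into the `except` clause above (minimal sketch; the exact message may vary between versions):

```python
from scrapy.settings import Settings

settings = Settings({'BOT_NAME': 'demo'})
settings.freeze()

try:
    # a second crawl() attempt writes to the already-frozen settings
    settings.set('BOT_NAME', 'other')
except TypeError as exc:
    print(exc)  # e.g. "Trying to modify an immutable Settings object"
```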
This initialisation order would be great for #1442 (add-on callbacks for spiders) because the spider arguments could easily be mapped into the spider's add-on config.
> It is no longer possible to change the `STATS_CLASS` from the spider at this point.
As well as `LOG_LEVEL` and `LOG_FORMATTER`. I don't think they are often changed from the spider, but this is inconsistent behaviour, and it sucks. Is there any better way?
`SPIDER_LOADER_CLASS` is another setting that already cannot be changed from the spider. (Of course that seems more intuitive than not being able to set log settings.)
I don't really see a smart way to keep supporting changing these settings from the spider. It makes sense to set stats and logging up as soon as the crawler is initialised, so I don't think we can move that to `crawl()`. On the other hand, we cannot move the spider initialisation out of `crawl()` because we don't have access to the spider args before.
For backwards compatibility we could check in `__init__()` whether `update_settings()` is a class method. If it is, we call it there; if not, we call it in `crawl()`. Not really fond of this though :/
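If we did go that route, the check could look roughly like this (hypothetical helper, not part of Scrapy):

```python
def _has_classmethod_update_settings(spidercls):
    """Return True if update_settings is (still) defined as a classmethod."""
    for klass in spidercls.__mro__:
        if 'update_settings' in vars(klass):
            return isinstance(vars(klass)['update_settings'], classmethod)
    return False

# Sketch of the call sites:
#   in Crawler.__init__():
#       if _has_classmethod_update_settings(self.spidercls):
#           self.spidercls.update_settings(self.settings)
#   in Crawler.crawl(), after the spider instance exists:
#       otherwise call self.spider.update_settings(self.settings)
```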
What happens between `Crawler.__init__` and `Crawler.crawl()` in the common case (creating a crawler in `CrawlerRunner`/`CrawlerProcess`)? Is there anything serious which prevents us from moving things from `__init__` to `crawl()` freely?
In other words, why does the crawler need a logger and stats as soon as it is initialised? Can it do anything useful before the `.crawl()` call?
The `ExecutionEngine` copies `crawler.logformatter`. So problems arise when the engine is created outside of `crawl()`, like `scrapy shell` does.
There are many failing tests when I move the stats & log init into `crawl()` (see last commit). Quite a few of them test downloader middlewares: they create a crawler instance, immediately give it to the `from_crawler()` method, then fail because there's no `crawler.stats` attribute. However, in the real Scrapy flow, `stats` would exist by the time `from_crawler()` is called.
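A sketch of the test pattern in question, using Scrapy's test helper (the actual failing tests differ, but the shape is the same):

```python
from scrapy.utils.test import get_crawler
from scrapy.downloadermiddlewares.stats import DownloaderStats

crawler = get_crawler(settings_dict={'DOWNLOADER_STATS': True})
# from_crawler() reaches for crawler.stats; if the stats init has moved into
# crawl(), this raises AttributeError even though the real Scrapy flow would
# have created the stats collector by now.
mw = DownloaderStats.from_crawler(crawler)
```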
> In other words, why does the crawler need a logger and stats as soon as it is initialised?
@kmike One thing that looks important to me: set up logging before `Spider.__init__`. I still remember times when it wasn't possible to log messages during spider initialisation, and that wasn't good; some spiders do quite complex configuration setup there, and logging is essential.
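As a minimal illustration of why that matters, a spider that logs during its own initialisation (sketch using the standard `Spider.logger` property):

```python
import scrapy


class ConfiguredSpider(scrapy.Spider):
    name = 'configured'

    def __init__(self, config_file=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # only works if logging has already been set up by the crawler
        self.logger.info('loading spider configuration from %r', config_file)
        # ...complex configuration setup would go here...
```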
> On the other hand, we cannot move the spider initialisation out of `crawl()` because we don't have access to the spider args before.
@jdemaeyer Maybe it makes sense to pass spider args to `Crawler.__init__`, so that the spider can be instantiated and the settings filled and frozen there, and `Crawler.crawl` will be responsible only for creating the engine and starting the spider with `start_requests`? `crawl()` accepts spider args for historical reasons: it used to be possible to run several spiders using the same `Crawler` instance. Now that's not supported.
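Roughly, that split might look like the sketch below (a hypothetical API, not what this PR or current Scrapy implements; engine creation is stubbed out):

```python
class Crawler:
    def __init__(self, spidercls, settings, *spider_args, **spider_kwargs):
        self.settings = settings.copy()
        # spider args are known up front, so the spider can be built and the
        # settings filled and frozen before crawl() is ever called
        self.spider = spidercls.from_crawler(self, *spider_args, **spider_kwargs)
        self.spider.update_settings(self.settings)
        self.settings.freeze()
        self.engine = None

    def crawl(self):
        # only responsible for creating the engine and starting the spider
        # with its start_requests(); details elided
        raise NotImplementedError
```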
> One thing that looks important to me: set up logging before `Spider.__init__`
`Spider.__init__` is called in `Crawler.crawl`, so if we move the logging setup to `Crawler.crawl` then `Spider.__init__` can still log messages.
Yes, but if we keep the logging setup before `Spider.__init__`, we cannot change `LOG_LEVEL` and `LOG_FORMATTER` from the spider. If we move the logging setup after `Spider.__init__`, we can change those settings, but logging won't work from `Spider.__init__`. I think the first option is better, but I don't like the idea of having special-case settings which cannot be changed from `Spider.__init__` because they are used to set up some object before spider creation. On the other hand, @jdemaeyer is right that we already have such special cases (`SPIDER_LOADER_CLASS`), so maybe that's not a big problem.
I might regret saying this, but now I'm not sure it's a good idea to do settings configuration during/after `Spider.__init__`. That said, I have an idea to share. If we have access to spider arguments in `Crawler.__init__`, we can pass them to a classmethod `update_settings(settings, spider_args)`. In this case the developer can override `Spider.update_settings` and change settings based on spider arguments; that's basically what I wanted in #1305. Does it work for add-ons?
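A sketch of what overriding that hypothetical classmethod could look like from the spider author's side (the `spider_args` parameter is the proposed addition, not the current signature):

```python
import scrapy


class MySpider(scrapy.Spider):
    name = 'example'

    @classmethod
    def update_settings(cls, settings, spider_args):
        # keep the default behaviour (apply custom_settings)
        super().update_settings(settings)
        # then adjust settings based on the raw spider arguments
        if 'close_after' in spider_args:
            settings.set('CLOSESPIDER_TIMEOUT', spider_args['close_after'],
                         priority='spider')
```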
Another option is to do a better job of separating "core" settings, which must not be changed by the spider, from extensions/middlewares/pipelines/add-ons settings that can be changed by the spider.
> If we have access to spider arguments in `Crawler.__init__`, we can pass them to a classmethod `update_settings(settings, spider_args)`. In this case the developer can override `Spider.update_settings` and change settings based on spider arguments; that's basically what I wanted in #1305. Does it work for add-ons?
That would work for add-ons. I think changing the `Crawler` API is quite delicate, though. If we don't do it backwards-compatibly, it's going to break everything that runs spiders by means other than the Scrapy command-line tool. If we do make it backwards-compatible, e.g. by giving the spider args to both `__init__()` and `crawl()`, it feels somewhat dirty (spiders that use the spider args in their `update_settings()` class method will behave differently depending on whether the running tool implements the new API or not).
> Another option is to do a better job of separating "core" settings, which must not be changed by the spider, from extensions/middlewares/pipelines/add-ons settings that can be changed by the spider.
An issue with this is that the core settings could then not be changed in stand-alone spiders (except through the command line). This is still the option I like most, though.
> `crawl()` accepts spider args for historical reasons: it used to be possible to run several spiders using the same `Crawler` instance. Now that's not supported.
Hmm, but running the same spider twice (in the same crawler) with different spider args is still supported (at least some tests do it).
Some settings don't even make sense to change on the `Spider`. For `LOG_LEVEL`, you can create a new logger and use that; for `SPIDER_LOADER_CLASS`, you are already on the spider. Having core settings is too much effort for something that doesn't make sense to be changed at that stage.
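For example, "create a new logger and use that" could look like this (a sketch with the stdlib `logging` module; names are made up):

```python
import logging

import scrapy


class VerboseSpider(scrapy.Spider):
    name = 'verbose'

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # a dedicated logger with its own level, instead of touching LOG_LEVEL
        self.debug_log = logging.getLogger('%s.debug' % self.name)
        self.debug_log.setLevel(logging.DEBUG)

    def parse(self, response):
        self.debug_log.debug('parsing %s', response.url)
```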
Maintaining backwards compatibility for people using Scrapy as a library doesn't make much sense to me, and most of it isn't even documented. I feel these people don't need to update too often either, because they aren't using most of the sugar we add. As for patches on `Downloader` and things like that, they can easily monkey-patch those.
Running two spiders on the same `Crawler` is something that we don't support; if there are hacks to do so, we certainly shouldn't even care about them.
> Another option is to do a better job of separating "core" settings, which must not be changed by the spider, from extensions/middlewares/pipelines/add-ons settings that can be changed by the spider.
I like the idea of separating settings, or at least documenting what can be changed per spider. There are a few other settings which are global, e.g. the thread pool size.
> Running two spiders on the same `Crawler` is something that we don't support; if there are hacks to do so, we certainly shouldn't even care about them.
+1
Current coverage is 82.95%. Merging #1580 into master will increase coverage by +0.07% as of e238733.
If we can settle for:

- some settings being unchangeable from the spider (these shouldn't really be per-spider anyway; essentially only the logging and stats ones), and
- a little backwards incompatibility (`Crawler.extensions` not available after `Crawler.__init__()`),

then initialising the spider and extensions in `crawl()`, as originally proposed in #1305, seems the most elegant solution to me. Bonus points: there are no changes to the call signatures of `Crawler`. The PR is now back to this state (i.e. it only moves three lines from `__init__()` to `crawl()` in `Crawler`).
I've overhauled the tests that had multiple calls to the same crawler's `crawl()` (see #1587), and updated the spider settings tests.
I wonder if we should make `Spider.update_settings()` an instance method instead of a class method. This:

```python
class MySpider(Spider):
    def __init__(self, *args, **kwargs):
        self.custom_settings.update({'XY': 'blah'})
```

won't work otherwise (since `update_settings()` will read the class's `custom_settings`, not the instance's).
@jdemaeyer That last change would totally fix using `custom_settings` when subclassing.
> some settings being unchangeable from the spider
sounds ok if we document this properly
> a little backwards incompatibility (`Crawler.extensions` not available after `Crawler.__init__()`)

sounds ok as well - don't have an idea where `crawler.extensions` might be needed before `crawler.crawl()`
> I wonder if we should make `Spider.update_settings()` an instance method instead of a class method

In addition to your example, it will be possible to update settings this way:
```python
class MySpider(Spider):
    def update_settings(self, settings):
        # for simplicity - imagine that value is validated in __init__
        if self.close_after:
            # intentionally omitted priority='spider'
            settings.set('CLOSESPIDER_TIMEOUT', self.close_after)
```
so one can ignore `custom_settings` and work with the settings object directly. I would definitely use that instead of `custom_settings` because it's simpler and gives much more flexibility.
@kmike this is fairly stale at this point. Given the 7 years of changes on the tool, is this still a relevant PR, or could we close it?
Wow, what a sweet rush of nostalgia washing over me :sweat_smile: