Change extensions/spiders/settings initialisation order
This is an implementation of @chekunkov's suggested changes from #1305. I will upload the commits one by one over the next few minutes so we can see which tests break at which point.
1: Moving the extensions' `__init__` into `Crawler.crawl()`
Two tests fail because they essentially test the implementation that we just changed. The functionality they're testing (spider settings) should still be intact.
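For context, here's a minimal, self-contained sketch of the user-facing behaviour at stake: a setting declared on the spider (`CLOSESPIDER_TIMEOUT`, read by the CloseSpider extension) has to be applied before the extensions are built, which is what moving the extension initialisation into `crawl()` is meant to guarantee. The spider and URL below are made up for illustration.

```python
import scrapy
from scrapy.crawler import CrawlerProcess


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = ['http://quotes.toscrape.com']
    # Applied via Spider.update_settings(); the CloseSpider extension only
    # sees this value if the extensions are created after that call.
    custom_settings = {'CLOSESPIDER_TIMEOUT': 10}

    def parse(self, response):
        for text in response.css('span.text::text').getall():
            yield {'text': text}


if __name__ == '__main__':
    process = CrawlerProcess()
    process.crawl(QuotesSpider)
    process.start()
```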
2: Moving `Spider.update_settings()` into `Crawler.crawl()`
Three new tests fail because we're trying to close a `Downloader` that is not running. This could also be an issue with the new `ExecutionEngine.close()` method?
One of the previously failing tests now fails earlier because the spider settings are not applied in `__init__()`. It is no longer possible to change the `STATS_CLASS` from the spider at this point.
3: Initialise the spider before calling its `update_settings()` method
I have not yet made `update_settings()` an instance method.
No new failing tests.
4: Allow `update_settings()` to be either a class or an instance method
No new failing tests.
The tests that fail with "Tried to stop a LoopingCall that was not running." all have two calls to `Crawler.crawl()`.
I think the real problem is that we're trying to write to frozen settings (or that we try to freeze settings twice). The error then manifests because we hit the `except` clause, where the `self.engine.close()` call tries to close a pre-existing (and already finished) engine.
These three tests pass when I comment out the `self.settings.freeze()` line.
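For reference, writing to frozen settings raises `TypeError` via the public `Settings` API, which is the kind of error that would send us into the `except` clause above (minimal sketch; the exact message may vary between versions):

```python
from scrapy.settings import Settings

settings = Settings({'BOT_NAME': 'demo'})
settings.freeze()

try:
    # a second crawl() attempt writes to the already-frozen settings
    settings.set('BOT_NAME', 'other')
except TypeError as exc:
    print(exc)  # e.g. "Trying to modify an immutable Settings object"
```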
This initialisation order would be great for #1442 (add-on callbacks for spiders) because the spider arguments could easily be mapped into the spider's add-on config.
> It is no longer possible to change the `STATS_CLASS` from the spider at this point.
As well as `LOG_LEVEL` and `LOG_FORMATTER`. I don't think they are often changed from the spider, but this is inconsistent behaviour, and it sucks. Is there any better way?
`SPIDER_LOADER_CLASS` is another setting that already cannot be changed from the spider. (Of course that seems more intuitive than not being able to set log settings.)
I don't really see a smart way to keep supporting changing these settings from the spider. It makes sense to set stats and logging up as soon as the crawler is initialised, so I don't think we can move that to `crawl()`. On the other hand, we cannot move the spider initialisation out of `crawl()` because we don't have access to the spider args before.
For backwards compatibility we could check in `__init__()` whether `update_settings()` is a class method. If it is, we call it there; if not, we call it in `crawl()`. Not really fond of this though :/
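If we did go that route, the check could look roughly like this (hypothetical helper, not part of Scrapy):

```python
def _has_classmethod_update_settings(spidercls):
    """Return True if update_settings is (still) defined as a classmethod."""
    for klass in spidercls.__mro__:
        if 'update_settings' in vars(klass):
            return isinstance(vars(klass)['update_settings'], classmethod)
    return False

# Sketch of the call sites:
#   in Crawler.__init__():
#       if _has_classmethod_update_settings(self.spidercls):
#           self.spidercls.update_settings(self.settings)
#   in Crawler.crawl(), after the spider instance exists:
#       otherwise call self.spider.update_settings(self.settings)
```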
What happens between `Crawler.__init__` and `Crawler.crawl()` in the common case (creating a crawler in `CrawlerRunner`/`CrawlerProcess`)? Is there anything serious which prevents us from moving things from `__init__` to `crawl()` freely?
In other words, why does the crawler need a logger and stats as soon as it is initialised? Can it do anything useful before the `.crawl()` call?
The `ExecutionEngine` copies `crawler.logformatter`. So problems arise when the engine is created outside of `crawl()`, like `scrapy shell` does.
There are many failing tests when I move the stats & log init into `crawl()` (see last commit). Quite a few of them test downloader middlewares: they create a crawler instance, immediately give it to the `from_crawler()` method, then fail because there's no `crawler.stats` attribute. However, in the real Scrapy flow, `stats` would exist by the time `from_crawler()` is called.
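A sketch of the test pattern in question, using Scrapy's test helper (the actual failing tests differ, but the shape is the same):

```python
from scrapy.utils.test import get_crawler
from scrapy.downloadermiddlewares.stats import DownloaderStats

crawler = get_crawler(settings_dict={'DOWNLOADER_STATS': True})
# from_crawler() reaches for crawler.stats; if the stats init has moved into
# crawl(), this raises AttributeError even though the real Scrapy flow would
# have created the stats collector by now.
mw = DownloaderStats.from_crawler(crawler)
```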
> In other words, why does the crawler need a logger and stats as soon as it is initialised?
@kmike One thing that looks important to me: set up logging before `Spider.__init__`. I still remember times when it wasn't possible to log messages during spider initialisation, and that wasn't good; some spiders do quite complex configuration setup there, and logging is essential.
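As a minimal illustration of why that matters, a spider that logs during its own initialisation (sketch using the standard `Spider.logger` property):

```python
import scrapy


class ConfiguredSpider(scrapy.Spider):
    name = 'configured'

    def __init__(self, config_file=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # only works if logging has already been set up by the crawler
        self.logger.info('loading spider configuration from %r', config_file)
        # ...complex configuration setup would go here...
```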
> On the other hand, we cannot move the spider initialisation out of `crawl()` because we don't have access to the spider args before.
@jdemaeyer Maybe it makes sense to pass spider args to `Crawler.__init__`, so that the spider can be instantiated and the settings filled and frozen there, and `Crawler.crawl` will be responsible only for creating the engine and starting the spider with `start_requests`? `crawl()` accepts spider args for historical reasons: it used to be possible to run several spiders using the same `Crawler` instance. Now that's not supported.
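Roughly, that split might look like the sketch below (a hypothetical API, not what this PR or current Scrapy implements; engine creation is stubbed out):

```python
class Crawler:
    def __init__(self, spidercls, settings, *spider_args, **spider_kwargs):
        self.settings = settings.copy()
        # spider args are known up front, so the spider can be built and the
        # settings filled and frozen before crawl() is ever called
        self.spider = spidercls.from_crawler(self, *spider_args, **spider_kwargs)
        self.spider.update_settings(self.settings)
        self.settings.freeze()
        self.engine = None

    def crawl(self):
        # only responsible for creating the engine and starting the spider
        # with its start_requests(); details elided
        raise NotImplementedError
```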
> One thing that looks important to me: set up logging before `Spider.__init__`
`Spider.__init__` is called in `Crawler.crawl`, so if we move the logging setup to `Crawler.crawl` then `Spider.__init__` can still log messages.
Yes, but if we keep the logging setup before `Spider.__init__`, we cannot change `LOG_LEVEL` and `LOG_FORMATTER` from the spider. If we move the logging setup after `Spider.__init__`, we can change those settings, but logging won't work from `Spider.__init__`. I think the first option is better, but I don't like the idea of having special-case settings which cannot be changed from `Spider.__init__` because they are used to set up some object before spider creation. On the other hand, @jdemaeyer is right that we already have such special cases (`SPIDER_LOADER_CLASS`), so maybe that's not a big problem.
I might regret saying this, but now I'm not sure it's a good idea to do settings configuration during/after `Spider.__init__`. That said, I have an idea to share. If we have access to spider arguments in `Crawler.__init__`, we can pass them to a classmethod `update_settings(settings, spider_args)`. In this case the developer can override `Spider.update_settings` and change settings based on spider arguments; that's basically what I wanted in #1305. Does it work for add-ons?
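A sketch of what overriding that hypothetical classmethod could look like from the spider author's side (the `spider_args` parameter is the proposed addition, not the current signature):

```python
import scrapy


class MySpider(scrapy.Spider):
    name = 'example'

    @classmethod
    def update_settings(cls, settings, spider_args):
        # keep the default behaviour (apply custom_settings)
        super().update_settings(settings)
        # then adjust settings based on the raw spider arguments
        if 'close_after' in spider_args:
            settings.set('CLOSESPIDER_TIMEOUT', spider_args['close_after'],
                         priority='spider')
```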
Another option is to do a better job of separating "core" settings, which must not be changed by the spider, from extensions/middlewares/pipelines/add-ons settings that can be changed by the spider.
> If we have access to spider arguments in `Crawler.__init__`, we can pass them to a classmethod `update_settings(settings, spider_args)`. In this case the developer can override `Spider.update_settings` and change settings based on spider arguments; that's basically what I wanted in #1305. Does it work for add-ons?
That would work for add-ons. I think changing the `Crawler` API is quite delicate, though. If we don't do it backwards-compatibly, it's going to break everything that runs spiders by means other than the Scrapy command-line tool. If we do make it backwards-compatible, e.g. by giving the spider args to both `__init__()` and `crawl()`, it feels somewhat dirty (spiders that use the spider args in their `update_settings()` class method will behave differently depending on whether the running tool implements the new API or not).
> Another option is to do a better job of separating "core" settings, which must not be changed by the spider, from extensions/middlewares/pipelines/add-ons settings that can be changed by the spider.
An issue with this is that the core settings could then not be changed in stand-alone spiders (except through the command line). This is still the option I like most, though.
> `crawl()` accepts spider args for historical reasons: it used to be possible to run several spiders using the same `Crawler` instance. Now that's not supported.
Hmm, but running the same spider twice (in the same crawler) with different spider args is still supported (at least some tests do it).
Some settings don't even make sense to change on the `Spider`. For `LOG_LEVEL`, you can create a new logger and use that; for `SPIDER_LOADER_CLASS`, you are already on the spider. Having core settings is too much effort for something that doesn't make sense to be changed at that stage.
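For example, "create a new logger and use that" could look like this (a sketch with the stdlib `logging` module; names are made up):

```python
import logging

import scrapy


class VerboseSpider(scrapy.Spider):
    name = 'verbose'

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # a dedicated logger with its own level, instead of touching LOG_LEVEL
        self.debug_log = logging.getLogger('%s.debug' % self.name)
        self.debug_log.setLevel(logging.DEBUG)

    def parse(self, response):
        self.debug_log.debug('parsing %s', response.url)
```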
Maintaining backwards compatibility for people using Scrapy as a library doesn't make much sense to me, and most of it isn't even documented. I feel these people don't need to update too often either, because they aren't using most of the sugar we add. As for patches on `Downloader` and things like that, they can easily monkey-patch those.
Running two spiders on the same `Crawler` is something that we don't support; if there are hacks to do so, we certainly shouldn't even care about them.
> Another option is to do a better job of separating "core" settings, which must not be changed by the spider, from extensions/middlewares/pipelines/add-ons settings that can be changed by the spider.
I like the idea of separating settings, or at least documenting what can be changed per spider. There are a few other settings which are global, e.g. the thread pool size.
> Running two spiders on the same `Crawler` is something that we don't support; if there are hacks to do so, we certainly shouldn't even care about them.
+1
Current coverage is 82.95%. Merging #1580 into master will increase coverage by +0.07% as of e238733.
If we can settle for:

- some settings being unchangeable from the spider (these shouldn't really be per-spider anyway; essentially only the logging and stats ones), and
- a little backwards incompatibility (`Crawler.extensions` not available after `Crawler.__init__()`),

then initialising the spider and extensions in `crawl()`, as originally proposed in #1305, seems the most elegant solution to me. Bonus points: there are no changes to the call signatures of `Crawler`. The PR is now back to this state (i.e. it only moves three lines from `__init__()` to `crawl()` in `Crawler`).
I've overhauled the tests that had multiple calls to the same crawler's `crawl()` (see #1587), and updated the spider settings tests.
I wonder if we should make `Spider.update_settings()` an instance method instead of a class method. This:

```python
class MySpider(Spider):
    def __init__(self, *args, **kwargs):
        self.custom_settings.update({'XY': 'blah'})
```

won't work otherwise (since `update_settings()` will read the class's `custom_settings`, not the instance's).
@jdemaeyer That last change would totally fix using `custom_settings` when subclassing.
> some settings being unchangeable from the spider
sounds ok if we document this properly
> a little backwards incompatibility (`Crawler.extensions` not available after `Crawler.__init__()`)

sounds ok as well - don't have an idea where `crawler.extensions` might be needed before `crawler.crawl()`
> I wonder if we should make `Spider.update_settings()` an instance method instead of a class method

In addition to your example, it will be possible to update settings this way:
```python
class MySpider(Spider):
    def update_settings(self, settings):
        # for simplicity - imagine that value is validated in __init__
        if self.close_after:
            # intentionally omitted priority='spider'
            settings.set('CLOSESPIDER_TIMEOUT', self.close_after)
```
so one can ignore `custom_settings` and work with the settings object directly. I would definitely use that instead of `custom_settings` because it's simpler and gives much more flexibility.
@kmike this is fairly stale at this point. Given the 7 years of changes on the tool, is this still a relevant PR, or could we close it?
Wow, what a sweet rush of nostalgia washing over me :sweat_smile: