
Prioritize command line option for SPIDER_PARTITION_ID

Open lljrsr opened this issue 9 years ago • 12 comments

Right now frontera recommends setting the PARTITION_ID in a separate Python settings file for each spider / worker. However, when shipping the project it would be nice to have a command line option to pass either a config file or the PARTITION_ID of the worker/spider. The separate settings file would then no longer be needed, which would make frontera more flexible and easier to deploy and use. Since supporting config files might need big changes in the project, I recommend adding a command line option to choose the PARTITION_ID. Do you think that would be a good addition? Or is something like this already available, so that the feature is not needed?

lljrsr avatar Feb 03 '16 16:02 lljrsr

Hey @lljrsr long time no see ;) Try

$ scrapy crawl [your spider] -s SPIDER_PARTITION_ID=[number]

It should work, because Frontera can be configured through Scrapy settings (see the docs for more details).

sibiryakov avatar Feb 03 '16 16:02 sibiryakov

Hi. Yes I had lots of other stuff to do :) .

scrapy crawl [my spider] -s FRONTERA_SETTINGS=[my project].frontier.spider_settings -s SPIDER_PARTITION_ID=0

..does not work. It throws:

exceptions.TypeError: int() argument must be a string or a number, not 'NoneType'

..when trying to use the partition_id
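
A minimal sketch of how this kind of error can arise, assuming the partition ID is read with a lookup that returns None when the option never reaches the settings object. The `read_partition_id` helper and the plain dict are hypothetical and only stand in for the real settings class:

```python
def read_partition_id(settings):
    """Return the partition ID as an int, failing with a clear message if it is missing."""
    raw = settings.get("SPIDER_PARTITION_ID")  # None when the option was never passed through
    if raw is None:
        # Without this check, int(None) raises the TypeError quoted above.
        raise KeyError("SPIDER_PARTITION_ID is not set")
    return int(raw)


# The value arrives as a string when given on the command line.
print(read_partition_id({"SPIDER_PARTITION_ID": "0"}))

# Calling int() directly on a missing value reproduces the failure mode.
try:
    int(None)
except TypeError as exc:
    print("TypeError:", exc)
```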

lljrsr avatar Feb 03 '16 17:02 lljrsr

The option value isn't being passed. Can you investigate that? Does the same thing happen without -s FRONTERA_SETTINGS?

sibiryakov avatar Feb 04 '16 18:02 sibiryakov

Yes, it throws the same error when I use:

scrapy crawl [my spider] -s SPIDER_PARTITION_ID=0

My guess is that there is a difference between scrapy settings (e.g. SEEDS_SOURCE, FRONTERA_SETTINGS) and frontera settings (e.g. ZMQ_HOSTNAME, SPIDER_PARTITION_ID) and it is not possible to pass frontera settings.

EDIT: I found out that you use the scrapy settings class in, e.g., this file, whereas in this file, for example, you use a frontera settings class. (I just added print settings after those lines to compare them.)

lljrsr avatar Feb 05 '16 14:02 lljrsr

It's connected with this https://github.com/scrapinghub/frontera/pull/105

sibiryakov avatar Feb 11 '16 11:02 sibiryakov

With the newest update it now uses the correct SPIDER_PARTITION_ID in messagebus.py. However it still throws an error (but a different one):

...
  File "/home/jrisr/Crawl/debug/frontera/frontera/core/manager.py", line 24, in __init__
    self._backend = self._load_backend(backend, db_worker, strategy_worker)
  File "/home/jrisr/Crawl/debug/frontera/frontera/core/manager.py", line 62, in _load_backend
    return cls.from_manager(self)
  File "/home/jrisr/Crawl/debug/frontera/frontera/contrib/backends/remote/messagebus.py", line 28, in from_manager
    return clas(manager)
  File "/home/jrisr/Crawl/debug/frontera/frontera/contrib/backends/remote/messagebus.py", line 21, in __init__
    self.consumer = spider_feed.consumer(partition_id=self.partition_id)
  File "/home/jrisr/Crawl/debug/frontera/frontera/contrib/messagebus/zeromq/__init__.py", line 179, in consumer
    return Consumer(self.context, self.out_location, partition_id, 'sf', seq_warnings=True, hwm=self.consumer_hwm)
  File "/home/jrisr/Crawl/debug/frontera/frontera/contrib/messagebus/zeromq/__init__.py", line 21, in __init__
    filter = identity + pack('>B', partition_id) if partition_id is not None else identity
struct.error: cannot convert argument to integer

This is probably because self.partition_id of MessageBusBackend is a string when it is passed via the command line option, and pack('>B', ...) only accepts an integer.
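
A small sketch of that failure and one possible fix, assuming the value simply needs to be coerced with int() before packing. The `build_filter` helper and the `b"sf"` prefix are illustrative names, not the actual frontera code:

```python
from struct import pack, error as StructError


def build_filter(identity, partition_id):
    """Build a subscription filter, coercing partition_id to int before packing."""
    if partition_id is None:
        return identity
    # int() accepts both "0" (from the command line) and 0 (from a settings file).
    return identity + pack(">B", int(partition_id))


# Packing a string directly fails, which matches the traceback above.
try:
    pack(">B", "0")
except StructError as exc:
    print("struct.error:", exc)

# After coercion, string and integer inputs produce the same filter.
print(build_filter(b"sf", "0") == build_filter(b"sf", 0))
```

Note that the ">B" format holds a single unsigned byte, so this sketch only works for partition IDs from 0 to 255.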

lljrsr avatar Feb 12 '16 14:02 lljrsr

https://github.com/scrapinghub/frontera/pull/110

sibiryakov avatar Feb 12 '16 19:02 sibiryakov

Should be fine now. Please reopen in case of problems.

sibiryakov avatar Feb 12 '16 19:02 sibiryakov

Passing the settings via command line works now, but the settings.py takes precedence over the command line options, which should not be the case according to scrapy docs. I would like to reopen this issue but either I do not know how or I am not able to :P .

lljrsr avatar Feb 15 '16 11:02 lljrsr

The FRONTERA_SETTINGS module isn't connected with Scrapy in any way, so Frontera's settings take precedence. See http://frontera.readthedocs.org/en/latest/topics/scrapy-integration.html#frontier-scrapy-settings

sibiryakov avatar Feb 15 '16 11:02 sibiryakov

Okay. So the FRONTERA_SETTINGS have precedence over all the scrapy settings (including the command line settings). In my opinion it would be a good idea to mention that in the docs. However, I think this is a strange design. Command line settings usually have the highest priority, since they provide an easy way for a user to try out some values before storing them in a config file.

lljrsr avatar Feb 15 '16 11:02 lljrsr

It is in the docs: http://frontera.readthedocs.org/en/latest/topics/scrapy-integration.html#defining-frontier-settings-via-scrapy-settings

Frontera is designed to be used independently from Scrapy, so historically it has had its own settings. Scrapy's settings have since evolved, and it's now possible to tell which of them were set on the command line. Prioritizing the command line over FRONTERA_SETTINGS can therefore be done, and I think it makes sense.
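
Roughly, the idea could look like the sketch below. It is a standalone illustration, not Frontera or Scrapy code: the `PrioritizedSettings` class and the priority names and numbers are made up for the example.

```python
# Hypothetical priority levels: a higher number wins.
PRIORITIES = {"default": 0, "frontera_module": 10, "cmdline": 20}


class PrioritizedSettings:
    """Keep, for each setting, the value from the highest-priority source."""

    def __init__(self):
        self._values = {}  # name -> (priority, value)

    def set(self, name, value, source):
        priority = PRIORITIES[source]
        current = self._values.get(name)
        # Only overwrite when the new source ranks at least as high as the stored one.
        if current is None or priority >= current[0]:
            self._values[name] = (priority, value)

    def get(self, name, default=None):
        entry = self._values.get(name)
        return default if entry is None else entry[1]


settings = PrioritizedSettings()
settings.set("SPIDER_PARTITION_ID", 0, "default")
settings.set("SPIDER_PARTITION_ID", 2, "cmdline")
# Loaded later, but ranked lower, so the command line value is kept.
settings.set("SPIDER_PARTITION_ID", 1, "frontera_module")
print(settings.get("SPIDER_PARTITION_ID"))  # prints 2
```

With this ordering, the load order of the sources no longer matters; only their rank does.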

sibiryakov avatar Feb 15 '16 11:02 sibiryakov