scrapyd
Configurable spider queue class
I have started to implement a custom job queue so I can use a shared job queue backed by Postgres. I tried to use the SPIDER_QUEUE_CLASS setting, but I found that it is missing from Scrapyd. Is it possible to bring this setting back? I want to avoid patching the Scrapyd code. I think reintroducing this feature is important for anyone who wants to use Scrapyd in a multi-server environment.
I summarize my reply from the mailing list here (only for the record).

The SPIDER_QUEUE_CLASS setting dates from the time when Scrapyd was part of Scrapy, not only as a package but as a module:
https://github.com/scrapy/scrapy/commit/75e2c3eb338ea03e487907fa8c99bb12317e9435
Unfortunately, the release notes do not cover all the details of Scrapyd's separation into its own module (and covering them all would probably be impractical). Scrapyd itself never had this setting.
It seems easy enough to implement such a setting:
diff --git a/scrapyd/default_scrapyd.conf b/scrapyd/default_scrapyd.conf
index 0da344f..e2b0c35 100644
--- a/scrapyd/default_scrapyd.conf
+++ b/scrapyd/default_scrapyd.conf
@@ -15,4 +15,5 @@ runner = scrapyd.runner
application = scrapyd.app.application
launcher = scrapyd.launcher.Launcher
+spiderqueue = scrapyd.spiderqueue.SqliteSpiderQueue
webroot = scrapyd.website.Root
diff --git a/scrapyd/utils.py b/scrapyd/utils.py
index 602a726..c96add0 100644
--- a/scrapyd/utils.py
+++ b/scrapyd/utils.py
@@ -9,8 +9,11 @@ import json
from twisted.web import resource
-from scrapyd.spiderqueue import SqliteSpiderQueue
+from scrapy.utils.misc import load_object
from scrapyd.config import Config
+DEFAULT_SPIDERQUEUE = 'scrapyd.spiderqueue.SqliteSpiderQueue'
+
+
class JsonResource(resource.Resource):
@@ -57,8 +60,9 @@ def get_spider_queues(config):
if not os.path.exists(dbsdir):
os.makedirs(dbsdir)
+ spiderqueue = load_object(config.get('spiderqueue', DEFAULT_SPIDERQUEUE))
d = {}
for project in get_project_list(config):
dbpath = os.path.join(dbsdir, '%s.db' % project)
- d[project] = SqliteSpiderQueue(dbpath)
+ d[project] = spiderqueue(dbpath)
return d
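With the patch above, any replacement class only needs to accept the database path in its constructor and expose the same interface as SqliteSpiderQueue (add, pop, count, list, remove, clear). As a minimal illustration only (the class name and the in-memory backing are hypothetical, not part of the patch), a drop-in queue could look like this:

```python
import heapq
import itertools


class MemorySpiderQueue:
    """Hypothetical in-memory queue with the same interface as
    SqliteSpiderQueue. The `database` argument is accepted (and ignored)
    so the constructor matches the spiderqueue(dbpath) call above."""

    def __init__(self, database=None):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker for equal priorities

    def add(self, name, **spider_args):
        msg = spider_args.copy()
        msg['name'] = name
        priority = float(msg.pop('priority', 0))
        # heapq is a min-heap, so negate the priority: higher runs first.
        heapq.heappush(self._heap, (-priority, next(self._counter), msg))

    def pop(self):
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]

    def count(self):
        return len(self._heap)

    def list(self):
        return [msg for _, _, msg in sorted(self._heap)]

    def remove(self, func):
        before = len(self._heap)
        self._heap = [item for item in self._heap if not func(item[2])]
        heapq.heapify(self._heap)
        return before - len(self._heap)

    def clear(self):
        self._heap = []
```

A Postgres-backed queue for the multi-server case would implement the same six methods against a shared table instead of an in-process heap.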
I notice that the built-in spider queue should be merged with the SQLite priority queue, which may itself eventually be replaced by https://github.com/scrapy/queuelib. Such a plan shouldn't prevent us from introducing the above config option, as long as the interface remains the same.
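For reference, the scrapy.utils.misc.load_object helper used in the patch simply resolves a dotted path to a Python object. A simplified sketch of what it does (this reimplementation is illustrative, not Scrapy's actual code; the stdlib path is used so the snippet is self-contained):

```python
from importlib import import_module


def load_object(path):
    """Resolve a dotted path like 'scrapyd.spiderqueue.SqliteSpiderQueue'
    to the object it names (simplified version of Scrapy's helper)."""
    module_path, _, name = path.rpartition('.')
    module = import_module(module_path)
    return getattr(module, name)


# Stdlib example: resolve a class by its dotted path, then instantiate it.
queue_cls = load_object('collections.deque')
queue = queue_cls()
```

This is why the config option composes cleanly: any importable class path in scrapyd.conf resolves to the queue class at startup.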
An update on my last comment about replacing this queue module with scrapy/queuelib. There are more projects to consider:
https://github.com/balena/python-pqueue
https://github.com/peter-wangxu/persist-queue
Discussion of replacing the default spider queue has moved to #475.
Noting that this issue was postponed "in favour of" the 3rd solution (unify queues/dbs) in https://github.com/scrapy/scrapyd/issues/187 - see https://github.com/scrapy/scrapyd/pull/201#issuecomment-485489769
However, #187 has gone nowhere since 2016, and if we do decide to make breaking changes, we can simply do so in a major version. So, this is no longer postponed.