SQLite queue is using all CPU on high frequency poller (<1s)

Open pspsdev opened this issue 2 years ago • 12 comments

When running spiders that do nothing at all, the SQLite-based poller uses all available CPU just reading scheduled tasks. It would be good to have plug-and-play alternative queues, such as Redis.

pspsdev avatar Mar 08 '23 13:03 pspsdev

Related: https://github.com/scrapy/scrapyd/issues/197

pspsdev avatar Mar 08 '23 14:03 pspsdev

Why are you running spiders that "do nothing at all"?

jpmckinney avatar Mar 08 '23 15:03 jpmckinney

@jpmckinney Just to rule out that the CPU is being used by a spider. This can be replicated by scheduling a lot of jobs with a polling interval below one second, e.g. 0.1. The SQLite queue will then use a massive amount of CPU.

pspsdev avatar Mar 08 '23 15:03 pspsdev

There are also some unmaintained repos that try to solve this: https://github.com/speakol-ads/scrapyd-redis

Put simply, the SQLite queue is a really bad option for high-frequency queues.

pspsdev avatar Mar 08 '23 15:03 pspsdev

Hmm, yeah, same with https://github.com/Tiago-Lira/scrapyd-mongodb (from which scrapyd-redis is forked) and https://github.com/balena/python-pqueue (mentioned in #197).

https://github.com/peter-wangxu/persist-queue is still active, though maybe a first attempt is to switch to https://github.com/scrapy/queuelib as mentioned in #197.

Can you share your setup for reproducing the issue?

jpmckinney avatar Mar 08 '23 15:03 jpmckinney

I will try to create a demo later, but it can pretty much be an empty scrapyd service running one spider that does nothing. Then create around 50 schedules per second and set the polling interval to 0.1. It will saturate even a powerful CPU.
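A rough sketch of the load described above, for anyone who wants to try reproducing it. It assumes Scrapyd is listening on its default local address, that `poll_interval = 0.1` is set in `scrapyd.conf`, and that a project `demo` with a do-nothing spider `noop` is already deployed. The project name, spider name, and helper function names are hypothetical. The achieved rate will be somewhat below the target, because each request takes time on top of the sleep.

```python
import time
import urllib.parse
import urllib.request

SCRAPYD_URL = "http://localhost:6800"  # assumption: default local Scrapyd address


def build_schedule_request(base_url, project, spider):
    """Build a POST request for Scrapyd's schedule.json endpoint."""
    data = urllib.parse.urlencode({"project": project, "spider": spider}).encode()
    return urllib.request.Request(f"{base_url}/schedule.json", data=data, method="POST")


def schedule_burst(send, jobs_per_second, seconds, sleep=time.sleep):
    """Call send() roughly jobs_per_second times per second for `seconds` seconds.

    Returns the number of jobs submitted.
    """
    interval = 1.0 / jobs_per_second
    total = int(jobs_per_second * seconds)
    for _ in range(total):
        send()
        sleep(interval)
    return total


def main():
    # Hypothetical project and spider names; replace them with your own.
    def send():
        request = build_schedule_request(SCRAPYD_URL, "demo", "noop")
        with urllib.request.urlopen(request) as response:
            response.read()

    # About 50 schedules per second for 10 seconds, as in the description above.
    schedule_burst(send, jobs_per_second=50, seconds=10)
```

With Scrapyd running, call `main()` and watch the CPU usage of the Scrapyd process while the queue fills up.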

pspsdev avatar Mar 08 '23 15:03 pspsdev

Also, in my personal opinion, it would make sense to add an interface for plugging in your own queue backend, instead of relying on hacks like the two repos mentioned above.

pspsdev avatar Mar 08 '23 15:03 pspsdev

Later, SQLite can be switched out for some other default if needed, but having a simple way to replace the queue yourself would be a very good option for quickly solving this problem for those who use high-frequency polling.
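To make the idea concrete, here is a minimal sketch of what a replaceable queue backend could look like. The method names (`add`, `pop`, `count`, `list`, `remove`, `clear`) are modeled on Scrapyd's `ISpiderQueue` interface as I understand it; the class name and the no-argument constructor are assumptions for illustration. It keeps everything in process memory, so queued jobs are lost on restart, and it is not a drop-in replacement for the SQLite queue.

```python
import heapq
import itertools


class InMemorySpiderQueue:
    """Hypothetical priority queue held in memory; contents are lost on restart."""

    def __init__(self):
        self._heap = []
        # Monotonic counter: keeps FIFO order among messages with equal priority.
        self._counter = itertools.count()

    def add(self, name, priority=0.0, **spider_args):
        message = {"name": name, **spider_args}
        # Negate the priority so that higher priorities are popped first.
        heapq.heappush(self._heap, (-priority, next(self._counter), message))

    def pop(self):
        """Return the highest-priority message, or None if the queue is empty."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]

    def count(self):
        return len(self._heap)

    def list(self):
        """Return all pending messages in the order they would be popped."""
        return [message for _, _, message in sorted(self._heap)]

    def remove(self, func):
        """Remove every message for which func(message) is true; return how many."""
        kept = [entry for entry in self._heap if not func(entry[2])]
        removed = len(self._heap) - len(kept)
        heapq.heapify(kept)
        self._heap = kept
        return removed

    def clear(self):
        self._heap = []
```

Here `pop()` on an empty queue is a cheap in-memory check, so polling every 0.1 seconds costs very little. How a class like this would be registered with Scrapyd depends on the version and on how the pluggable-queue support ends up working.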

pspsdev avatar Mar 08 '23 15:03 pspsdev

Do you have your own queue ready to use? You can try it with this PR: https://github.com/scrapy/scrapyd/pull/476

jpmckinney avatar Mar 08 '23 16:03 jpmckinney

@jpmckinney Thanks, give me a few hours and I will try it out.

pspsdev avatar Mar 08 '23 16:03 pspsdev

@pspsdev Now that #476 is merged, do you have suggestions for how to edit the default spider queue, or should there just be a note in the documentation that it doesn't perform well under high frequency polling, and a custom queue would be better?

jpmckinney avatar Mar 10 '23 16:03 jpmckinney

@jpmckinney I am still doing some tests on my end. Give me a few days and I will report back with more details.

pspsdev avatar Mar 10 '23 16:03 pspsdev