scrapy-redis
Item pipelines are slowed down
I'm using scrapy-redis so that a crawl can recover from interruptions, but my item pipelines have become significantly slower. Further details are in this Stack Overflow question: https://stackoverflow.com/questions/63026873/scarpy-redis-slows-down-item-pipelines Thanks for your attention!
What does your pipeline look like? Are you using non-blocking operations?
I used deferToThread, as pipeline.py in the scrapy-redis source code does, but it didn't help. However, I have a new finding. I tried implementing my own Redis-based dupefilter. When I override Scrapy's dupefilter in settings.py and add the URL fingerprint to Redis in the process_response method of a downloader middleware, the item pipeline is still very slow. So my guess is that the slowdown comes from the DUPEFILTER_CLASS interface provided by Scrapy, not from scrapy-redis. I then moved all of the deduplication into the downloader middleware: I check the URL fingerprint in process_request and add it in process_response, and now the item pipeline's speed is back to normal. I'm not very familiar with Scrapy, so I can't explain the mechanism behind the problem. If you have an explanation or a better solution, please let me know. Thanks a lot.
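For anyone who wants to try the same workaround, here is a minimal sketch of what such a middleware might look like. It is not my exact code: the class name `RedisDedupMiddleware`, the Redis key `dedup:fingerprints`, and the simplified `url_fingerprint` helper (a SHA-1 of method and URL, not Scrapy's own request fingerprint) are all assumptions made for illustration. The `try`/`except` around the Scrapy import is only there so the snippet can be run without Scrapy installed.

```python
import hashlib

try:
    from scrapy.exceptions import IgnoreRequest
except ImportError:
    # Stand-in so the sketch can be exercised without Scrapy installed.
    class IgnoreRequest(Exception):
        """Placeholder for scrapy.exceptions.IgnoreRequest."""


def url_fingerprint(method, url):
    """Simplified fingerprint (assumption): SHA-1 of the HTTP method and URL."""
    return hashlib.sha1(f"{method.upper()} {url}".encode("utf-8")).hexdigest()


class RedisDedupMiddleware:
    """Hypothetical downloader middleware that deduplicates by URL fingerprint."""

    def __init__(self, server, key="dedup:fingerprints"):
        # `server` is any object with Redis-style sismember/sadd methods.
        self.server = server
        self.key = key

    @classmethod
    def from_crawler(cls, crawler):
        # Imported lazily: only needed when the middleware runs under Scrapy.
        import redis

        url = crawler.settings.get("REDIS_URL", "redis://localhost:6379")
        return cls(redis.Redis.from_url(url))

    def process_request(self, request, spider):
        # Respect requests that explicitly opt out of filtering.
        if getattr(request, "dont_filter", False):
            return None
        fp = url_fingerprint(request.method, request.url)
        if self.server.sismember(self.key, fp):
            raise IgnoreRequest(f"duplicate request: {request.url}")
        return None  # let the request continue to the downloader

    def process_response(self, request, response, spider):
        # Record the fingerprint only once a response has actually arrived.
        self.server.sadd(self.key, url_fingerprint(request.method, request.url))
        return response
```

To use something like this, the class would be listed under DOWNLOADER_MIDDLEWARES in settings.py. Because a fingerprint is stored only after a response arrives, two identical requests that are in flight at the same time can both get through, which was acceptable in my case.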