
Is there a way to stop the spider from checking duplicates with Redis?

Open milkeasd opened this issue 2 years ago • 7 comments

My spider was extremely slow when run with scrapy-redis, because there is a big delay between the slave and the master. I want to reduce the communication to only fetching the start_urls periodically, or when all start_urls are done. Is there any way to do this?

Moreover, I want to disable the duplicate check to reduce the number of connections.

However, I can't change DUPEFILTER_CLASS to the Scrapy default one; doing so raises an error.

Is there any other way to disable the duplicate check?

Or do you have any ideas that could help speed up the process?

Thanks

milkeasd avatar Apr 02 '22 20:04 milkeasd
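For context, a minimal sketch of the settings pairing involved, assuming a standard scrapy-redis setup (the REDIS_URL value is a placeholder). scrapy-redis ships its own dupefilter, and its scheduler expects that implementation, which is likely why swapping in Scrapy's default DUPEFILTER_CLASS raises an error:

```python
# settings.py -- a minimal sketch of the documented scrapy-redis pairing.
# Use the scrapy-redis scheduler so requests are queued in Redis.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

# scrapy-redis provides its own RFPDupeFilter; its scheduler expects this
# implementation rather than Scrapy's default (an assumption based on the
# error described above; verify against your installed version).
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Placeholder Redis endpoint; adjust to your deployment.
REDIS_URL = "redis://localhost:6379"
```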

@Germey Any ideas?

LuckyPigeon avatar Apr 03 '22 02:04 LuckyPigeon

@milkeasd Could you provide related code files?

LuckyPigeon avatar Apr 03 '22 03:04 LuckyPigeon

The way I see it, letting developers customize their communication rules and adding a disable option for DUPEFILTER_CLASS could be two great features.

LuckyPigeon avatar Apr 03 '22 05:04 LuckyPigeon

@milkeasd To disable DUPEFILTER_CLASS, try this: https://stackoverflow.com/questions/23131283/how-to-force-scrapy-to-crawl-duplicate-url
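For reference, a minimal sketch of what that Stack Overflow answer suggests, applied to a scrapy-redis spider (the spider name and redis_key below are placeholders): passing dont_filter=True on a Request makes the scheduler skip the dupefilter check for that request, which also avoids the corresponding fingerprint lookup against Redis.

```python
# A minimal sketch based on the linked Stack Overflow answer; the spider
# name and redis_key are hypothetical placeholders.
from scrapy import Request
from scrapy_redis.spiders import RedisSpider

class NoDupeSpider(RedisSpider):
    name = "nodupe"                  # hypothetical spider name
    redis_key = "nodupe:start_urls"  # hypothetical Redis key for start URLs

    def make_request_from_data(self, data):
        # Requests flagged with dont_filter=True bypass the scheduler's
        # dupefilter check, so no fingerprint lookup hits Redis.
        url = data.decode("utf-8")
        return Request(url, callback=self.parse, dont_filter=True)

    def parse(self, response):
        # Placeholder parse; follow-up requests would also need
        # dont_filter=True if they should skip the check.
        yield {"url": response.url}
```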

LuckyPigeon avatar Apr 08 '22 16:04 LuckyPigeon

@milkeasd Could you please provide your code or some sample code?

Germey avatar Apr 09 '22 06:04 Germey