scrapy-crawl-once
Scrapy middleware that makes it possible to crawl only new content
scrapy-crawl-once has no built-in way of clearing out all seen requests via settings.
Aims to fix #4. Adds a setting similar to DELTAFETCH_RESET. Expected usage: in settings.py, `CRAWL_ONCE_RESET = True`; or in the terminal, `scrapy crawl spider_name -a crawl_once_reset=True`. If True, `SqliteDict.clear()` is called on...
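The reset behaviour described in this pull request could look roughly like the sketch below. It is not the code from the PR: the helper names `wants_reset` and `maybe_reset` are hypothetical, and letting the spider argument take precedence over the setting is an assumption. The one detail worth illustrating is that arguments passed with `-a` reach the spider as strings, so `"True"` has to be parsed rather than tested for truthiness.

```python
from types import SimpleNamespace  # only used to fake a spider object when trying this out


def wants_reset(settings, spider):
    """Return True if the spider argument or the setting asks for a reset.

    Hypothetical helper; the precedence (spider argument first) is an assumption.
    """
    # Values passed with `-a crawl_once_reset=True` arrive as strings.
    arg = getattr(spider, "crawl_once_reset", None)
    if arg is not None:
        return str(arg).strip().lower() in {"1", "true", "yes"}
    # Fall back to the project-level setting from settings.py.
    return bool(settings.get("CRAWL_ONCE_RESET", False))


def maybe_reset(db, settings, spider):
    """Clear the store of seen requests when a reset is requested."""
    if wants_reset(settings, spider):
        # Any mutable mapping works here; SqliteDict also provides clear().
        db.clear()
        return True
    return False
```

With a plain `dict` standing in for the database, `maybe_reset(db, {"CRAWL_ONCE_RESET": True}, SimpleNamespace())` empties `db`, while a spider created with `crawl_once_reset="false"` leaves it untouched even when the setting is enabled.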
TODO:
* [x] DB object
* [ ] allow to inject DB to callbacks
* [ ] tests
* [x] docs
* [ ] check if old Pythons need to...
I have a spider that crawls only detail pages, and they are never skipped by this middleware.
Fixes Issue #6
After upgrading Scrapy, the following warning occurs on every request that uses crawl_once: ``` 2022-10-28 15:54:21 [py.warnings] WARNING: /scrapyd/venv/lib/python3.9/site-packages/scrapy_crawl_once/middlewares.py:96: ScrapyDeprecationWarning: Call to deprecated function scrapy.utils.request.request_fingerprint(). If you are using this...