autoextract-spiders
Pre-built Scrapy spiders for AutoExtract
Updates the requirements on [pyyaml](https://github.com/yaml/pyyaml) to permit the latest version. Updates `pyyaml` to 6.0.1. Changelog (sourced from pyyaml's changelog): 6.0.1 (2023-07-18): yaml/pyyaml#702 -- pin Cython build dep to < 3.0...
See issue https://github.com/scrapinghub/autoextract-spiders/issues/6. Usage: `scrapy crawl articles -a seeds=... -a dates=2019-11 ...` Or, with a list of dates: `scrapy crawl articles -a seeds=... -a dates=['2019-09', '2019-10'] ...` Any rule...
When URLs are discovered from different seeds, a URL that appears under more than one seed is not deduplicated. There is local de-duplication during discovery, and there's also the built-in DupeFilters....
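A minimal sketch of what cross-seed deduplication could look like, assuming discovery produces a mapping from each seed to the URLs found under it. The function name `dedupe_across_seeds` and the input shape are hypothetical and are not taken from the autoextract-spiders code base.

```python
from typing import Dict, Iterable, Iterator, Set, Tuple


def dedupe_across_seeds(
    discovered: Dict[str, Iterable[str]],
) -> Iterator[Tuple[str, str]]:
    """Yield (seed, url) pairs, skipping URLs already seen under an earlier seed.

    `discovered` is a hypothetical structure: seed URL -> URLs found from it.
    """
    seen: Set[str] = set()  # shared across all seeds, not reset per seed
    for seed, urls in discovered.items():
        for url in urls:
            if url in seen:
                continue  # already emitted for an earlier seed (or earlier in this one)
            seen.add(url)
            yield seed, url


if __name__ == "__main__":
    found = {
        "https://a.example/": ["https://a.example/1", "https://shared.example/x"],
        "https://b.example/": ["https://shared.example/x", "https://b.example/2"],
    }
    for seed, url in dedupe_across_seeds(found):
        print(seed, url)
```

The key point is that the `seen` set lives outside the per-seed loop, so a URL reachable from two seeds is emitted only for the first seed that yields it.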
It's useful to expose a param in the spider to only keep articles that match a certain date. This could be as simple as a regex, to match against the...
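A rough sketch of such a regex-based date filter, assuming the `dates` value arrives as a string (either a single date such as `2019-11` or a Python-style list literal, as in the usage above). The helper names below are hypothetical, and since the source is cut off before saying what the regex is matched against, the sketch simply accepts any string; the example uses made-up ISO date strings.

```python
import ast
import re
from typing import List, Pattern


def parse_dates_arg(raw: str) -> List[str]:
    """Turn a `dates` argument string into a list of date fragments.

    Accepts a single value ("2019-11") or a list literal
    ("['2019-09', '2019-10']"). Hypothetical helper, not the spider's own code.
    """
    raw = raw.strip()
    if raw.startswith("["):
        # literal_eval parses the list safely without executing code
        return [str(d) for d in ast.literal_eval(raw)]
    return [raw]


def build_date_pattern(dates: List[str]) -> Pattern[str]:
    """Compile one regex that matches any of the given date fragments."""
    return re.compile("|".join(re.escape(d) for d in dates))


def matches_date(pattern: Pattern[str], text: str) -> bool:
    """Return True if any of the date fragments occurs in `text`."""
    return pattern.search(text) is not None


if __name__ == "__main__":
    pattern = build_date_pattern(parse_dates_arg("['2019-09', '2019-10']"))
    for candidate in ["2019-10-03", "2019-11-01"]:
        print(candidate, matches_date(pattern, candidate))
```

Escaping each fragment with `re.escape` keeps the match literal, so a value like `2019-11` cannot be misread as a pattern.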