autoextract-spiders

Pre-built Scrapy spiders for AutoExtract

Results 5 autoextract-spiders issues

Updates the requirements on [pyyaml](https://github.com/yaml/pyyaml) to permit the latest version. Updates `pyyaml` to 6.0.1. Changelog (sourced from pyyaml's changelog): 6.0.1 (2023-07-18): yaml/pyyaml#702 -- pin Cython build dep to < 3.0...

dependencies

See issue https://github.com/scrapinghub/autoextract-spiders/issues/6

Usage:

> scrapy crawl articles -a seeds=... -a dates=2019-11 ...

Or a list of dates:

> scrapy crawl articles -a seeds=... -a dates=['2019-09', '2019-10'] ...

Any rule...
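The `dates` argument above accepts either a single value or a list. As a rough sketch only, and not the code from this pull request, the raw string could be normalised into a list of date prefixes and compared against an article's date. The helper names `parse_dates` and `matches_dates` are hypothetical, and the sketch assumes the article date is available as an ISO-style string such as `2019-09-14T08:00:00`.

```python
def parse_dates(raw):
    """Normalise a raw `-a dates=...` string into a list of date prefixes.

    Hypothetical helper: accepts "2019-11" as well as list-like strings
    such as "['2019-09', '2019-10']", with or without quotes.
    """
    inner = raw.strip().strip("[]")
    parts = (part.strip().strip("'\"") for part in inner.split(","))
    return [part for part in parts if part]


def matches_dates(published, prefixes):
    """Return True if the date string starts with any requested prefix.

    Assumes `published` is an ISO-style string (an assumption of this sketch).
    """
    return any(published.startswith(prefix) for prefix in prefixes)


if __name__ == "__main__":
    wanted = parse_dates("['2019-09', '2019-10']")
    print(wanted)
    print(matches_dates("2019-09-14T08:00:00", wanted))
```

Splitting the string by hand, rather than evaluating it as a Python literal, keeps the sketch tolerant of shells that strip the quotes before the spider sees the argument.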

When discovering URLs from different seeds, the URLs are not deduplicated if they are found in multiple seeds. There is local de-duplication during discovery, and there's also the built-in DupeFilters....
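To illustrate the idea of de-duplicating across seeds, here is a minimal sketch that shares one "seen" set between all seeds. It does not reflect how the spiders or Scrapy's DupeFilters are actually implemented; the function name `dedupe_across_seeds` and the input shape (a mapping from seed to discovered URLs) are assumptions made for the example.

```python
from urllib.parse import urldefrag


def dedupe_across_seeds(discovered):
    """Keep each URL only under the first seed that discovered it.

    `discovered` maps a seed URL to the list of URLs found from it
    (a hypothetical structure used only for this sketch).
    """
    seen = set()  # shared across all seeds, unlike per-seed de-duplication
    unique = {}
    for seed, urls in discovered.items():
        kept = []
        for url in urls:
            key = urldefrag(url).url  # ignore differences in #fragment
            if key not in seen:
                seen.add(key)
                kept.append(url)
        unique[seed] = kept
    return unique


if __name__ == "__main__":
    found = {
        "https://example.com/news": ["https://example.com/a", "https://example.com/b"],
        "https://example.com/tech": ["https://example.com/b#top", "https://example.com/c"],
    }
    print(dedupe_across_seeds(found))
```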

It's useful to expose a parameter in the spider to keep only articles that match a certain date. This could be as simple as a regex, to match against the...

enhancement