autoextract-spiders issues

Results 5 autoextract-spiders issues

Update pyyaml requirement from <=3.13,>=3.10 to >=3.10,<=6.0.1 in the pip group across 1 directory

Updates the requirements on [pyyaml](https://github.com/yaml/pyyaml) to permit the latest version. Updates `pyyaml` to 6.0.1.

Changelog (sourced from pyyaml's changelog):

6.0.1 (2023-07-18): yaml/pyyaml#702 -- pin Cython build dep to < 3.0...

dependabot[bot]

dependencies

Implemented date filter rules, specified as spider arg

See issue https://github.com/scrapinghub/autoextract-spiders/issues/6

Usage:

> scrapy crawl articles -a seeds=... -a dates=2019-11 ...

Or a list of dates:

> scrapy crawl articles -a seeds=... -a dates=['2019-09', '2019-10'] ...

Any rule...
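Scrapy passes `-a` arguments to the spider as plain strings, so a `dates` value has to be normalized before it can be used. The sketch below shows one way that might look. The helper name `parse_dates_arg` is hypothetical and not taken from the pull request, and it assumes each date is a plain string such as `2019-11`; the actual rule syntax is not shown in the truncated description.

```python
def parse_dates_arg(raw):
    """Turn a raw `dates` spider argument into a list of date strings.

    Hypothetical helper: accepts a single value ("2019-11") or a
    bracketed, comma-separated list ("['2019-09', '2019-10']").
    """
    text = raw.strip()
    # Drop the surrounding brackets of a list-style value, if present.
    if text.startswith("[") and text.endswith("]"):
        text = text[1:-1]
    # Split on commas, then strip whitespace and optional quotes.
    parts = [p.strip().strip("'\"") for p in text.split(",")]
    # Discard empty entries left by stray commas or an empty argument.
    return [p for p in parts if p]


print(parse_dates_arg("2019-11"))
print(parse_dates_arg("['2019-09', '2019-10']"))
```

Splitting by hand rather than evaluating the string keeps the helper tolerant of shells that strip the quotes before the value reaches the spider.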

croqaz

Better de-duplication of URLs

When discovering URLs from different seeds, the URLs are not deduplicated if they are found in multiple seeds. There is local de-duplication during discovery, and there's also the built-in DupeFilters....
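One possible approach to the cross-seed case is a single "seen" set shared by all seeds, keyed on a normalized form of each URL. The sketch below is an illustration only and not the project's implementation. `normalize`, `SeenUrls`, and `discover` are hypothetical names, and the normalization (lower-casing the host, dropping the fragment) is an assumption about what should count as a duplicate.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize(url):
    """Reduce a URL to a comparison key (assumed rules, see lead-in)."""
    parts = urlsplit(url)
    path = parts.path or "/"
    # Lower-case scheme and host, keep path and query, drop the fragment.
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, parts.query, "")
    )


class SeenUrls:
    """A set of normalized URLs shared across every seed."""

    def __init__(self):
        self._seen = set()

    def add(self, url):
        """Record the URL; return True only the first time it is seen."""
        key = normalize(url)
        if key in self._seen:
            return False
        self._seen.add(key)
        return True


def discover(seed_links, seen):
    """Yield (seed, url) pairs, skipping URLs already found via any seed."""
    for seed, links in seed_links.items():
        for url in links:
            if seen.add(url):
                yield seed, url


seed_links = {
    "seed-a": ["https://Example.com/x#top", "https://example.com/y"],
    "seed-b": ["https://example.com/x", "https://example.com/z"],
}
for seed, url in discover(seed_links, SeenUrls()):
    print(seed, url)
```

Here `https://example.com/x` is reported once, under the first seed that finds it, even though both seeds link to it.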

croqaz

Filter extracted articles by date

It's useful to expose a param in the spider to only keep articles that match a certain date. This could be as simple as a regex, to match against the...
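A regex-based filter of this kind could be sketched as follows. This is an illustration under assumptions: `keep_article` is a hypothetical name, and the field the pattern is matched against (`datePublished` here) is a guess, since the description is truncated before it says which value is matched.

```python
import re


def keep_article(article, date_pattern, date_field="datePublished"):
    """Return True if the article's date string matches the pattern.

    `date_field` is an assumed key; articles without it are dropped.
    """
    value = article.get(date_field)
    if not value:
        return False
    # re.match anchors at the start, so "2019-11" matches "2019-11-05...".
    return re.match(date_pattern, value) is not None


articles = [
    {"url": "https://example.com/a", "datePublished": "2019-10-03T08:00:00"},
    {"url": "https://example.com/b", "datePublished": "2019-12-24T10:30:00"},
    {"url": "https://example.com/c"},
]
kept = [a["url"] for a in articles if keep_article(a, r"2019-1[01]")]
print(kept)
```

Dropping articles with no date is a design choice made for this sketch; a real filter might prefer to keep them.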

croqaz

enhancement

It adds Fake UserAgent support

rafaelcapucho
