cdx_toolkit

A toolkit for CDX indices such as Common Crawl and the Internet Archive's Wayback Machine

Results 14 cdx_toolkit issues

```
cdxt --cc --from 2021 --to 2020 -v -v --limit 1 iter https://www.pbm.com/
INFO:cdx_toolkit.cli:set loglevel to DEBUG
DEBUG:cdx_toolkit.myrequests:getting https://index.commoncrawl.org/collinfo.json None
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): index.commoncrawl.org:443
DEBUG:urllib3.connectionpool:https://index.commoncrawl.org:443 "GET /collinfo.json HTTP/1.1"...
```

Hi, thanks for sharing the programming example - https://github.com/cocrawler/cdx_toolkit#programming-example I wanted to ask if there is a way to feed in a list of URLs and retrieve their objects. We...
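One way to approach the question above is to loop over the URLs and call the fetcher's `iter` method once per URL, as the README's programming example does for a single URL. The following is a minimal sketch under that assumption, not the project's own answer. The helper name `first_captures` is hypothetical, and `fetcher` is assumed to be any object with an `iter(url, limit=...)` method, such as the `cdx_toolkit.CDXFetcher(source='cc')` instance from the README.

```python
from typing import Any, Dict, Iterable, List


def first_captures(fetcher: Any, urls: Iterable[str], limit: int = 1) -> Dict[str, List[Any]]:
    """Collect up to `limit` captures for each URL.

    `fetcher` is assumed to expose iter(url, limit=...), as the README's
    programming example shows for cdx_toolkit.CDXFetcher. This helper is
    hypothetical and is not part of cdx_toolkit.
    """
    results: Dict[str, List[Any]] = {}
    for url in urls:
        # One index query per URL; list() drains the iterator so the
        # captures can be inspected after the loop finishes.
        results[url] = list(fetcher.iter(url, limit=limit))
    return results
```

With the real library, each capture object returned by `iter` could then be inspected or fetched as described in the README. Issuing one query per URL keeps the sketch simple, though it may be slow for long lists.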

Added a chapter about caching results using the requests_cache library.

Bumps [requests](https://github.com/psf/requests) from 2.25.1 to 2.32.0. Release notes Sourced from requests's releases. v2.32.0 2.32.0 (2024-05-20) 🐍 PYCON US 2024 EDITION 🐍 Security Fixed an issue where setting verify=False on the...

dependencies

requests has the ability to maintain and reuse TCP sessions across requests. https://requests.readthedocs.io/en/latest/user/advanced/#session-objects

This works: `$ cdxt --cc --limit 1 iter www.pbm.com/* --all-fields` This does not: `$ cdxt --cc --limit 1 iter www.pbm.com/* --all-fields --json`

This PR integrates a couple of general changes from the EOT PR (https://github.com/cocrawler/cdx_toolkit/pull/54):
- Settings variables are loaded from environment variables in `settings.py`
- Common CLI methods are moved to...

There is currently no option to rename an outputted WARC. For example, running the following command creates a file containing one record called `TEST-000000.extracted.warc.gz`:
```bash
cdxt --cc --crawl CC-MAIN-2025-43 --from...
```