cdx_toolkit
A toolkit for CDX indices such as Common Crawl and the Internet Archive's Wayback Machine
This PR adds a GitHub Action that runs the example script against the CC CDX index server every week.
The PR adds two new commands to the CLI:

- `filter_cdx`: Filter CDX files based on a given URL or SURT whitelist.
- `warc_by_cdx`: Fetch WARC files like `warc`...
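To illustrate the idea of filtering by a SURT whitelist, here is a minimal sketch. It is not the PR's implementation: `filter_by_surt_prefix` is a hypothetical helper, and it assumes each index line follows the Common Crawl CDXJ layout of `SURT-key timestamp JSON`, with the whitelist given as SURT prefixes.

```python
from typing import Iterable, Iterator


def filter_by_surt_prefix(lines: Iterable[str], prefixes: Iterable[str]) -> Iterator[str]:
    """Yield only the index lines whose SURT key starts with a whitelisted prefix.

    Hypothetical helper for illustration; assumes the SURT key is the first
    space-separated field of each line.
    """
    allowed = tuple(prefixes)  # str.startswith accepts a tuple of prefixes
    for line in lines:
        surt_key = line.split(" ", 1)[0]
        if surt_key.startswith(allowed):
            yield line


# Made-up sample lines in the assumed "SURT timestamp JSON" layout.
sample_lines = [
    'org,commoncrawl)/get-started 20251014220259 {"status": "200"}',
    'org,example)/ 20251014000000 {"status": "200"}',
]

kept = list(filter_by_surt_prefix(sample_lines, ["org,commoncrawl)/"]))
```

Prefix matching on the SURT key keeps the check to a single string comparison per line, which matters when index files hold many millions of entries.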
Querying the index brings back a `status, timestamp, url` triple, e.g.:

```text
$ cdxt --cc --crawl CC-MAIN-2025-43 iter 'commoncrawl.org/get-started'
status 200, timestamp 20251014220259, url https://www.commoncrawl.org/get-started
status 200, timestamp 20251016192109, url...
```
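If the CLI output needs to be consumed by another script, each line can be split back into its triple. The sketch below assumes the exact `status ..., timestamp ..., url ...` layout shown above, a three-digit status, and a 14-digit timestamp; `Capture`, `parse_capture`, and `capture_time` are names invented for this example, not part of cdx_toolkit.

```python
import re
from datetime import datetime
from typing import NamedTuple, Optional


class Capture(NamedTuple):
    status: int
    timestamp: str
    url: str


# Matches one output line in the assumed "status N, timestamp T, url U" layout.
_LINE = re.compile(r"^status (\d{3}), timestamp (\d{14}), url (\S+)$")


def parse_capture(line: str) -> Optional[Capture]:
    """Parse one output line into a Capture, or return None if it does not match."""
    match = _LINE.match(line.strip())
    if match is None:
        return None
    return Capture(int(match.group(1)), match.group(2), match.group(3))


def capture_time(capture: Capture) -> datetime:
    """Convert the 14-digit timestamp (YYYYMMDDhhmmss) to a naive datetime."""
    return datetime.strptime(capture.timestamp, "%Y%m%d%H%M%S")


sample = "status 200, timestamp 20251014220259, url https://www.commoncrawl.org/get-started"
capture = parse_capture(sample)
```

Lines that do not match, such as the shell prompt line, come back as `None`, so callers can skip them.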
This affects Common Crawl. The current algorithm guesses the dates and gets them wrong by a week or two.