cdx_toolkit

A toolkit for CDX indices such as Common Crawl and the Internet Archive's Wayback Machine

Results 14 cdx_toolkit issues

This PR adds a GitHub Action that runs the example script of the CC CDX index server every week.

The PR adds two new commands to the CLI:

- `filter_cdx`: Filter CDX files based on a given URL or SURT white list.
- `warc_by_cdx`: Fetch WARC files like `warc`...

Querying the index brings back a `status, timestamp, url` triple, e.g.:

```text
$ cdxt --cc --crawl CC-MAIN-2025-43 iter 'commoncrawl.org/get-started'
status 200, timestamp 20251014220259, url https://www.commoncrawl.org/get-started
status 200, timestamp 20251016192109, url...
```
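For readers who want to post-process that output, here is a minimal sketch of parsing one printed line back into a typed triple. It assumes the `status N, timestamp T, url U` layout shown in the excerpt; the `Capture` type and `parse_capture` helper are hypothetical names for illustration and are not part of cdx_toolkit.

```python
import re
from datetime import datetime
from typing import NamedTuple


class Capture(NamedTuple):
    """Hypothetical container for one printed index record."""
    status: int
    timestamp: str  # 14-digit capture time, YYYYMMDDhhmmss
    url: str


# Assumed line layout, taken from the excerpt: "status N, timestamp T, url U"
_LINE = re.compile(r"^status (\d{3}), timestamp (\d{14}), url (\S+)$")


def parse_capture(line: str) -> Capture:
    """Parse one output line into a Capture, or raise ValueError."""
    match = _LINE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognized line: {line!r}")
    return Capture(int(match.group(1)), match.group(2), match.group(3))


capture = parse_capture(
    "status 200, timestamp 20251014220259, "
    "url https://www.commoncrawl.org/get-started"
)
# Convert the 14-digit timestamp into a datetime for comparisons.
capture_time = datetime.strptime(capture.timestamp, "%Y%m%d%H%M%S")
```

Keeping the timestamp as a string and converting it only when needed preserves the exact value that appears in the index.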

This affects Common Crawl. The current algorithm guesses the dates and gets them wrong by a week or two.