cdx_toolkit issues

Results: 14 cdx_toolkit issues

feat: Adding GitHub Action to check CC's CDX index server every week

This PR adds a GitHub Action that runs the example script for the CC CDX index server every week.

feat: Adding filter_cdx and warc_by_cdx commands (2)

The PR adds two new commands to the CLI:

- `filter_cdx`: Filter CDX files based on a given URL or SURT white list.
- `warc_by_cdx`: Fetch WARC files like `warc`...
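As a rough illustration of what filtering by a SURT white list can look like, here is a minimal sketch. It is not the PR's implementation: the function name `filter_cdx_lines`, the sample lines, and the assumption that each CDX line begins with its SURT key followed by whitespace are all hypothetical.

```python
from typing import Iterable, Iterator


def filter_cdx_lines(lines: Iterable[str], surt_prefixes: Iterable[str]) -> Iterator[str]:
    """Yield only the CDX lines whose SURT key starts with a whitelisted prefix.

    Assumption: the first whitespace-separated field of each line is the SURT key.
    """
    prefixes = tuple(surt_prefixes)
    for line in lines:
        fields = line.split(None, 1)
        # str.startswith accepts a tuple and returns False for an empty tuple
        if fields and fields[0].startswith(prefixes):
            yield line


# Hypothetical sample lines, for illustration only
sample = [
    'org,commoncrawl)/get-started 20251014220259 {"status": "200"}',
    'com,example)/ 20251014220300 {"status": "200"}',
]
kept = list(filter_cdx_lines(sample, ["org,commoncrawl)/"]))
```

Prefix matching on the SURT key keeps the check to a single string comparison per line; a URL white list would first need converting to SURT form, which this sketch leaves out.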

malteos

enhancement: select crawl by exact timestamp

Querying the index brings back a `status, timestamp, url` triple, e.g.:

```text
$ cdxt --cc --crawl CC-MAIN-2025-43 iter 'commoncrawl.org/get-started'
status 200, timestamp 20251014220259, url https://www.commoncrawl.org/get-started
status 200, timestamp 20251016192109, url...
```
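To make the triple concrete, the sketch below parses one output line of that shape and picks the record matching an exact timestamp. The helper names `parse_iter_line` and `select_exact` are hypothetical and not part of cdx_toolkit, and treating the 14-digit timestamp as UTC is an assumption.

```python
import re
from datetime import datetime, timezone

# Matches lines shaped like: "status 200, timestamp 20251014220259, url https://..."
LINE_RE = re.compile(r"status (\d+), timestamp (\d{14}), url (\S+)")


def parse_iter_line(line):
    """Split one output line into (status, capture time, url)."""
    match = LINE_RE.fullmatch(line.strip())
    if match is None:
        raise ValueError(f"unrecognised line: {line!r}")
    status, timestamp, url = match.groups()
    # Assumption: the 14-digit timestamp is YYYYMMDDhhmmss in UTC
    captured = datetime.strptime(timestamp, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)
    return int(status), captured, url


def select_exact(records, timestamp):
    """Return the first record captured at exactly the given 14-digit timestamp, else None."""
    wanted = datetime.strptime(timestamp, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)
    return next((record for record in records if record[1] == wanted), None)


record = parse_iter_line(
    "status 200, timestamp 20251014220259, url https://www.commoncrawl.org/get-started"
)
```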

laurieburchell

Use from and to dates from collinfo.json if present

This affects Common Crawl. The current algorithm guesses the dates and gets them wrong by a week or two.
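The idea can be sketched as follows: prefer the `from` and `to` fields when a collinfo.json entry carries them, and signal the caller to fall back to guessing otherwise. The function name `crawl_date_range`, the ISO 8601 value format, and the sample dates are assumptions made for illustration, not taken from the issue.

```python
from datetime import datetime
from typing import Optional, Tuple


def crawl_date_range(entry: dict) -> Optional[Tuple[datetime, datetime]]:
    """Return (start, end) from a collinfo entry, or None if the fields are absent."""
    start, end = entry.get("from"), entry.get("to")
    if not start or not end:
        return None  # caller falls back to its existing date guess
    # Assumption: the values are ISO 8601 strings without a trailing "Z"
    return datetime.fromisoformat(start), datetime.fromisoformat(end)


# Hypothetical entries; the dates are invented for the example
with_dates = {"id": "CC-MAIN-2025-43", "from": "2025-10-05T00:00:00", "to": "2025-10-19T00:00:00"}
without_dates = {"id": "CC-MAIN-2025-43"}

explicit = crawl_date_range(with_dates)
fallback = crawl_date_range(without_dates)
```

Returning `None` rather than raising keeps older collinfo.json files, which may lack the fields, on the existing code path.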

wumpus
