cdxj-indexer
CDXJ Indexing of WARC/ARCs
`cdxj-indexer` is impacted by two `DeprecationWarning`s from upstream: https://github.com/StdCarrot/Py3AMF/issues/19. Probably no impact yet in Python 3.12, and none foreseen in 3.13, but always good to know ^^
The codebase needs to be adapted now that the `cgi` module is deprecated (since Python 3.11) and slated for removal in 3.13.
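PEP 594 points to the `email` package as a replacement for `cgi.parse_header`. A minimal sketch of that migration, assuming the affected code parses `Content-Type`-style header values (which `cgi` functions the codebase actually calls is not stated here, and `parse_header_compat` is a hypothetical helper name):

```python
from email.message import Message


def parse_header_compat(value):
    """Split a header value into (main_value, params) without the cgi module."""
    msg = Message()
    msg["content-type"] = value
    # get_params() returns [(main_value, ''), (key, val), ...] with quotes removed.
    params = msg.get_params()
    main_value = params[0][0]
    return main_value, dict(params[1:])


print(parse_header_compat("application/http; msgtype=response"))
```

Unlike `cgi.parse_header`, this does not lowercase parameter names, so callers that rely on that behavior would need to normalize the keys themselves.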
We've found some weird WARCs, looking like this:
```
WARC/1.0
WARC-Type: response
WARC-Target-URI: http://www.estiethirionphotography.co.za/2011/10/fransua-anne-louise-wedding/feed/
WARC-Date: 2017-09-19T03:35:35Z
WARC-IP-Address: 176.58.112.27
WARC-Payload-Digest: sha1:ZQZJUQJW34BYM2R23SI7PDFMYFUTXGVU
WARC-Record-ID:
Content-Type: application/http; msgtype=response
Content-Length: 7026

19/Sep/2017:03:35:35 +0000|v1|40.77.167.54|www.mobyaffiliates.com|200|17922|35.197.249.238:80|0.019|0.019|GET /wp-content/uploads/2015/05/i6d2e3jOCVVc-e1432221090328.jpg HTTP/1.1||...
```
When using the cdxj-indexer on a webpage that contains multiple different HTTP POST requests with the same response, the cdxj-indexer will only append the URL for the response record. This...
Hi, we are using the cdxj-indexer tool and found that when replaying our WACZ files in the ReplayWeb.page player, certain resources were sometimes not found, even though they were present in...
Hello! Apologies if this is a silly question, but I'm wondering if cdxj-indexer has the ability to generate a list of pages (and potentially their titles) from a WARC file?...
## Describe the bug
All processing stops when there is a malformed URL.
## Steps to reproduce the bug
For the URL "http://eosims.asf.alaska.edu:12355.edu:80/" the cdxj-indexer returns: Traceback (most recent call...
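One way to keep a single bad URL from halting a whole run is to check each URL first and skip the ones that fail. A minimal sketch using only the standard library; `is_parseable_url` and `filter_urls` are hypothetical helpers, not part of cdxj-indexer, and the exception in the truncated traceback above may originate somewhere else:

```python
from urllib.parse import urlsplit


def is_parseable_url(url):
    """Return True if the URL splits cleanly and its port (if any) is numeric."""
    try:
        parts = urlsplit(url)
        parts.port  # accessing .port raises ValueError for a non-numeric port
    except ValueError:
        return False
    return True


def filter_urls(urls):
    """Yield parseable URLs; report and skip the rest instead of aborting."""
    for url in urls:
        if is_parseable_url(url):
            yield url
        else:
            print(f"skipping malformed URL: {url}")
```

For example, `list(filter_urls([...]))` over a mix of URLs keeps the well-formed ones and prints a message for the URL from this report, whose port component is `12355.edu:80`.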
A few feature requests and/or requests for help using cdxj-indexer! Also, my timing is good: based on the reply by @ikreymer in another issue, it seems we're both coming back to...
Similar to the wayback indexer, this indexer doesn't produce a sorted CDX file, so when you try to use the output with pywb, it fails to find links correctly. Just wondering...
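pywb expects index lines in byte order (the order `LC_ALL=C sort` produces). A minimal sketch of sorting an existing index file after the fact, assuming it fits in memory; `sort_cdxj` is a hypothetical helper and the paths are placeholders:

```python
def sort_cdxj(src_path, dst_path):
    """Write the non-empty lines of src_path to dst_path in byte-wise order."""
    with open(src_path, "rb") as src:
        # Reading as bytes makes the sort compare raw byte values,
        # independent of the current locale.
        lines = [line for line in src if line.strip()]
    # Ensure every line, including the last, ends with a newline.
    lines = [line if line.endswith(b"\n") else line + b"\n" for line in lines]
    lines.sort()
    with open(dst_path, "wb") as dst:
        dst.writelines(lines)
```

Calling `sort_cdxj("index.cdxj", "index.sorted.cdxj")` would leave the input untouched and write the sorted copy alongside it.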
We've run into two issues while trying to recompress and re-index some of our older ARCs. 1): When running `warcio recompress IQ04-CRAWL-16-20041020093524-00141-crawling003.archive.org.arc.gz` we get: ```bash IQ04-CRAWL-16-20041020093524-00141-crawling003.archive.org.arc.gz could not be read...