cdxj-indexer issues

DeprecationWarnings in `pyamf`

`cdxj` is impacted by two DeprecationWarnings from upstream: https://github.com/StdCarrot/Py3AMF/issues/19 Probably no impact yet in Python 3.12, and none foreseen in 3.13, but always good to know ^^

benoit74

'cgi' is deprecated and slated for removal in Python 3.13

1

The codebase needs to be adapted now that `cgi` has been deprecated since Python 3.11 and is slated for removal in 3.13.

benoit74
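For the `cgi` deprecation above, one commonly suggested replacement for `cgi.parse_header` is the standard library's `email.message.Message`. The sketch below is only an illustration under that assumption: the helper name `parse_content_type` is hypothetical, and which `cgi` functions the codebase actually relies on is not stated in the issue.

```python
from email.message import Message


def parse_content_type(value):
    """Split a Content-Type header into (mime type, parameter dict).

    Hypothetical helper: a stand-in for cgi.parse_header built on
    email.message, which remains available after cgi is removed.
    """
    msg = Message()
    msg["Content-Type"] = value
    # get_params() returns [(mime, ''), (param, value), ...];
    # the first entry is the mime type itself, so it is skipped here.
    params = dict(msg.get_params()[1:])
    return msg.get_content_type(), params


mime, params = parse_content_type("multipart/form-data; boundary=abc")
```

Multipart body parsing (as opposed to header parsing) would need a separate replacement and is not covered by this sketch.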

Ways of handling problematic WARC records

1

We've found some weird WARCs, looking like this:

```
WARC/1.0
WARC-Type: response
WARC-Target-URI: http://www.estiethirionphotography.co.za/2011/10/fransua-anne-louise-wedding/feed/
WARC-Date: 2017-09-19T03:35:35Z
WARC-IP-Address: 176.58.112.27
WARC-Payload-Digest: sha1:ZQZJUQJW34BYM2R23SI7PDFMYFUTXGVU
WARC-Record-ID:
Content-Type: application/http; msgtype=response
Content-Length: 7026

19/Sep/2017:03:35:35 +0000|v1|40.77.167.54|www.mobyaffiliates.com|200|17922|35.197.249.238:80|0.019|0.019|GET /wp-content/uploads/2015/05/i6d2e3jOCVVc-e1432221090328.jpg HTTP/1.1||...
```

anjackson
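In the record quoted above, the block is declared as `application/http; msgtype=response`, yet the payload starts with what looks like an access-log line rather than an HTTP status line. One possible way to flag such records before indexing is a cheap check on the first bytes of the payload. This is a minimal sketch, not the indexer's behaviour; the function name `looks_like_http_response` is hypothetical.

```python
import re

# An HTTP response payload is expected to begin with a status line
# such as "HTTP/1.1 200 OK" (or "HTTP/2 200").
STATUS_LINE = re.compile(rb"^HTTP/\d(\.\d)? \d{3}( |\r?\n|$)")


def looks_like_http_response(payload: bytes) -> bool:
    """Return True if the payload starts with an HTTP status line."""
    return STATUS_LINE.match(payload) is not None
```

A caller could skip, log, or index such records as non-HTTP content; which of those is appropriate is a policy decision the issue leaves open.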

Revisit records with POST requests lack a POST append in their URL key

When using the cdxj-indexer on a webpage that contains multiple different HTTP POST requests with the same response, the cdxj-indexer will only append the URL for the response record. This...

ARiedijk
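To illustrate what a "POST append" in a URL key means in the issue above: pywb-style tools distinguish POST captures by appending the request method and the request body's query string to the URL before the key is computed. The sketch below assumes the `__wb_method` parameter convention and a hypothetical helper name `append_post_query`; it is not the indexer's actual code.

```python
from urllib.parse import urlsplit


def append_post_query(url, method, body_query):
    """Append the request method and body query to a URL.

    Assumption: the '__wb_method' parameter name follows the pywb-style
    convention; body_query is the already url-encoded request body.
    """
    # Use '&' when the URL already carries a query string, '?' otherwise.
    sep = "&" if urlsplit(url).query else "?"
    return f"{url}{sep}__wb_method={method.upper()}&{body_query}"


key_url = append_post_query("http://example.com/api", "post", "a=1")
```

With this scheme, two POST requests to the same URL with different bodies produce different keys, which is what the issue reports as missing for revisit records.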

SURT are not created for HTTP CONNECT requests in WARC file

Hi, we are using this cdx-indexer tool and found out that while replaying our WACZ files in the ReplayWeb.page player, sometimes certain resources were not found, while they were present in...

ARiedijk

Extracting page titles / URLs from cdxj

1

Hello! Apologies if this is a silly question, but I'm wondering if cdxj-indexer has the ability to generate a list of pages (and potentially their titles) from a warc file?...

jakebickford
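For the question above, a rough list of page URLs can be derived from CDXJ output by keeping only entries that look like successfully fetched HTML. The sketch below assumes each line has the form `<key> <timestamp> <json>` and that the JSON block carries `url`, `mime`, and `status` fields; the helper name `page_urls` is hypothetical. Page titles are not stored in the index, so they would have to come from the WARC payloads and are not handled here.

```python
import json


def page_urls(cdxj_lines):
    """Return (timestamp, url) pairs for entries that look like HTML pages."""
    pages = []
    for line in cdxj_lines:
        line = line.strip()
        if not line:
            continue
        # Split into key, timestamp, and the trailing JSON block.
        _key, timestamp, raw = line.split(" ", 2)
        fields = json.loads(raw)
        # Status may be stored as a string or a number, so compare as text.
        if fields.get("mime") == "text/html" and str(fields.get("status")) == "200":
            pages.append((timestamp, fields.get("url")))
    return pages
```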

Problem when URL is malformed

## Describe the bug
All processing stops when there is a malformed url.
## Steps to reproduce the bug
For the url "http://eosims.asf.alaska.edu:12355.edu:80/" the cdxj-indexer returns: Traceback (most recent call...

PedroG1515
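One way to keep a single malformed URL from stopping a whole run, as described above, is to catch the failure per record and continue. The sketch below is generic: `make_key` stands for whatever function turns a URL into an index key (hypothetical here, since the traceback in the report is truncated and does not show which call raised).

```python
def index_urls(urls, make_key):
    """Build (key, url) entries, collecting failures instead of aborting.

    make_key is a caller-supplied function (hypothetical placeholder)
    that may raise on malformed URLs.
    """
    entries, skipped = [], []
    for url in urls:
        try:
            entries.append((make_key(url), url))
        except Exception as exc:
            # Record the offending URL and the error, then keep going.
            skipped.append((url, repr(exc)))
    return entries, skipped
```

Returning the skipped URLs alongside the entries lets the caller report them at the end rather than losing them silently.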

Feature Requests / questions on use --> Pipe, Readme

2

Few Feature requests and/or requests for help using cdxj-indexer! --> Also, my timing is good based on the reply by @ikreymer in another issue, seems we're both coming back to...

jwest75674

CDX files generated are not sorted

3

Similar to the wayback indexer, this indexer doesn't produce a sorted CDX file so when you try to use it on pywb it fails to find links correctly. Just wondering...

thomaspreece
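Replay tools that binary-search an index expect its lines in byte order, which is why an unsorted file, as reported above, leads to failed lookups. As a minimal sketch, already generated lines can be sorted after the fact; the helper name `sort_cdx_lines` is hypothetical, and sorting by UTF-8 bytes is assumed to match what `LC_ALL=C sort` would produce.

```python
def sort_cdx_lines(lines):
    """Return index lines sorted by their UTF-8 byte representation."""
    return sorted(lines, key=lambda line: line.encode("utf-8"))


unsorted_lines = [
    "com,example)/b 20200101000000 {}",
    "com,example)/a 20210101000000 {}",
    "com,example)/a 20200101000000 {}",
]
sorted_lines = sort_cdx_lines(unsorted_lines)
```

For files too large to hold in memory, an external sort would be needed instead.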

Recompress and Re-indexing Errors

We've run into two issues while trying to recompress and re-index some of our older ARCs. 1): When running `warcio recompress IQ04-CRAWL-16-20041020093524-00141-crawling003.archive.org.arc.gz` we get: ```bash IQ04-CRAWL-16-20041020093524-00141-crawling003.archive.org.arc.gz could not be read...

logpanic
