cdxj-indexer

Recompress and Re-indexing Errors

Open logpanic opened this issue 5 years ago • 0 comments

We've run into two issues while trying to recompress and re-index some of our older ARCs.

1): When running warcio recompress IQ04-CRAWL-16-20041020093524-00141-crawling003.archive.org.arc.gz we get:

IQ04-CRAWL-16-20041020093524-00141-crawling003.archive.org.arc.gz could not be read as a WARC or ARC

Could anyone explain what's going on here or suggest a possible workaround?
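To help narrow this down, here is a minimal diagnostic sketch that uses only the Python standard library, not warcio itself. It checks whether the file starts with the gzip magic bytes and whether the first decompressed bytes look like an ARC version block (`filedesc://`) or a WARC record (`WARC/`). The function name `sniff_archive` and the returned dictionary layout are hypothetical, chosen only for illustration, and the check is a rough guess at the file type rather than a validation of the whole file.

```python
import gzip
import zlib

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of any gzip member


def sniff_archive(path, peek=64):
    """Roughly classify `path` as 'arc', 'warc', 'unknown', or 'unreadable'.

    Hypothetical helper: it inspects only the first `peek` bytes of the
    (possibly gzip-compressed) file, so it cannot detect damage further in.
    """
    with open(path, "rb") as fh:
        is_gzip = fh.read(2) == GZIP_MAGIC

    opener = gzip.open if is_gzip else open
    try:
        with opener(path, "rb") as fh:
            start = fh.read(peek)
    except (OSError, EOFError, zlib.error) as exc:
        # A bad gzip header, truncated stream, or corrupt deflate data lands here.
        return {"gzip": is_gzip, "kind": "unreadable", "error": str(exc)}

    if start.startswith(b"filedesc://"):
        kind = "arc"      # ARC files begin with a filedesc:// version block
    elif start.startswith(b"WARC/"):
        kind = "warc"     # WARC records begin with a WARC/<version> line
    else:
        kind = "unknown"
    return {"gzip": is_gzip, "kind": kind, "start": start}


# Usage (path taken from the report above):
# print(sniff_archive("IQ04-CRAWL-16-20041020093524-00141-crawling003.archive.org.arc.gz"))
```

If this reports `unknown` or `unreadable`, the `start` or `error` field may show whether the file has an unexpected header or a damaged gzip stream.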

2): For some of the ARCs that are successfully recompressed, we get this error after running the cdxj-indexer:

UnicodeEncodeError: 'ascii' codec can't encode character '\xed' in position 403: ordinal not in range(128)

We've hand-checked a few of these ARCs, and the offending resource always seems to be a binary image. Any suggestions on how to move forward? I can also post the first error in the warcio repository if that's more appropriate.
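For reference, the same `UnicodeEncodeError` can be reproduced with the standard library alone, which suggests (though we have not confirmed this for cdxj-indexer) that the output stream is being opened with an ASCII encoding, for example under a `C` locale. The sketch below writes a line containing `'\xed'` to an in-memory text stream, first as ASCII and then as UTF-8. The helper `write_line` and the sample index line are made up for illustration and are not taken from our data.

```python
import io


def write_line(line, encoding):
    """Write `line` to an in-memory text stream and return the encoded bytes.

    Hypothetical helper standing in for whatever stream the indexer writes to.
    """
    raw = io.BytesIO()
    out = io.TextIOWrapper(raw, encoding=encoding)
    out.write(line)
    out.flush()
    return raw.getvalue()


# Made-up index line containing the non-ASCII character from the error message.
line = 'com,example)/im\xedg.gif 20041020093524 {"url": "http://example.com/im\xedg.gif"}\n'

try:
    write_line(line, "ascii")
except UnicodeEncodeError as exc:
    # Same class of failure as in the report: the 'ascii' codec rejects '\xed'.
    print("ascii stream failed:", exc)

# With a UTF-8 stream the same line is written without error.
data = write_line(line, "utf-8")
print(len(data), "bytes written as UTF-8")
```

If that is the cause, setting the environment variable `PYTHONIOENCODING=utf-8` or running under a UTF-8 locale might work around it when the index is written to standard output, but that is an assumption on our part.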

logpanic, Jan 30 '20 19:01